Ultimate Site Parser

⏺ 🎯 ULTIMATE SITE PARSER OVERVIEW

The Ultimate Site Import → Edit → Export Pipeline


🌊 The Complete Flow

┌────────────────────────────────────────────────────┐
│                ULTIMATE SITE PARSER                │
│    Import ANY website → Convert to ACCID format    │
└────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────┐
│                 ACCID HTML BUILDER                 │
│   Edit visually → Add content → Update metadata    │
└────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────┐
│                 STATIC HTML EXPORT                 │
│  ZIP download → FTP upload → Fully portable sites  │
└────────────────────────────────────────────────────┘


1️⃣ ULTIMATE SITE PARSER – The Importer

What It Does

Imports ANY website and converts it to editable ACCID format

Core Components

A. Content Discovery (core.py:30-130)

Finds ALL pages on a site using multiple methods:

METHOD 1: Sitemap XML

sitemap_urls = get_sitemap_urls() # Parses sitemap.xml, handles indexes

Result: 100+ URLs from sitemap

METHOD 2: WordPress REST API

api_urls = get_wordpress_api_urls() # /wp-json/wp/v2/posts, pages

Result: All posts/pages with titles

METHOD 3: RSS/Atom Feeds

feed_urls = get_feed_urls() # RSS, Atom feeds

Result: Blog posts with metadata

FILTER: Remove junk

valid_urls = [url for url in all_urls if is_valid_content_url(url)]

Removes: /category/, /tag/, /author/, /feed/, etc.

Result: Clean list of actual content pages
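
A minimal sketch of what that junk filter could look like. The patterns and names (JUNK_SEGMENTS, ASSET_SUFFIXES) are illustrative assumptions; the actual checks in core.py may differ:

from urllib.parse import urlparse

# Illustrative junk patterns; the real filter in core.py may check more.
JUNK_SEGMENTS = ('/category/', '/tag/', '/author/', '/feed/', '/wp-json/')
ASSET_SUFFIXES = ('.jpg', '.png', '.gif', '.pdf', '.xml', '.css', '.js')

def is_valid_content_url(url: str) -> bool:
    """Keep only URLs that look like real content pages."""
    path = urlparse(url).path.lower()
    if any(segment in path for segment in JUNK_SEGMENTS):
        return False
    if path.endswith(ASSET_SUFFIXES):
        return False
    return True

all_urls = [
    'https://example.com/blog/my-post/',
    'https://example.com/category/news/',
    'https://example.com/feed/',
]
valid_urls = [url for url in all_urls if is_valid_content_url(url)]
# -> ['https://example.com/blog/my-post/']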

B. Framework Detection (core.py:429-496)

Automatically detects 20+ platforms:

if 'elementor' in html:
    framework = 'wordpress-elementor'
elif 'et_pb_' in html:
    framework = 'wordpress-divi'
elif 'wix.com' in html:
    framework = 'wix'
elif 'NEXT_DATA' in html:
    framework = 'nextjs'
# ... 16+ more platforms

Supports:

  • WordPress builders: Elementor, Divi, WPBakery, Beaver, Bricks, Gutenberg
  • SaaS platforms: Wix, Squarespace, Weebly, Shopify, Webflow
  • JS frameworks: Next.js, React, Vue, Angular, Gatsby, Astro
  • Other CMS: Joomla, Drupal
  • Legacy: FrontPage, Flash
  • Fallback: Generic HTML
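
Putting the checks above together, detection can be structured as a simple fingerprint table. The sketch below is an assumption about the shape of the code, not the actual implementation in core.py:429-496, and the 'generic-html' fallback name is illustrative:

# Fingerprint table; only the markers from the snippet above are shown.
FRAMEWORK_MARKERS = [
    ('elementor', 'wordpress-elementor'),
    ('et_pb_', 'wordpress-divi'),
    ('wix.com', 'wix'),
    ('NEXT_DATA', 'nextjs'),
    # ... 16+ more platforms
]

def detect_framework(html: str) -> str:
    """Return the first matching framework, falling back to generic HTML."""
    for marker, framework in FRAMEWORK_MARKERS:
        if marker in html:
            return framework
    return 'generic-html'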

C. Content Cleaning (core.py:322-401)

Strips all junk before parsing:

Remove tracking scripts

for script in soup.find_all('script'):
    src = script.get('src', '') or ''  # inline scripts have no src attribute
    if 'google-analytics' in src or 'facebook.net' in src:
        script.decompose()

Remove ads, social widgets, cookie notices, chat widgets

Remove nav/header/footer, sidebars, comments, related posts

Result: Pure content only
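
A hedged sketch of that cleaning pass, assuming BeautifulSoup and an illustrative selector list (the real code in core.py:322-401 covers more cases):

from bs4 import BeautifulSoup

# Illustrative selectors; core.py strips more than this.
JUNK_SELECTORS = ['nav', 'header', 'footer', 'aside',
                  '.sidebar', '.comments', '.related-posts', '.cookie-notice']

def clean_content(html: str) -> BeautifulSoup:
    soup = BeautifulSoup(html, 'html.parser')
    # Drop tracking scripts by src
    for script in soup.find_all('script'):
        src = script.get('src', '') or ''
        if 'google-analytics' in src or 'facebook.net' in src:
            script.decompose()
    # Drop structural chrome and widgets
    for selector in JUNK_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    return soup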

D. Parser System (parsers/*.py)

Routes to framework-specific parsers:

Registry-based system

@register_parser("wordpress-elementor")
def parse_elementor(soup):
    modules = []
    for section in soup.find_all(class_='elementor-section'):
        # Extract content
        # Create ACCID modules
        modules.append({
            'id': 'text-123-abc',
            'type': 'text',
            'content': '<p>...</p>',
            'layoutClass': 'body-text'
        })
    return modules
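
One plausible shape for the registry behind @register_parser. The names PARSERS and parse_page are assumptions for illustration, not the actual API of the parsers/ package:

# Assumed registry shape; the real package may wire this up differently.
PARSERS = {}

def register_parser(framework):
    """Decorator that maps a framework name to its parser function."""
    def decorator(func):
        PARSERS[framework] = func
        return func
    return decorator

@register_parser("generic")
def parse_generic(soup):
    return [{'id': 'text-1', 'type': 'text', 'content': soup.get_text(strip=True)}]

def parse_page(framework, soup):
    """Route to the framework-specific parser, falling back to the generic one."""
    parser = PARSERS.get(framework, PARSERS["generic"])
    return parser(soup)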

Parsers available:

  • generic.py – Fallback for any HTML

  • wordpress_elementor.py – Elementor builder

  • wordpress_divi.py – Divi builder

  • wordpress_gutenberg.py – Gutenberg blocks

  • wix.py – Wix sites

  • shopify.py – Shopify stores

  • squarespace.py – Squarespace sites

  • react_like.py – React/Next.js apps

  • …and more

E. Metadata Extraction (meta_extractor.py)

Extracts ALL SEO data from original site:

def extract_page_meta(url, html, framework):
    soup = BeautifulSoup(html, 'html.parser')

    meta = {
        'seo': {
            'title': extract_meta('og:title') or extract_title(),
            'description': extract_meta('description')
        },
        'author': extract_meta('author'),
        'date': extract_meta('date') or extract_from_schema(),
        'excerpt': extract_meta('description'),
        'tags': extract_keywords(),
        'categories': extract_from_schema(),
        'featuredImage': extract_meta('og:image')
    }
    return meta

Extracts:

  • SEO title & description
  • Author & publication date
  • Keywords β†’ tags
  • Categories (from structured data)
  • Featured images (Open Graph)
  • All existing SEO preserved!
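
The extract_meta() / extract_title() helpers used above could look roughly like this. This is a sketch assuming BeautifulSoup (and passing soup explicitly, where the real meta_extractor.py presumably closes over it):

from bs4 import BeautifulSoup

# Sketch only; the actual helpers in meta_extractor.py may differ.
def extract_meta(soup: BeautifulSoup, name: str):
    """Read a <meta> value by name= or property= (Open Graph tags use property=)."""
    tag = (soup.find('meta', attrs={'name': name})
           or soup.find('meta', attrs={'property': name}))
    return tag.get('content', '').strip() if tag and tag.get('content') else None

def extract_title(soup: BeautifulSoup):
    return soup.title.get_text(strip=True) if soup.title else None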

F. Scam Detection (scam_detection.py)

BONUS: Analyzes sites for red flags:

class ScamDetector:
    def analyze_site(url, html, framework):
        # Check domain age (< 30 days = critical)
        # Check contact info (none = critical)
        # Check SSL certificate (HTTP = high risk)
        # Check payment methods (wire transfer = critical)
        # Check testimonials (stock photos = high risk)
        # Check legal pages (no privacy policy = high)

        return {
            'verdict': 'LIKELY SCAM' | 'SUSPICIOUS' | 'LEGITIMATE',
            'scam_score': 0-100,
            'flags': [...]
        }

Protects users from importing scam sites!
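
The verdict/score mapping could be implemented as a simple weighted sum of flags. The weights, thresholds, and flag shape below are illustrative assumptions, not the actual values in scam_detection.py:

# Illustrative weights and thresholds only.
FLAG_WEIGHTS = {'critical': 30, 'high': 15, 'medium': 5}

def score_flags(flags):
    """Sum weighted flags (dicts with a 'severity' key) and map the total to a verdict."""
    score = min(100, sum(FLAG_WEIGHTS.get(flag['severity'], 0) for flag in flags))
    if score >= 70:
        verdict = 'LIKELY SCAM'
    elif score >= 30:
        verdict = 'SUSPICIOUS'
    else:
        verdict = 'LEGITIMATE'
    return {'verdict': verdict, 'scam_score': score, 'flags': flags}

# Example: no contact info + wire-transfer-only payments + stock-photo testimonials
score_flags([{'severity': 'critical'}, {'severity': 'critical'}, {'severity': 'high'}])
# -> {'verdict': 'LIKELY SCAM', 'scam_score': 75, ...}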

G. Export to ACCID Format (core.py:522-615)

project = {
    'pages': {
        'home': {
            'title': 'Home Page',
            'slug': 'index.html',
            'modules': [
                {
                    'id': 'text-123',
                    'type': 'text',
                    'content': '<h1>Welcome</h1>',
                    'layoutClass': 'heading-xl'
                },
                {
                    'id': 'image-456',
                    'type': 'image',
                    'src': 'https://...',
                    'alt': 'Hero image'
                }
            ],
            'meta': {
                'seo': {
                    'title': '...',
                    'description': '...'
                },
                'author': 'Original Author',
                'date': '2024-01-15',
                'tags': ['keyword1', 'keyword2'],
                'categories': ['Category A'],
                'featuredImage': 'https://...'
            }
        },
        'about': { ... },
        'contact': { ... }
    },
    'metadata': {
        'framework': 'wordpress-elementor',
        'total_pages': 15,
        'imported_at': '2025-01-15T10:30:00'
    }
}

Save to JSON

with open('imported_site.json', 'w') as f:
    json.dump(project, f, indent=2)
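
As a quick usage example, any downstream tool (the builder included) can re-load the exported project straight from that JSON file:

import json

# Round-trip check: the exported project is plain JSON.
with open('imported_site.json') as f:
    project = json.load(f)

print(project['metadata']['framework'], len(project['pages']), 'pages')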

Facebook Export Parser (facebook_export.py)

BONUS: Import Facebook data exports!

class FacebookExportParser:
    def parse_full_export(export_folder):
        # Parse posts.json    → Timeline
        # Parse photos/       → Gallery
        # Parse comments.json → Engagement stats
        # Parse friends.json  → Network

        return {
            'pages': {
                'timeline': {'modules': [...]},
                'photos': {'modules': [...]},
                'stats': {'modules': [...]}
            }
        }

Creates 3 pages from Facebook export:

  1. Timeline - Chronological posts
  2. Photo Gallery - All photos
  3. Stats & Insights - Engagement metrics
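
A minimal sketch of how the Timeline page's data might be read from the export folder, assuming a posts.json file with timestamp fields (real Facebook export layouts vary by version, so the path and field names are assumptions):

import json
from pathlib import Path

def load_timeline_posts(export_folder):
    """Read posts.json from the export and order posts newest-first for the Timeline."""
    posts_file = Path(export_folder) / 'posts.json'
    with open(posts_file, encoding='utf-8') as f:
        posts = json.load(f)
    return sorted(posts, key=lambda post: post.get('timestamp', 0), reverse=True)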

2️⃣ ACCID HTML BUILDER – The Editor

What It Does

Visual editor for imported (or new) sites with professional tools

Core Components

A. Page Management (creator/page-manager.js)

Multi-page website builder:

class PageManager {
    pages = {
        'index': {
            id: 'index',
            name: 'Home',
            slug: 'index.html',
            title: 'Home Page',
            modules: [...],  // Content modules
            meta: {...}      // SEO metadata
        },
        'about': { ... },
        'contact': { ... }
    }

    // Switch between pages
    switchToPage(pageId) { ... }

    // Add new pages
    createNewPage() { ... }

    // Delete pages
    deletePage() { ... }
}

B. Module System (modules/*.js)

Drag-and-drop content blocks:

Available Modules:

  • Text - Rich text with formatting
  • Image - Images with captions
  • Hero - Full-width headers with backgrounds
  • Gallery - Photo grids (timeline, mystery-dots, grid layouts)
  • Button - Call-to-action buttons
  • …