Ultimate Site Parser

⏺ 🎯 ULTIMATE SITE PARSER OVERVIEW

The Ultimate Site Import → Edit → Export Pipeline


🌊 The Complete Flow

┌────────────────────────────────────────────────────┐
│                ULTIMATE SITE PARSER                │
│    Import ANY website → Convert to ACCID format    │
└────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────┐
│                 ACCID HTML BUILDER                 │
│   Edit visually → Add content → Update metadata    │
└────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────┐
│                 STATIC HTML EXPORT                 │
│  ZIP download → FTP upload → Fully portable sites  │
└────────────────────────────────────────────────────┘


1️⃣ ULTIMATE SITE PARSER – The Importer

What It Does

Imports ANY website and converts it to editable ACCID format

Core Components

A. Content Discovery (core.py:30-130)

Finds ALL pages on a site using multiple methods:

METHOD 1: Sitemap XML

sitemap_urls = get_sitemap_urls() # Parses sitemap.xml, handles indexes

Result: 100+ URLs from sitemap

METHOD 2: WordPress REST API

api_urls = get_wordpress_api_urls() # /wp-json/wp/v2/posts, pages

Result: All posts/pages with titles

METHOD 3: RSS/Atom Feeds

feed_urls = get_feed_urls() # RSS, Atom feeds

Result: Blog posts with metadata

FILTER: Remove junk

valid_urls = [url for url in all_urls if is_valid_content_url(url)]

Removes: /category/, /tag/, /author/, /feed/, etc.

Result: Clean list of actual content pages
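
A minimal sketch of what that junk filter could look like. The patterns and names (JUNK_SEGMENTS, ASSET_SUFFIXES) are illustrative assumptions; the actual checks in core.py may differ:

from urllib.parse import urlparse

# Illustrative junk patterns; the real filter in core.py may check more.
JUNK_SEGMENTS = ('/category/', '/tag/', '/author/', '/feed/', '/wp-json/')
ASSET_SUFFIXES = ('.jpg', '.png', '.gif', '.pdf', '.xml', '.css', '.js')

def is_valid_content_url(url: str) -> bool:
    """Keep only URLs that look like real content pages."""
    path = urlparse(url).path.lower()
    if any(segment in path for segment in JUNK_SEGMENTS):
        return False
    if path.endswith(ASSET_SUFFIXES):
        return False
    return True

all_urls = [
    'https://example.com/blog/my-post/',
    'https://example.com/category/news/',
    'https://example.com/feed/',
]
valid_urls = [url for url in all_urls if is_valid_content_url(url)]
# -> ['https://example.com/blog/my-post/']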

B. Framework Detection (core.py:429-496)

Automatically detects 20+ platforms:

if 'elementor' in html:
    framework = 'wordpress-elementor'
elif 'et_pb_' in html:
    framework = 'wordpress-divi'
elif 'wix.com' in html:
    framework = 'wix'
elif 'NEXT_DATA' in html:
    framework = 'nextjs'
# ... 16+ more platforms

Supports:

  • WordPress builders: Elementor, Divi, WPBakery, Beaver, Bricks, Gutenberg
  • SaaS platforms: Wix, Squarespace, Weebly, Shopify, Webflow
  • JS frameworks: Next.js, React, Vue, Angular, Gatsby, Astro
  • Other CMS: Joomla, Drupal
  • Legacy: FrontPage, Flash
  • Fallback: Generic HTML
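
Putting the checks above together, detection can be structured as a simple fingerprint table. The sketch below is an assumption about the shape of the code, not the actual implementation in core.py:429-496, and the 'generic-html' fallback name is illustrative:

# Fingerprint table; only the markers from the snippet above are shown.
FRAMEWORK_MARKERS = [
    ('elementor', 'wordpress-elementor'),
    ('et_pb_', 'wordpress-divi'),
    ('wix.com', 'wix'),
    ('NEXT_DATA', 'nextjs'),
    # ... 16+ more platforms
]

def detect_framework(html: str) -> str:
    """Return the first matching framework, falling back to generic HTML."""
    for marker, framework in FRAMEWORK_MARKERS:
        if marker in html:
            return framework
    return 'generic-html'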

C. Content Cleaning (core.py:322-401)

Strips all junk before parsing:

Remove tracking scripts

for script in soup.find_all('script'):
    src = script.get('src', '') or ''  # inline scripts have no src attribute
    if 'google-analytics' in src or 'facebook.net' in src:
        script.decompose()

Remove ads, social widgets, cookie notices, chat widgets

Remove nav/header/footer, sidebars, comments, related posts

Result: Pure content only
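
A hedged sketch of that cleaning pass, assuming BeautifulSoup and an illustrative selector list (the real code in core.py:322-401 covers more cases):

from bs4 import BeautifulSoup

# Illustrative selectors; core.py strips more than this.
JUNK_SELECTORS = ['nav', 'header', 'footer', 'aside',
                  '.sidebar', '.comments', '.related-posts', '.cookie-notice']

def clean_content(html: str) -> BeautifulSoup:
    soup = BeautifulSoup(html, 'html.parser')
    # Drop tracking scripts by src
    for script in soup.find_all('script'):
        src = script.get('src', '') or ''
        if 'google-analytics' in src or 'facebook.net' in src:
            script.decompose()
    # Drop structural chrome and widgets
    for selector in JUNK_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    return soup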

D. Parser System (parsers/*.py)

Routes to framework-specific parsers:

Registry-based system

@register_parser("wordpress-elementor")
def parse_elementor(soup):
    modules = []
    for section in soup.find_all(class_='elementor-section'):
        # Extract content
        # Create ACCID modules
        modules.append({
            'id': 'text-123-abc',
            'type': 'text',
            'content': '<p>...</p>',
            'layoutClass': 'body-text'
        })
    return modules
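
One plausible shape for the registry behind @register_parser. The names PARSERS and parse_page are assumptions for illustration, not the actual API of the parsers/ package:

# Assumed registry shape; the real package may wire this up differently.
PARSERS = {}

def register_parser(framework):
    """Decorator that maps a framework name to its parser function."""
    def decorator(func):
        PARSERS[framework] = func
        return func
    return decorator

@register_parser("generic")
def parse_generic(soup):
    return [{'id': 'text-1', 'type': 'text', 'content': soup.get_text(strip=True)}]

def parse_page(framework, soup):
    """Route to the framework-specific parser, falling back to the generic one."""
    parser = PARSERS.get(framework, PARSERS["generic"])
    return parser(soup)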

Parsers available:

  • generic.py – Fallback for any HTML

  • wordpress_elementor.py – Elementor builder

  • wordpress_divi.py – Divi builder

  • wordpress_gutenberg.py – Gutenberg blocks

  • wix.py – Wix sites

  • shopify.py – Shopify stores

  • squarespace.py – Squarespace sites

  • react_like.py – React/Next.js apps

  • …and more

E. Metadata Extraction (meta_extractor.py)

Extracts ALL SEO data from original site:

def extract_page_meta(url, html, framework):
    soup = BeautifulSoup(html, 'html.parser')

    meta = {
        'seo': {
            'title': extract_meta('og:title') or extract_title(),
            'description': extract_meta('description')
        },
        'author': extract_meta('author'),
        'date': extract_meta('date') or extract_from_schema(),
        'excerpt': extract_meta('description'),
        'tags': extract_keywords(),
        'categories': extract_from_schema(),
        'featuredImage': extract_meta('og:image')
    }
    return meta

Extracts:

  • SEO title & description
  • Author & publication date
  • Keywords β†’ tags
  • Categories (from structured data)
  • Featured images (Open Graph)
  • All existing SEO preserved!
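
The extract_meta() / extract_title() helpers used above could look roughly like this. This is a sketch assuming BeautifulSoup (and passing soup explicitly, where the real meta_extractor.py presumably closes over it):

from bs4 import BeautifulSoup

# Sketch only; the actual helpers in meta_extractor.py may differ.
def extract_meta(soup: BeautifulSoup, name: str):
    """Read a <meta> value by name= or property= (Open Graph tags use property=)."""
    tag = (soup.find('meta', attrs={'name': name})
           or soup.find('meta', attrs={'property': name}))
    return tag.get('content', '').strip() if tag and tag.get('content') else None

def extract_title(soup: BeautifulSoup):
    return soup.title.get_text(strip=True) if soup.title else None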

F. Scam Detection (scam_detection.py)

BONUS: Analyzes sites for red flags:

class ScamDetector:
    def analyze_site(url, html, framework):
        # Check domain age (< 30 days = critical)
        # Check contact info (none = critical)
        # Check SSL certificate (HTTP = high risk)
        # Check payment methods (wire transfer = critical)
        # Check testimonials (stock photos = high risk)
        # Check legal pages (no privacy policy = high)

        return {
            'verdict': 'LIKELY SCAM' | 'SUSPICIOUS' | 'LEGITIMATE',
            'scam_score': 0-100,
            'flags': [...]
        }

Protects users from importing scam sites!
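
The verdict/score mapping could be implemented as a simple weighted sum of flags. The weights, thresholds, and flag shape below are illustrative assumptions, not the actual values in scam_detection.py:

# Illustrative weights and thresholds only.
FLAG_WEIGHTS = {'critical': 30, 'high': 15, 'medium': 5}

def score_flags(flags):
    """Sum weighted flags (dicts with a 'severity' key) and map the total to a verdict."""
    score = min(100, sum(FLAG_WEIGHTS.get(flag['severity'], 0) for flag in flags))
    if score >= 70:
        verdict = 'LIKELY SCAM'
    elif score >= 30:
        verdict = 'SUSPICIOUS'
    else:
        verdict = 'LEGITIMATE'
    return {'verdict': verdict, 'scam_score': score, 'flags': flags}

# Example: no contact info + wire-transfer-only payments + stock-photo testimonials
score_flags([{'severity': 'critical'}, {'severity': 'critical'}, {'severity': 'high'}])
# -> {'verdict': 'LIKELY SCAM', 'scam_score': 75, ...}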

G. Export to ACCID Format (core.py:522-615)

project = {
    'pages': {
        'home': {
            'title': 'Home Page',
            'slug': 'index.html',
            'modules': [
                {
                    'id': 'text-123',
                    'type': 'text',
                    'content': '<h1>Welcome</h1>',
                    'layoutClass': 'heading-xl'
                },
                {
                    'id': 'image-456',
                    'type': 'image',
                    'src': 'https://...',
                    'alt': 'Hero image'
                }
            ],
            'meta': {
                'seo': {
                    'title': '...',
                    'description': '...'
                },
                'author': 'Original Author',
                'date': '2024-01-15',
                'tags': ['keyword1', 'keyword2'],
                'categories': ['Category A'],
                'featuredImage': 'https://...'
            }
        },
        'about': { ... },
        'contact': { ... }
    },
    'metadata': {
        'framework': 'wordpress-elementor',
        'total_pages': 15,
        'imported_at': '2025-01-15T10:30:00'
    }
}

Save to JSON

with open('imported_site.json', 'w') as f:
    json.dump(project, f, indent=2)
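
As a quick usage example, any downstream tool (the builder included) can re-load the exported project straight from that JSON file:

import json

# Round-trip check: the exported project is plain JSON.
with open('imported_site.json') as f:
    project = json.load(f)

print(project['metadata']['framework'], len(project['pages']), 'pages')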

Facebook Export Parser (facebook_export.py)

BONUS: Import Facebook data exports!

class FacebookExportParser:
    def parse_full_export(export_folder):
        # Parse posts.json    → Timeline
        # Parse photos/       → Gallery
        # Parse comments.json → Engagement stats
        # Parse friends.json  → Network

        return {
            'pages': {
                'timeline': {'modules': [...]},
                'photos': {'modules': [...]},
                'stats': {'modules': [...]}
            }
        }

Creates 3 pages from Facebook export:

  1. Timeline - Chronological posts
  2. Photo Gallery - All photos
  3. Stats & Insights - Engagement metrics
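
A minimal sketch of how the Timeline page's data might be read from the export folder, assuming a posts.json file with timestamp fields (real Facebook export layouts vary by version, so the path and field names are assumptions):

import json
from pathlib import Path

def load_timeline_posts(export_folder):
    """Read posts.json from the export and order posts newest-first for the Timeline."""
    posts_file = Path(export_folder) / 'posts.json'
    with open(posts_file, encoding='utf-8') as f:
        posts = json.load(f)
    return sorted(posts, key=lambda post: post.get('timestamp', 0), reverse=True)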

2️⃣ ACCID HTML BUILDER – The Editor

What It Does

Visual editor for imported (or new) sites with professional tools

Core Components

A. Page Management (creator/page-manager.js)

Multi-page website builder:

class PageManager {
    pages = {
        'index': {
            id: 'index',
            name: 'Home',
            slug: 'index.html',
            title: 'Home Page',
            modules: [...],  // Content modules
            meta: {...}      // SEO metadata
        },
        'about': { ... },
        'contact': { ... }
    }

    // Switch between pages
    switchToPage(pageId) { ... }

    // Add new pages
    createNewPage() { ... }

    // Delete pages
    deletePage() { ... }
}

B. Module System (modules/*.js)

Drag-and-drop content blocks:

Available Modules:

  • Text - Rich text with formatting
  • Image - Images with captions
  • Hero - Full-width headers with backgrounds
  • Gallery - Photo grids (timeline, mystery-dots, grid layouts)
  • Button - Call-to-action buttons
  • …