ACCID Scraper Pipeline

+ JSON-noculars

Complete System Documentation

Version 2.0 — February 11, 2026

What This System Does

The ACCID Scraper Pipeline takes any website, extracts its content (text, links, images, navigation), and converts it into a format the ACCID HTML Builder can import. The output is a static HTML site with the original content. No databases, no APIs, no frameworks.

What it is NOT: This is not a visual clone tool. We don't recreate the source site's layout, fonts, or design. We extract CONTENT so the user can start fresh with it. Think of it as a content liberation tool.

The pitch: "Your content is trapped in WordPress behind PHP, MySQL, themes, plugins, and a hosting bill. We pull it all out, hand it back as clean static HTML. It loads in 50ms, costs nothing to host, never breaks, never needs updating. Plus you get tag-based search and dynamic category pages for free."

Pipeline Overview

15 steps across 6 phases, with 4 gate checkpoints and 5 JSON-noculars integration points.

Phase Steps Purpose
RECON 1-4: setup, discover, audit, login Understand the target before touching it
CAPTURE 5-7: curate, runner, preview Get the raw HTML, fully rendered
SCOPE 8-9: tagger, scopes Define what to extract (noculars lives here)
EXTRACT 10-11: extract, validate Apply scopes to HTML, get structured data
CONVERT 12-13: convert, validate Transform data into builder modules
DELIVER 14-15: import, export Into the builder, ship the static site

Every Step in Detail

Phase: RECON

Step 1 setup Project Wizard
What Create job folder, choose extraction path (Full Access vs Scrape Only)  
CLI ./start.sh setup  
Outputs jobs/{site_name}/config.json  
Step 2 discover Crawl & Map URLs
What Find all URLs via sitemap.xml, robots.txt, and link crawling. Detect platform.  
CLI ./start.sh discover https://example.com --max-pages=50
Outputs jobs/{site_name}/urls.json  
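
A minimal sketch of the sitemap leg of discovery, assuming stdlib-only Python; the real discover step also reads robots.txt and follows links:

  import urllib.request
  import xml.etree.ElementTree as ET

  def sitemap_urls(base_url):
      """Collect <loc> entries from /sitemap.xml (one of discover's sources)."""
      with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml") as resp:
          tree = ET.fromstring(resp.read())
      # Sitemap namespaces vary, so match any tag that ends in 'loc'
      return [el.text for el in tree.iter() if el.tag.endswith("loc")]
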
Step 3 audit NOPE Detector
What Asset analysis, complexity score 1-10, SPA detection, auth detection, CDN analysis.  
CLI ./start.sh audit jobs/example_com/  
Outputs jobs/{site_name}/audit.json  
GATE If score > 7 or SPA detected, stop and assess. If auth required, run login before runner.  

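One way an SPA might be flagged, shown as a hypothetical heuristic (the marker list and size threshold are illustrative, not the detector's actual rules):

  def looks_like_spa(raw_html: str) -> bool:
      # A framework mount point plus near-empty markup suggests the
      # content is rendered client-side, not present in the raw HTML
      markers = ('id="root"', 'id="app"', "ng-version", "__NEXT_DATA__")
      return any(m in raw_html for m in markers) and len(raw_html) < 20_000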

Step 4 login Auth Capture (Optional)
What Opens visible browser for manual login. Captures cookies + localStorage. Runner reuses session.  
CLI ./start.sh login jobs/example_com/  
Outputs jobs/{site_name}/auth/storage_state.json  

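The capture amounts to Playwright's storage_state mechanism. A sketch of the idea (the prompt-driven flow and login URL are illustrative; the output path matches this step):

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=False)  # visible so you can log in
      context = browser.new_context()
      context.new_page().goto("https://example.com/login")
      input("Log in in the browser window, then press Enter...")
      # Persists cookies + localStorage; the runner reloads this file
      context.storage_state(path="jobs/example_com/auth/storage_state.json")
      browser.close()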

Phase: CAPTURE

Step 5 curate Review & Classify URLs
What Assign page types: homepage, post, page, category, gallery. Skip unwanted URLs.  
CLI ./start.sh curate jobs/example_com/  
Outputs jobs/{site_name}/urls.json (updated)  
Step 6 runner Fetch All Pages
What Playwright fetches every URL. Handles JS rendering, cookies, lazy loading, action sets.  
CLI ./start.sh runner jobs/example_com/ --all
Outputs jobs/{site_name}/html/*.html jobs/{site_name}/fetched.json  

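The core of a single fetch, sketched with Playwright's Python API (the wait strategy and paths are illustrative; the real runner also handles lazy loading and action sets):

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch()
      # Reuse the session captured by the login step
      context = browser.new_context(
          storage_state="jobs/example_com/auth/storage_state.json")
      page = context.new_page()
      page.goto("https://example.com/about", wait_until="networkidle")
      html = page.content()  # fully rendered DOM, not the raw response
      browser.close()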

Step 7 preview Visual QA
What Screenshots + Excel report. Verify we captured real content, not blank pages.  
CLI ./start.sh preview jobs/example_com/ --sample=5
Outputs jobs/{site_name}/preview/report.xlsx  
GATE Review screenshots. If pages are blank or blocked, re-run runner with login or different settings.  

Phase: SCOPE

Step 8 tagger Tag Elements
What Map CSS selectors to tag names. This is WHERE you click.  
CLI ./start.sh tagger jobs/example_com/  
Outputs jobs/{site_name}/tags.json  
{ } NOCULARS: USE HERE After tagging, paste sample HTML into JSON-noculars to verify each tag contains the expected data (links, images, text).  
Step 9 scopes Scope & Clean
What Write scopes.json with inner_selector, strip_selectors, and regex cleanup. This IS the noculars workflow.
CLI ./start.sh scopes jobs/example_com/  
Outputs jobs/{site_name}/scopes.json  
{ } NOCULARS: PRIMARY Starts local server on port 9860, opens noculars in browser. Paste HTML, X-ray, click, name, Add to Scope, Download scopes.json.  

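A sketch of what a scopes.json entry can look like. inner_selector and strip_selectors come from this step's description; the regex_cleanup key name and the selectors themselves are illustrative:

  {
    "content": {
      "inner_selector": "article.entry-content",
      "strip_selectors": [".share-buttons", ".cookie-banner"],
      "regex_cleanup": [["\\s{2,}", " "]]
    },
    "nav": {
      "inner_selector": "nav.main-menu",
      "strip_selectors": []
    }
  }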

Phase: EXTRACT

Step 10 extract Pull Structured Content
What Apply tags + scopes to every page. Extract text, html, links, images per tag.  
CLI ./start.sh extract jobs/example_com/ --clean
Outputs jobs/{site_name}/extracted.json  
{ } NOCULARS: VERIFY Load extracted.json in JSON mode. Verify nav has links, content has clean text, gallery has images. If wrong, adjust scopes.  
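
In spirit, applying one scope to one page looks like this BeautifulSoup sketch (the actual extractor may differ; the output mirrors the nested {text, html, links, images} shape described later):

  from bs4 import BeautifulSoup

  def apply_scope(html, scope):
      soup = BeautifulSoup(html, "html.parser")
      region = soup.select_one(scope["inner_selector"])
      if region is None:
          return None  # scope missed -- validate will flag this page
      for sel in scope.get("strip_selectors", []):
          for node in region.select(sel):
              node.decompose()  # drop banners, share widgets, etc.
      return {
          "text": region.get_text(" ", strip=True),
          "html": str(region),
          "links": [{"label": a.get_text(strip=True), "url": a["href"]}
                    for a in region.find_all("a", href=True)],
          "images": [{"src": img["src"], "alt": img.get("alt", "")}
                     for img in region.find_all("img", src=True)],
      }
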
Step 11 validate --extract Extraction QA
What Automated checks: missing tags, empty fields, [object Object] values, broken image URLs.
CLI ./start.sh validate jobs/example_com/ --extract
Outputs jobs/{site_name}/extraction_report.json  
GATE If >10% of pages have issues, fix scopes (step 9) and re-extract.  
{ } NOCULARS: DEBUG When issues flagged, load problem page data in JSON mode to diagnose. Load raw HTML in HTML mode to see why scope missed it.  

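The checks are simple to picture. A sketch, assuming extracted.json maps page URL to tag name to value (the real report format may differ):

  import json

  def extraction_issues(path):
      issues = []
      with open(path) as f:
          pages = json.load(f)
      for url, tags in pages.items():
          for name, value in tags.items():
              if value in (None, "", [], {}):
                  issues.append((url, name, "empty field"))
              elif "[object Object]" in str(value):
                  issues.append((url, name, "stringified object"))
      return issues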

Phase: CONVERT

Step 12 convert Map Data to Modules
What Generate htmlbuilder_import.json + tag_groups.json. Flatten nested data. Assemble pages.  
CLI ./start.sh convert jobs/example_com/ --tag-groups
Outputs jobs/{site_name}/htmlbuilder_import.json jobs/{site_name}/tag_groups.json jobs/{site_name}/accid_dropper.json  
{ } NOCULARS: VERIFY Load htmlbuilder_import.json in JSON mode. Verify modules have correct content, no [object Object].  

Step 13 validate --convert Conversion QA
What Round-trip check: does converted data match extraction? Flags lost text, links, images.
CLI ./start.sh validate jobs/example_com/ --convert
Outputs jobs/{site_name}/conversion_report.json  
GATE If content was lost in conversion, fix mapping rules and re-run.  

Phase: DELIVER

Step 14 import Load into Builder
What Copy accid_dropper.json + tag_groups.json into htmlbuilder_local/.  
CLI ./start.sh import jobs/example_com/ --to=htmlbuilder_local/
Outputs htmlbuilder_local/accid_dropper.json  
Step 15 export Build & Ship
What Arrange layout in builder, export as static HTML. ZIP download or FTP upload.  
CLI (In builder UI)  
Outputs Final static HTML site  

JSON-noculars

Visual HTML and JSON data inspection tool. Think of it as X-ray goggles for web data. It lets you see what data lives inside HTML elements or JSON structures before writing extraction rules.

Two Modes

HTML Mode: Paste raw HTML. It renders in a preview pane. X-ray mode shows data badges on hover — how many links, images, and characters of text each element contains. Click to select. Name the tag. Add to scope. Build up your entire scope.json visually.

JSON Mode: Paste JSON data (database exports, extracted.json, API responses). Renders as an interactive tree with type analysis. Every node shows counts: strings, numbers, URLs, images, arrays, objects. Click any node to select it for scope output.

Scope Library

Eight built-in patterns cover common extraction targets: Navigation Links, Hero Section, Gallery Images, Article Content, Footer Links, Product Card, Social Links, Meta/SEO. Click "Use" to apply any pattern to your current scope.

Your own patterns persist in browser localStorage. When you find a scope pattern that works well, save it with a name and description. Next time you scrape a similar site, it's one click to reuse it.

How It Integrates

The scopes command (./start.sh scopes jobs/example_com/) starts a local Python HTTP server on port 9860, finds a sample HTML file from the job folder, and opens JSON-noculars in your default browser with the file pre-loaded. You can also drag and drop files onto the window.
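
The moving parts are all stdlib. A sketch of the launch sequence (the directory served is an assumption about the repo layout):

  import threading, webbrowser
  from functools import partial
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  handler = partial(SimpleHTTPRequestHandler, directory="noculars/")
  server = HTTPServer(("127.0.0.1", 9860), handler)
  threading.Thread(target=server.serve_forever, daemon=True).start()
  webbrowser.open("http://127.0.0.1:9860/")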

Workflow

  1. Hover elements to see data badges (links, images, text length)
  2. Click to select — orange border marks the selection
  3. Type a tag name (nav, hero, content, gallery, footer)
  4. Click "+ Scope" to add to the accumulator
  5. Repeat for all sections of the page
  6. Click "Download" to save scopes.json
  7. Move scopes.json into your job folder

Converter: Tag to Module Mapping

Step 12 (convert) is the bridge between scraper and builder. These are the rules for converting extracted tags into HTML Builder modules.

Tag Names Module Type Data Shape Layout Order
nav, header NavigationModule {items: [{label, url}]} card-full 0
hero, banner HeroModule {headline, subtext, backgroundImage} hero-full 10
content, text, article TextModule {content: html_string} card-full 20
gallery, images GalleryModule {images: [{src, alt, caption}]} card-wide 30
cards, grid CardsModule {items: […]} card-wide 25
footer TextModule {content: html_string} card-full 100
meta, seo MetaModule {title, description} hidden -1
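
One plausible encoding of this table as data, sketched in Python (the fallback for unknown tags is an assumption; names match the modules above):

  TAG_TO_MODULE = {
      ("nav", "header"):              ("NavigationModule", "card-full",   0),
      ("hero", "banner"):             ("HeroModule",       "hero-full",  10),
      ("content", "text", "article"): ("TextModule",       "card-full",  20),
      ("cards", "grid"):              ("CardsModule",      "card-wide",  25),
      ("gallery", "images"):          ("GalleryModule",    "card-wide",  30),
      ("footer",):                    ("TextModule",       "card-full", 100),
      ("meta", "seo"):                ("MetaModule",       "hidden",     -1),
  }

  def module_for(tag):
      for names, spec in TAG_TO_MODULE.items():
          if tag in names:
              return spec
      return ("TextModule", "card-full", 50)  # assumed fallback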

The [object Object] Killer

The number one bug in the pipeline. Extracted values are often nested objects like {text, html, links, images} instead of plain strings. If you pass these directly to a module, it renders as [object Object] in the browser.

The fix: Always resolve nested objects to their appropriate primitive field. TextModule gets .html (preferred) or .text. NavigationModule gets .links mapped to {label, url}. GalleryModule gets .images array directly. Never pass the whole object.
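
The fix rule, sketched as a resolver (a minimal illustration of the rule above, not the converter's actual code):

  def resolve(value, module_type):
      if not isinstance(value, dict):
          return value  # already a primitive, safe to pass through
      if module_type == "TextModule":
          return value.get("html") or value.get("text", "")  # html preferred
      if module_type == "NavigationModule":
          return {"items": [{"label": l["label"], "url": l["url"]}
                            for l in value.get("links", [])]}
      if module_type == "GalleryModule":
          return {"images": value.get("images", [])}
      return value.get("text", "")  # never hand the whole object to a module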

Page Assembly Order

Each generated page gets modules in this order: MetaModule (hidden) → NavigationModule (showOnAllPages, card-full) → HeroModule (hero-full) → TextModule/content (card-full) → GalleryModule (card-wide) → TextModule/footer (showOnAllPages, card-full).
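
Given the layout orders from the mapping table, assembly reduces to a sort, sketched here (the showOnAllPages merge is an assumption about how shared modules are handled):

  def assemble_page(page_modules, shared_modules):
      # nav/footer carry showOnAllPages and are merged into every page
      merged = page_modules + [m for m in shared_modules
                               if m.get("showOnAllPages")]
      return sorted(merged, key=lambda m: m["order"])  # MetaModule (-1) first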

Known Gotchas

[object Object]: Nested objects passed to modules render as literal text. Flatten in converter step.

SPA Sites: React/Vue/Angular apps may serve empty HTML. Runner uses Playwright for JS execution, but some apps need extra wait time. Audit detects this early.

CDN Images: Images on Cloudflare, imgix, wp.com may block hotlinking or use expiring URLs. Download locally during runner, rewrite URLs in convert.

Relative URLs: Links like /images/hero.jpg need the source site's base URL prepended during conversion.
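
Python's urllib handles this directly, as in this sketch (the example URLs are illustrative):

  from urllib.parse import urljoin

  base = "https://example.com/blog/post-1"   # page the link was scraped from
  urljoin(base, "/images/hero.jpg")          # -> https://example.com/images/hero.jpg
  urljoin(base, "../assets/logo.png")        # -> https://example.com/assets/logo.png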

Cookie Banners: Captured as page content. Use action sets that click dismiss, or add banner selectors to strip_selectors in scopes.

Every Site Is Different: No one-size-fits-all extraction. One site keeps its content in .post-body, the next in article.entry-content, the next in something else entirely. Human judgment writes selectors. JSON-noculars makes that judgment faster.

Expected Success Rates

Task Rate Notes
Content extraction from sites 85-90% Remaining 10-15% is JS-rendered content, auth walls, CDN blocking
Data to module mapping 60-65% Standard content types work. Complex layouts need manual adjustment.
Building new sites from scraped data 75-80% The real use case. Not cloning, but building fresh from extracted content.
Visual clone of original site 30-40% Not the goal. By design. We liberate content, not clone designs.

Quick Reference: All Commands

Command What It Does
./start.sh setup Create new project
./start.sh discover URL --max-pages=50 Find all URLs
./start.sh audit jobs/site/ NOPE detector (GATE)
./start.sh login jobs/site/ Capture auth session (optional)
./start.sh curate jobs/site/ Classify URLs by type
./start.sh runner jobs/site/ --all Fetch all pages
./start.sh preview jobs/site/ --sample=5 Visual QA (GATE)
./start.sh tagger jobs/site/ Map selectors to tags
./start.sh scopes jobs/site/ Build scopes.json with noculars
./start.sh extract jobs/site/ --clean Extract structured content
./start.sh validate jobs/site/ --extract Extraction QA (GATE)
./start.sh convert jobs/site/ --tag-groups Generate builder modules
./start.sh validate jobs/site/ --convert Conversion QA (GATE)
./start.sh import jobs/site/ --to=htmlbuilder_local/ Load into builder
./start.sh status jobs/site/ Check pipeline progress
./start.sh vault-export jobs/site/ WordPress CMS export