ACCID Scraper Pipeline

+ JSON-noculars

Complete System Documentation

Version 2.0 — February 11, 2026

What This System Does

The ACCID Scraper Pipeline takes any website, extracts its content (text, links, images, navigation), and converts it into a format the ACCID HTML Builder can import. The output is a static HTML site with the original content. No databases, no APIs, no frameworks.

What it is NOT: This is not a visual clone tool. We don't recreate the source site's layout, fonts, or design. We extract CONTENT so the user can start fresh with it. Think of it as a content liberation tool.

The pitch: "Your content is trapped in WordPress behind PHP, MySQL, themes, plugins, and a hosting bill. We pull it all out, hand it back as clean static HTML. It loads in 50ms, costs nothing to host, never breaks, never needs updating. Plus you get tag-based search and dynamic category pages for free."

Pipeline Overview

15 steps across 6 phases, with 4 gate checkpoints and 5 JSON-noculars integration points.

Phase Steps Purpose
RECON 1-4: setup, discover, audit, login Understand the target before touching it
CAPTURE 5-7: curate, runner, preview Get the raw HTML, fully rendered
SCOPE 8-9: tagger, scopes Define what to extract (noculars lives here)
EXTRACT 10-11: extract, validate Apply scopes to HTML, get structured data
CONVERT 12-13: convert, validate Transform data into builder modules
DELIVER 14-15: import, export Into the builder, ship the static site

Every Step in Detail

Phase: RECON

Step 1 setup Project Wizard
What Create job folder, choose extraction path (Full Access vs Scrape Only)  
CLI ./start.sh setup  
Outputs jobs/{site_name}/config.json  
Step 2 discover Crawl & Map URLs
What Find all URLs via sitemap.xml, robots.txt, and link crawling. Detect platform.  
CLI ./start.sh discover https://example.com --max-pages=50
Outputs jobs/{site_name}/urls.json  
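
A minimal sketch of the sitemap leg of discovery, assuming stdlib-only Python; the real discover step also reads robots.txt and follows links:

  import urllib.request
  import xml.etree.ElementTree as ET

  def sitemap_urls(base_url):
      """Collect <loc> entries from /sitemap.xml (one of discover's sources)."""
      with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml") as resp:
          tree = ET.fromstring(resp.read())
      # Sitemap namespaces vary, so match any tag that ends in 'loc'
      return [el.text for el in tree.iter() if el.tag.endswith("loc")]
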
Step 3 audit NOPE Detector
What Asset analysis, complexity score 1-10, SPA detection, auth detection, CDN analysis.  
CLI ./start.sh audit jobs/example_com/  
Outputs jobs/{site_name}/audit.json  
GATE If score > 7 or SPA detected, stop and assess. If auth required, run login before runner.  

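One way an SPA might be flagged, shown as a hypothetical heuristic (the marker list and size threshold are illustrative, not the detector's actual rules):

  def looks_like_spa(raw_html: str) -> bool:
      # A framework mount point plus near-empty markup suggests the
      # content is rendered client-side, not present in the raw HTML
      markers = ('id="root"', 'id="app"', "ng-version", "__NEXT_DATA__")
      return any(m in raw_html for m in markers) and len(raw_html) < 20_000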

Step 4 login Auth Capture (Optional)
What Opens visible browser for manual login. Captures cookies + localStorage. Runner reuses session.  
CLI ./start.sh login jobs/example_com/  
Outputs jobs/{site_name}/auth/storage_state.json  

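The capture amounts to Playwright's storage_state mechanism. A sketch of the idea (the prompt-driven flow and login URL are illustrative; the output path matches this step):

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=False)  # visible so you can log in
      context = browser.new_context()
      context.new_page().goto("https://example.com/login")
      input("Log in in the browser window, then press Enter...")
      # Persists cookies + localStorage; the runner reloads this file
      context.storage_state(path="jobs/example_com/auth/storage_state.json")
      browser.close()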

Phase: CAPTURE

Step 5 curate Review & Classify URLs
What Assign page types: homepage, post, page, category, gallery. Skip unwanted URLs.  
CLI ./start.sh curate jobs/example_com/  
Outputs jobs/{site_name}/urls.json (updated)  
Step 6 runner Fetch All Pages
What Playwright fetches every URL. Handles JS rendering, cookies, lazy loading, action sets.  
CLI ./start.sh runner jobs/example_com/ --all
Outputs jobs/{site_name}/html/*.html jobs/{site_name}/fetched.json  

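The core of a single fetch, sketched with Playwright's Python API (the wait strategy and paths are illustrative; the real runner also handles lazy loading and action sets):

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch()
      # Reuse the session captured by the login step
      context = browser.new_context(
          storage_state="jobs/example_com/auth/storage_state.json")
      page = context.new_page()
      page.goto("https://example.com/about", wait_until="networkidle")
      html = page.content()  # fully rendered DOM, not the raw response
      browser.close()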

Step 7 preview Visual QA
What Screenshots + Excel report. Verify we captured real content, not blank pages.  
CLI ./start.sh preview jobs/example_com/ --sample=5
Outputs jobs/{site_name}/preview/report.xlsx  
GATE Review screenshots. If pages are blank or blocked, re-run runner with login or different settings.  

Phase: SCOPE

Step 8 tagger Tag Elements
What Map CSS selectors to tag names. This is WHERE you click.  
CLI ./start.sh tagger jobs/example_com/  
Outputs jobs/{site_name}/tags.json  
{ } NOCULARS: USE HERE After tagging, paste sample HTML into JSON-noculars to verify each tag contains the expected data (links, images, text).  
Step 9 scopes Scope & Clean
What Write scopes.json with inner_selector, strip_selectors, and regex cleanup. This IS the noculars workflow.
CLI ./start.sh scopes jobs/example_com/  
Outputs jobs/{site_name}/scopes.json  
{ } NOCULARS: PRIMARY Starts local server on port 9860, opens noculars in browser. Paste HTML, X-ray, click, name, Add to Scope, Download scopes.json.  

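A sketch of what a scopes.json entry can look like. inner_selector and strip_selectors come from this step's description; the regex_cleanup key name and the selectors themselves are illustrative:

  {
    "content": {
      "inner_selector": "article.entry-content",
      "strip_selectors": [".share-buttons", ".cookie-banner"],
      "regex_cleanup": [["\\s{2,}", " "]]
    },
    "nav": {
      "inner_selector": "nav.main-menu",
      "strip_selectors": []
    }
  }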

Phase: EXTRACT

Step 10 extract Pull Structured Content
What Apply tags + scopes to every page. Extract text, html, links, images per tag.  
CLI ./start.sh extract jobs/example_com/ --clean
Outputs jobs/{site_name}/extracted.json  
{ } NOCULARS: VERIFY Load extracted.json in JSON mode. Verify nav has links, content has clean text, gallery has images. If wrong, adjust scopes.  
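
In spirit, applying one scope to one page looks like this BeautifulSoup sketch (the actual extractor may differ; the output mirrors the nested {text, html, links, images} shape described later):

  from bs4 import BeautifulSoup

  def apply_scope(html, scope):
      soup = BeautifulSoup(html, "html.parser")
      region = soup.select_one(scope["inner_selector"])
      if region is None:
          return None  # scope missed -- validate will flag this page
      for sel in scope.get("strip_selectors", []):
          for node in region.select(sel):
              node.decompose()  # drop banners, share widgets, etc.
      return {
          "text": region.get_text(" ", strip=True),
          "html": str(region),
          "links": [{"label": a.get_text(strip=True), "url": a["href"]}
                    for a in region.find_all("a", href=True)],
          "images": [{"src": img["src"], "alt": img.get("alt", "")}
                     for img in region.find_all("img", src=True)],
      }
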
Step 11 validate --extract Extraction QA
What Automated checks: missing tags, empty fields, [object Object] values, broken image URLs.
CLI ./start.sh validate jobs/example_com/ --extract
Outputs jobs/{site_name}/extraction_report.json  
GATE If >10% of pages have issues, fix scopes (step 9) and re-extract.  
{ } NOCULARS: DEBUG When issues flagged, load problem page data in JSON mode to diagnose. Load raw HTML in HTML mode to see why scope missed it.  

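The checks are simple to picture. A sketch, assuming extracted.json maps page URL to tag name to value (the real report format may differ):

  import json

  def extraction_issues(path):
      issues = []
      with open(path) as f:
          pages = json.load(f)
      for url, tags in pages.items():
          for name, value in tags.items():
              if value in (None, "", [], {}):
                  issues.append((url, name, "empty field"))
              elif "[object Object]" in str(value):
                  issues.append((url, name, "stringified object"))
      return issues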

Phase: CONVERT

Step 12 convert Map Data to Modules
What Generate htmlbuilder_import.json + tag_groups.json. Flatten nested data. Assemble pages.  
CLI ./start.sh convert jobs/example_com/ --tag-groups
Outputs jobs/{site_name}/htmlbuilder_import.json jobs/{site_name}/tag_groups.json jobs/{site_name}/accid_dropper.json  
{ } NOCULARS: VERIFY Load htmlbuilder_import.json in JSON mode. Verify modules have correct content, no [object Object].  

Step 13 validate --convert Conversion QA
What Round-trip check: does converted data match extraction? Flags lost text, links, images.
CLI ./start.sh validate jobs/example_com/ --convert
Outputs jobs/{site_name}/conversion_report.json  
GATE If content was lost in conversion, fix mapping rules and re-run.  

Phase: DELIVER

Step 14 import Load into Builder
What Copy accid_dropper.json + tag_groups.json into htmlbuilder_local/.  
CLI ./start.sh import jobs/example_com/ --to=htmlbuilder_local/
Outputs htmlbuilder_local/accid_dropper.json  
Step 15 export Build & Ship
What Arrange layout in builder, export as static HTML. ZIP download or FTP upload.  
CLI (In builder UI)  
Outputs Final static HTML site  

JSON-noculars

Visual HTML and JSON data inspection tool. Think of it as X-ray goggles for web data. It lets you see what data lives inside HTML elements or JSON structures before writing extraction rules.

Two Modes

HTML Mode: Paste raw HTML. It renders in a preview pane. X-ray mode shows data badges on hover — how many links, images, and characters of text each element contains. Click to select. Name the tag. Add to scope. Build up your entire scope.json visually.

JSON Mode: Paste JSON data (database exports, extracted.json, API responses). Renders as an interactive tree with type analysis. Every node shows counts: strings, numbers, URLs, images, arrays, objects. Click any node to select it for scope output.

Scope Library

Eight built-in patterns cover common extraction targets: Navigation Links, Hero Section, Gallery Images, Article Content, Footer Links, Product Card, Social Links, Meta/SEO. Click "Use" to apply any pattern to your current scope.

Your own patterns persist in browser localStorage. When you find a scope pattern that works well, save it with a name and description. Next time you scrape a similar site, it's one click to reuse it.

How It Integrates

The scopes command (./start.sh scopes jobs/example_com/) starts a local Python HTTP server on port 9860, finds a sample HTML file from the job folder, and opens JSON-noculars in your default browser with the file pre-loaded. You can also drag and drop files onto the window.
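
The moving parts are all stdlib. A sketch of the launch sequence (the directory served is an assumption about the repo layout):

  import threading, webbrowser
  from functools import partial
  from http.server import HTTPServer, SimpleHTTPRequestHandler

  handler = partial(SimpleHTTPRequestHandler, directory="noculars/")
  server = HTTPServer(("127.0.0.1", 9860), handler)
  threading.Thread(target=server.serve_forever, daemon=True).start()
  webbrowser.open("http://127.0.0.1:9860/")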

Workflow

  1. Hover elements to see data badges (links, images, text length)
  2. Click to select — orange border marks the selection
  3. Type a tag name (nav, hero, content, gallery, footer)
  4. Click "+ Scope" to add to the accumulator
  5. Repeat for all sections of the page
  6. Click "Download" to save scopes.json
  7. Move scopes.json into your job folder

Converter: Tag to Module Mapping

Step 12 (convert) is the bridge between scraper and builder. These are the rules for converting extracted tags into HTML Builder modules.

Tag Names Module Type Data Shape Layout Order
nav, header NavigationModule {items: [{label, url}]} card-full 0
hero, banner HeroModule {headline, subtext, backgroundImage} hero-full 10
content, text, article TextModule {content: html_string} card-full 20
gallery, images GalleryModule {images: [{src, alt, caption}]} card-wide 30
cards, grid CardsModule {items: […]} card-wide 25
footer TextModule {content: html_string} card-full 100
meta, seo MetaModule {title, description} hidden -1
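
One plausible encoding of this table as data, sketched in Python (the fallback for unknown tags is an assumption; names match the modules above):

  TAG_TO_MODULE = {
      ("nav", "header"):              ("NavigationModule", "card-full",   0),
      ("hero", "banner"):             ("HeroModule",       "hero-full",  10),
      ("content", "text", "article"): ("TextModule",       "card-full",  20),
      ("cards", "grid"):              ("CardsModule",      "card-wide",  25),
      ("gallery", "images"):          ("GalleryModule",    "card-wide",  30),
      ("footer",):                    ("TextModule",       "card-full", 100),
      ("meta", "seo"):                ("MetaModule",       "hidden",     -1),
  }

  def module_for(tag):
      for names, spec in TAG_TO_MODULE.items():
          if tag in names:
              return spec
      return ("TextModule", "card-full", 50)  # assumed fallback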

The [object Object] Killer

The number one bug in the pipeline. Extracted values are often nested objects like {text, html, links, images} instead of plain strings. If you pass these directly to a module, it renders as [object Object] in the browser.

The fix: Always resolve nested objects to their appropriate primitive field. TextModule gets .html (preferred) or .text. NavigationModule gets .links mapped to {label, url}. GalleryModule gets .images array directly. Never pass the whole object.
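
The fix rule, sketched as a resolver (a minimal illustration of the rule above, not the converter's actual code):

  def resolve(value, module_type):
      if not isinstance(value, dict):
          return value  # already a primitive, safe to pass through
      if module_type == "TextModule":
          return value.get("html") or value.get("text", "")  # html preferred
      if module_type == "NavigationModule":
          return {"items": [{"label": l["label"], "url": l["url"]}
                            for l in value.get("links", [])]}
      if module_type == "GalleryModule":
          return {"images": value.get("images", [])}
      return value.get("text", "")  # never hand the whole object to a module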

Page Assembly Order

Each generated page gets modules in this order: MetaModule (hidden) → NavigationModule (showOnAllPages, card-full) → HeroModule (hero-full) → TextModule/content (card-full) → GalleryModule (card-wide) → TextModule/footer (showOnAllPages, card-full).
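
Given the layout orders from the mapping table, assembly reduces to a sort, sketched here (the showOnAllPages merge is an assumption about how shared modules are handled):

  def assemble_page(page_modules, shared_modules):
      # nav/footer carry showOnAllPages and are merged into every page
      merged = page_modules + [m for m in shared_modules
                               if m.get("showOnAllPages")]
      return sorted(merged, key=lambda m: m["order"])  # MetaModule (-1) first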

Known Gotchas

[object Object]: Nested objects passed to modules render as literal text. Flatten in converter step.

SPA Sites: React/Vue/Angular apps may serve empty HTML. Runner uses Playwright for JS execution, but some apps need extra wait time. Audit detects this early.

CDN Images: Images on Cloudflare, imgix, wp.com may block hotlinking or use expiring URLs. Download locally during runner, rewrite URLs in convert.

Relative URLs: Links like /images/hero.jpg need the source site's base URL prepended during conversion.
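
Python's urllib handles this directly, as in this sketch (the example URLs are illustrative):

  from urllib.parse import urljoin

  base = "https://example.com/blog/post-1"   # page the link was scraped from
  urljoin(base, "/images/hero.jpg")          # -> https://example.com/images/hero.jpg
  urljoin(base, "../assets/logo.png")        # -> https://example.com/assets/logo.png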

Cookie Banners: Captured as page content. Use action sets that click dismiss, or add banner selectors to strip_selectors in scopes.

Every Site Is Different: No one-size-fits-all extraction. One site keeps its content in .post-body, the next in article.entry-content, the next in something else entirely. Human judgment writes selectors. JSON-noculars makes that judgment faster.

Expected Success Rates

Task Rate Notes
Content extraction from sites 85-90% Remaining 10-15% is JS-rendered content, auth walls, CDN blocking
Data to module mapping 60-65% Standard content types work. Complex layouts need manual adjustment.
Building new sites from scraped data 75-80% The real use case. Not cloning, but building fresh from extracted content.
Visual clone of original site 30-40% Not the goal. By design. We liberate content, not clone designs.

Quick Reference: All Commands

Command What It Does
./start.sh setup Create new project
./start.sh discover URL --max-pages=50 Find all URLs
./start.sh audit jobs/site/ NOPE detector (GATE)
./start.sh login jobs/site/ Capture auth session (optional)
./start.sh curate jobs/site/ Classify URLs by type
./start.sh runner jobs/site/ --all Fetch all pages
./start.sh preview jobs/site/ --sample=5 Visual QA (GATE)
./start.sh tagger jobs/site/ Map selectors to tags
./start.sh scopes jobs/site/ Build scopes.json with noculars
./start.sh extract jobs/site/ --clean Extract structured content
./start.sh validate jobs/site/ --extract Extraction QA (GATE)
./start.sh convert jobs/site/ --tag-groups Generate builder modules
./start.sh validate jobs/site/ --convert Conversion QA (GATE)
./start.sh import jobs/site/ --to=htmlbuilder_local/ Load into builder
./start.sh status jobs/site/ Check pipeline progress
./start.sh vault-export jobs/site/ WordPress CMS export