ACCID Scraper Pipeline

+ JSON-noculars

Complete System Documentation

Version 2.0 — February 11, 2026

What This System Does

The ACCID Scraper Pipeline takes any website, extracts its content (text, links, images, navigation), and converts it into a format the ACCID HTML Builder can import. The output is a static HTML site with the original content. No databases, no APIs, no frameworks.

What it is NOT: This is not a visual clone tool. We don't recreate the source site's layout, fonts, or design. We extract CONTENT so the user can start fresh with it. Think of it as a content liberation tool.

The pitch: "Your content is trapped in WordPress behind PHP, MySQL, themes, plugins, and a hosting bill. We pull it all out, hand it back as clean static HTML. It loads in 50ms, costs nothing to host, never breaks, never needs updating. Plus you get tag-based search and dynamic category pages for free."

Pipeline Overview

15 steps across 6 phases, with 4 gate checkpoints (after steps 3, 7, 11, and 13) and 5 JSON-noculars integration points.

| Phase | Steps | Purpose |
|---|---|---|
| RECON | 1-4: setup, discover, audit, login | Understand the target before touching it |
| CAPTURE | 5-7: curate, runner, preview | Get the raw HTML, fully rendered |
| SCOPE | 8-9: tagger, scopes | Define what to extract (noculars lives here) |
| EXTRACT | 10-11: extract, validate | Apply scopes to HTML, get structured data |
| CONVERT | 12-13: convert, validate | Transform data into builder modules |
| DELIVER | 14-15: import, export | Into the builder, ship the static site |

Every Step in Detail

Phase: RECON

Step 1: setup (Project Wizard)
What: Create job folder, choose extraction path (Full Access vs Scrape Only).
CLI: ./start.sh setup
Outputs: jobs/{site_name}/config.json

Step 2: discover (Crawl & Map URLs)
What: Find all URLs via sitemap.xml, robots.txt, and link crawling. Detect platform.
CLI: ./start.sh discover https://example.com --max-pages=50
Outputs: jobs/{site_name}/urls.json

Step 3: audit (NOPE Detector)
What: Asset analysis, complexity score 1-10, SPA detection, auth detection, CDN analysis.
CLI: ./start.sh audit jobs/example_com/
Outputs: jobs/{site_name}/audit.json
GATE: If score > 7 or SPA detected, stop and assess. If auth required, run login before runner.
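
The step 3 gate can be read as a small decision function. The sketch below is illustrative only: the field names complexity_score, spa_detected, and auth_required are assumptions, since the audit.json schema is not specified here.

```python
def audit_gate(audit: dict) -> str:
    """Return the next action implied by the step 3 gate.

    The keys read here (complexity_score, spa_detected, auth_required) are
    hypothetical names, not the documented audit.json schema.
    """
    score = audit.get("complexity_score", 0)
    # Gate rule: score above 7 or a detected SPA means a human should assess.
    if score > 7 or audit.get("spa_detected", False):
        return "stop-and-assess"
    # Auth walls are not a stop, but login must run before runner.
    if audit.get("auth_required", False):
        return "login-then-runner"
    return "proceed"
```

A score of exactly 7 passes, because the gate is written as "score > 7".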

Step 4: login (Auth Capture, optional)
What: Opens visible browser for manual login. Captures cookies + localStorage. Runner reuses session.
CLI: ./start.sh login jobs/example_com/
Outputs: jobs/{site_name}/auth/storage_state.json

Phase: CAPTURE

Step 5: curate (Review & Classify URLs)
What: Assign page types: homepage, post, page, category, gallery. Skip unwanted URLs.
CLI: ./start.sh curate jobs/example_com/
Outputs: jobs/{site_name}/urls.json (updated)

Step 6: runner (Fetch All Pages)
What: Playwright fetches every URL. Handles JS rendering, cookies, lazy loading, action sets.
CLI: ./start.sh runner jobs/example_com/ --all
Outputs: jobs/{site_name}/html/*.html, jobs/{site_name}/fetched.json

Step 7: preview (Visual QA)
What: Screenshots + Excel report. Verify we captured real content, not blank pages.
CLI: ./start.sh preview jobs/example_com/ --sample=5
Outputs: jobs/{site_name}/preview/report.xlsx
GATE: Review screenshots. If pages are blank or blocked, re-run runner with login or different settings.
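
Screenshots are the real check at this gate, but a rough text-based heuristic can pre-flag suspicious captures. The sketch below is an assumption about how such a check could work, not the preview command's actual logic; the 200-character threshold is arbitrary.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_blank(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a blank or blocked page."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

An SPA shell that contains only an empty root element and a large script bundle would be flagged, which is the case the gate is meant to catch.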

Phase: SCOPE

Step 8: tagger (Tag Elements)
What: Map CSS selectors to tag names. This is WHERE you click.
CLI: ./start.sh tagger jobs/example_com/
Outputs: jobs/{site_name}/tags.json
NOCULARS (use here): After tagging, paste sample HTML into JSON-noculars to verify each tag contains the expected data (links, images, text).

Step 9: scopes (Scope & Clean)
What: Write scopes.json with inner_selector, strip_selectors, regex cleanup. This IS the noculars workflow.
CLI: ./start.sh scopes jobs/example_com/
Outputs: jobs/{site_name}/scopes.json
NOCULARS (primary): Starts local server on port 9860, opens noculars in browser. Paste HTML, X-ray, click, name, Add to Scope, Download scopes.json.
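
To make the three scope ingredients concrete, here is a sketch of what one scope entry and its regex cleanup pass might look like. Only inner_selector and strip_selectors are named in the step description; the "tag" and "regex_cleanup" keys, the rule shape, and the selector values are all hypothetical.

```python
import re

# Hypothetical scope entry. The real scopes.json schema may differ.
example_scope = {
    "tag": "content",
    "inner_selector": "article .entry-content",
    "strip_selectors": [".share-buttons", ".cookie-banner"],
    "regex_cleanup": [
        {"pattern": r"\s+", "replace": " "},             # collapse whitespace
        {"pattern": r"Read more\s*»?$", "replace": ""},  # drop a trailing teaser link
    ],
}


def apply_regex_cleanup(text: str, rules: list) -> str:
    """Apply each cleanup rule in order, then trim the result."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replace"], text)
    return text.strip()
```

Rule order matters in this sketch: whitespace is collapsed first so that later patterns can assume single spaces.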

Phase: EXTRACT

Step 10: extract (Pull Structured Content)
What: Apply tags + scopes to every page. Extract text, html, links, images per tag.
CLI: ./start.sh extract jobs/example_com/ --clean
Outputs: jobs/{site_name}/extracted.json
NOCULARS (verify): Load extracted.json in JSON mode. Verify nav has links, content has clean text, gallery has images. If wrong, adjust scopes.

Step 11: validate --extract (Extraction QA)
What: Automated checks: missing tags, empty fields, [object Object] values, broken image URLs.
CLI: ./start.sh validate jobs/example_com/ --extract
Outputs: jobs/{site_name}/extraction_report.json
GATE: If >10% of pages have issues, fix scopes (step 9) and re-extract.
NOCULARS (debug): When issues are flagged, load problem page data in JSON mode to diagnose. Load raw HTML in HTML mode to see why scope missed it.
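
The checks and the 10% gate in step 11 could be sketched roughly as follows. This assumes each page is a dict with a "url" and a "tags" mapping whose values hold text, html, links, and images; that shape is an assumption, not the documented extracted.json format. The broken-image-URL check is left out because it needs network access.

```python
def validate_extraction(pages: list, required_tags: list, threshold: float = 0.10) -> dict:
    """Flag pages with missing tags, empty tags, or "[object Object]" values."""
    problems = {}
    for page in pages:
        issues = []
        tags = page.get("tags", {})
        for name in required_tags:
            if name not in tags:
                issues.append(f"missing tag: {name}")
        for name, value in tags.items():
            text = value.get("text", "")
            if "[object Object]" in text or "[object Object]" in value.get("html", ""):
                issues.append(f"[object Object] in: {name}")
            # A tag counts as empty when it has no text, links, or images.
            if not (text.strip() or value.get("links") or value.get("images")):
                issues.append(f"empty tag: {name}")
        if issues:
            problems[page["url"]] = issues
    rate = len(problems) / len(pages) if pages else 0.0
    return {"issue_rate": rate, "gate_failed": rate > threshold, "problems": problems}
```

The gate is evaluated per page, not per issue, so one page with five problems counts once toward the 10% threshold.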

Phase: CONVERT

Step 12: convert (Map Data to Modules)
What: Generate htmlbuilder_import.json + tag_groups.json. Flatten nested data. Assemble pages.
CLI: ./start.sh convert jobs/example_com/ --tag-groups
Outputs: jobs/{site_name}/htmlbuilder_import.json, jobs/{site_name}/tag_groups.json, jobs/{site_name}/accid_dropper.json
NOCULARS (verify): Load htmlbuilder_import.json in JSON mode. Verify modules have correct content, no [object Object].

Step 13: validate --convert (Conversion QA)
What: Round-trip check: does converted data match extraction? Flags lost text, links, images.
CLI: ./start.sh validate jobs/example_com/ --convert
Outputs: jobs/{site_name}/conversion_report.json
GATE: If content was lost in conversion, fix mapping rules and re-run.
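
One way to picture the round-trip check for links and images is to collect every URL on both sides and diff the sets. The sketch below assumes URLs live under "url" or "src" keys somewhere in the nested data and that conversion does not rewrite them; both are assumptions, and a real check would also need to compare text.

```python
def collect_urls(node) -> set:
    """Recursively collect every string stored under a "url" or "src" key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("url", "src") and isinstance(value, str):
                found.add(value)
            else:
                found |= collect_urls(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_urls(item)
    return found


def lost_in_conversion(extracted_page, converted_page) -> set:
    """URLs present in the extraction but absent after conversion."""
    return collect_urls(extracted_page) - collect_urls(converted_page)
```

A non-empty result for any page would be a reason to fail the gate and revisit the mapping rules.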

Phase: DELIVER

Step 14: import (Load into Builder)
What: Copy accid_dropper.json + tag_groups.json into htmlbuilder_local/.
CLI: ./start.sh import jobs/example_com/ --to=htmlbuilder_local/
Outputs: htmlbuilder_local/accid_dropper.json

Step 15: export (Build & Ship)
What: Arrange layout in builder, export as static HTML. ZIP download or FTP upload.
CLI: (In builder UI)
Outputs: Final static HTML site

JSON-noculars

Visual HTML and JSON data inspection tool. Think of it as X-ray goggles for web data. It lets you see what data lives inside HTML elements or JSON structures before writing extraction rules.

Two Modes

HTML Mode: Paste raw HTML. It renders in a preview pane. X-ray mode shows data badges on hover: how many links, images, and characters of text each element contains. Click to select. Name the tag. Add to scope. Build up your entire scopes.json visually.

JSON Mode: Paste JSON data (database exports, extracted.json, API responses). Renders as an interactive tree with type analysis. Every node shows counts: strings, numbers, URLs, images, arrays, objects. Click any node to select it for scope output.

Scope Library

Eight built-in patterns cover common extraction targets: Navigation Links, Hero Section, Gallery Images, Article Content, Footer Links, Product Card, Social Links, Meta/SEO. Click "Use" to apply any pattern to your current scope.

Your own patterns persist in browser localStorage. When you find a scope pattern that works well, save it with a name and description. Next time you scrape a similar site, it's one click to reuse it.

How It Integrates

The scopes command (./start.sh scopes jobs/example_com/) starts a local Python HTTP server on port 9860, finds a sample HTML file from the job folder, and opens JSON-noculars in your default browser with the file pre-loaded. You can also drag and drop files onto the window.
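
The launch behaviour described above can be approximated with the standard library. This is a sketch under assumptions, not the actual scopes command: it serves the job folder and opens the sample page directly, because the URL scheme JSON-noculars uses to pre-load a file is not documented here. The function names are hypothetical.

```python
import functools
import http.server
import socketserver
import webbrowser
from pathlib import Path
from typing import Optional

PORT = 9860  # port named in the documentation


def find_sample_html(job_dir: str) -> Optional[Path]:
    """Return the first captured HTML file in the job's html/ folder, if any."""
    candidates = sorted(Path(job_dir, "html").glob("*.html"))
    return candidates[0] if candidates else None


def serve_sample(job_dir: str, port: int = PORT) -> None:
    """Serve the job folder on localhost and open the sample page in a browser."""
    sample = find_sample_html(job_dir)
    if sample is None:
        raise FileNotFoundError(f"no HTML files under {job_dir}/html")
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=job_dir
    )
    with socketserver.TCPServer(("127.0.0.1", port), handler) as httpd:
        webbrowser.open(f"http://127.0.0.1:{port}/html/{sample.name}")
        httpd.serve_forever()  # blocks until interrupted (Ctrl+C)
```

Binding to 127.0.0.1 keeps the captured pages, which may include authenticated content, off the local network.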

Workflow

  1. Hover elements to see data badges (links, images, text length)
  2. Click to select — orange border marks the selection
  3. Type a tag name (nav, hero, content, gallery, footer)
  4. Click "+ Scope" to add to the accumulator
  5. Repeat for all sections of the page
  6. Click "Download" to save scopes.json
  7. Move scopes.json into your job folder

Converter: Tag to Module Mapping

Step 12 (convert) is the bridge between scraper and builder. These are the rules for converting extracted tags into HTML Builder modules.

| Tag Names | Module Type | Data Shape | Layout | Order |
|---|---|---|---|---|
| nav, header | NavigationModule | {items: [{label, url}]} | card-full | 0 |
| hero, banner | HeroModule | {headline, subtext, backgroundImage} | hero-full | 10 |
| content, text, article | TextModule | {content: html_string} | card-full | 20 |
| gallery, images | GalleryModule | {images: [{src, alt, caption}]} | card-wide | 30 |
| cards, grid | CardsModule | {items: […]} | card-wide | 25 |
| footer | TextModule | {content: html_string} | card-full | 100 |
| meta, seo | MetaModule | {title, description} | hidden | -1 |
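
The mapping table above translates naturally into a lookup. The dictionary below transcribes the table; the names TAG_TO_MODULE and module_for_tag are illustrative, and how the real converter stores these rules is not stated here.

```python
# Tag name -> (module type, layout, order), transcribed from the mapping table.
TAG_TO_MODULE = {
    "nav": ("NavigationModule", "card-full", 0),
    "header": ("NavigationModule", "card-full", 0),
    "hero": ("HeroModule", "hero-full", 10),
    "banner": ("HeroModule", "hero-full", 10),
    "content": ("TextModule", "card-full", 20),
    "text": ("TextModule", "card-full", 20),
    "article": ("TextModule", "card-full", 20),
    "gallery": ("GalleryModule", "card-wide", 30),
    "images": ("GalleryModule", "card-wide", 30),
    "cards": ("CardsModule", "card-wide", 25),
    "grid": ("CardsModule", "card-wide", 25),
    "footer": ("TextModule", "card-full", 100),
    "meta": ("MetaModule", "hidden", -1),
    "seo": ("MetaModule", "hidden", -1),
}


def module_for_tag(tag: str):
    """Look up the rule for a tag name, or None when no rule exists."""
    return TAG_TO_MODULE.get(tag.lower())
```

Returning None for unknown tags leaves the decision (skip, warn, or fall back to TextModule) to the caller.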

The [object Object] Killer

The number one bug in the pipeline. Extracted values are often nested objects like {text, html, links, images} instead of plain strings. If you pass these directly to a module, it renders as [object Object] in the browser.

The fix: Always resolve nested objects to their appropriate primitive field. TextModule gets .html (preferred) or .text. NavigationModule gets .links mapped to {label, url}. GalleryModule gets .images array directly. Never pass the whole object.
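
A minimal sketch of that flattening rule follows. It assumes extracted links carry "text" and "href" keys, which is a guess at the extraction format; the module data shapes come from the mapping table.

```python
def resolve_for_module(module_type: str, value) -> dict:
    """Reduce an extracted value to the data shape its module expects."""
    if isinstance(value, str):
        value = {"html": value}  # already flat: treat as ready-made content
    if module_type == "TextModule":
        # Prefer .html, fall back to .text, never pass the whole object.
        return {"content": value.get("html") or value.get("text", "")}
    if module_type == "NavigationModule":
        # The "text"/"href" link keys are an assumption about extracted.json.
        items = [
            {"label": link.get("text", ""), "url": link.get("href", "")}
            for link in value.get("links", [])
        ]
        return {"items": items}
    if module_type == "GalleryModule":
        return {"images": list(value.get("images", []))}
    raise ValueError(f"no flattening rule for {module_type}")
```

Raising on an unknown module type is deliberate in this sketch: a loud failure in convert is easier to debug than [object Object] showing up in the exported site.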

Page Assembly Order

Each generated page gets modules in this order: MetaModule (hidden) → NavigationModule (showOnAllPages, card-full) → HeroModule (hero-full) → TextModule/content (card-full) → GalleryModule (card-wide) → TextModule/footer (showOnAllPages, card-full).
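
The assembly order matches the Order column of the mapping table, so it can be reproduced with a plain sort. The module dicts below, including the "role" key, are a hypothetical shape used only for illustration.

```python
def assemble_page(modules: list) -> list:
    """Sort modules by numeric order; equal orders keep their input sequence."""
    return sorted(modules, key=lambda module: module["order"])


# Example input in arbitrary order, using the order values from the table.
example_modules = [
    {"type": "TextModule", "role": "footer", "order": 100},
    {"type": "GalleryModule", "order": 30},
    {"type": "MetaModule", "order": -1},
    {"type": "TextModule", "role": "content", "order": 20},
    {"type": "HeroModule", "order": 10},
    {"type": "NavigationModule", "order": 0},
]
```

Python's sort is stable, so two modules with the same order value stay in the sequence they were extracted in.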

Known Gotchas

[object Object]: Nested objects passed to modules render as literal text. Flatten in converter step.

SPA Sites: React/Vue/Angular apps may serve empty HTML. Runner uses Playwright for JS execution, but some apps need extra wait time. Audit detects this early.

CDN Images: Images on Cloudflare, imgix, wp.com may block hotlinking or use expiring URLs. Download locally during runner, rewrite URLs in convert.

Relative URLs: Links like /images/hero.jpg need the source site's base URL prepended during conversion.

Cookie Banners: Captured as page content. Use action sets that click dismiss, or add banner selectors to strip_selectors in scopes.

Every Site Is Different: No one-size-fits-all extraction. The content container might be .post-body on one site and article.entry-content on another. Human judgment writes selectors. JSON-noculars makes that judgment faster.
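
For the Relative URLs gotcha above, the standard library already handles the prepending. A small sketch, where the function name absolutize is illustrative:

```python
from urllib.parse import urljoin


def absolutize(url: str, base_url: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, url)
```

urljoin is used instead of string concatenation because it leaves absolute URLs untouched and resolves ../ segments against the page path, so the same call works for every link on the page.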

Expected Success Rates

| Task | Rate | Notes |
|---|---|---|
| Content extraction from sites | 85-90% | Remaining 10-15% is JS-rendered content, auth walls, CDN blocking |
| Data to module mapping | 60-65% | Standard content types work. Complex layouts need manual adjustment. |
| Building new sites from scraped data | 75-80% | The real use case. Not cloning, but building fresh from extracted content. |
| Visual clone of original site | 30-40% | Not the goal. By design. We liberate content, not clone designs. |

Quick Reference: All Commands

| Command | What It Does |
|---|---|
| ./start.sh setup | Create new project |
| ./start.sh discover URL --max-pages=50 | Find all URLs |
| ./start.sh audit jobs/site/ | NOPE detector (GATE) |
| ./start.sh login jobs/site/ | Capture auth session (optional) |
| ./start.sh curate jobs/site/ | Classify URLs by type |
| ./start.sh runner jobs/site/ --all | Fetch all pages |
| ./start.sh preview jobs/site/ --sample=5 | Visual QA (GATE) |
| ./start.sh tagger jobs/site/ | Map selectors to tags |
| ./start.sh scopes jobs/site/ | Build scopes.json with noculars |
| ./start.sh extract jobs/site/ --clean | Extract structured content |
| ./start.sh validate jobs/site/ --extract | Extraction QA (GATE) |
| ./start.sh convert jobs/site/ --tag-groups | Generate builder modules |
| ./start.sh validate jobs/site/ --convert | Conversion QA (GATE) |
| ./start.sh import jobs/site/ --to=htmlbuilder_local/ | Load into builder |
| ./start.sh status jobs/site/ | Check pipeline progress |
| ./start.sh vault-export jobs/site/ | WordPress CMS export |