JSON-noculars
Complete System Documentation
Version 2.0 — February 11, 2026
What This System Does
The ACCID Scraper Pipeline takes any website, extracts its content (text, links, images, navigation), and converts it into a format the ACCID HTML Builder can import. The output is a static HTML site with the original content. No databases, no APIs, no frameworks.
What it is NOT: This is not a visual clone tool. We don't recreate the source site's layout, fonts, or design. We extract CONTENT so the user can start fresh with it. Think of it as a content liberation tool.
The pitch: "Your content is trapped in WordPress behind PHP, MySQL, themes, plugins, and a hosting bill. We pull it all out, hand it back as clean static HTML. It loads in 50ms, costs nothing to host, never breaks, never needs updating. Plus you get tag-based search and dynamic category pages for free."
Pipeline Overview
15 steps across 6 phases, with 4 gate checkpoints and 5 JSON-noculars integration points.
| Phase | Steps | Purpose |
|---|---|---|
| RECON | 1-4: setup, discover, audit, login | Understand the target before touching it |
| CAPTURE | 5-7: curate, runner, preview | Get the raw HTML, fully rendered |
| SCOPE | 8-9: tagger, scopes | Define what to extract (noculars lives here) |
| EXTRACT | 10-11: extract, validate | Apply scopes to HTML, get structured data |
| CONVERT | 12-13: convert, validate | Transform data into builder modules |
| DELIVER | 14-15: import, export | Into the builder, ship the static site |
Every Step in Detail
Phase: RECON
| Step 1 | setup | Project Wizard |
|---|---|---|
| What | Create job folder, choose extraction path (Full Access vs Scrape Only) | |
| CLI | ./start.sh setup | |
| Outputs | jobs/{site_name}/config.json |
| Step 2 | discover | Crawl & Map URLs |
|---|---|---|
| What | Find all URLs via sitemap.xml, robots.txt, and link crawling. Detect platform. | |
| CLI | ./start.sh discover https://example.com --max-pages=50 | |
| Outputs | jobs/{site_name}/urls.json |
| Step 3 | audit | NOPE Detector |
|---|---|---|
| What | Asset analysis, complexity score 1-10, SPA detection, auth detection, CDN analysis. | |
| CLI | ./start.sh audit jobs/example_com/ | |
| Outputs | jobs/{site_name}/audit.json | |
| GATE | If score > 7 or SPA detected, stop and assess. If auth required, run login before runner. |
| Step 4 | login | Auth Capture (Optional) |
|---|---|---|
| What | Opens visible browser for manual login. Captures cookies + localStorage. Runner reuses session. | |
| CLI | ./start.sh login jobs/example_com/ | |
| Outputs | jobs/{site_name}/auth/storage_state.json |
Phase: CAPTURE
| Step 5 | curate | Review & Classify URLs |
|---|---|---|
| What | Assign page types: homepage, post, page, category, gallery. Skip unwanted URLs. | |
| CLI | ./start.sh curate jobs/example_com/ | |
| Outputs | jobs/{site_name}/urls.json (updated) |
| Step 6 | runner | Fetch All Pages |
|---|---|---|
| What | Playwright fetches every URL. Handles JS rendering, cookies, lazy loading, action sets. | |
| CLI | ./start.sh runner jobs/example_com/ --all | |
| Outputs | jobs/{site_name}/html/*.html jobs/{site_name}/fetched.json |
| Step 7 | preview | Visual QA |
|---|---|---|
| What | Screenshots + Excel report. Verify we captured real content, not blank pages. | |
| CLI | ./start.sh preview jobs/example_com/ --sample=5 | |
| Outputs | jobs/{site_name}/preview/report.xlsx | |
| GATE | Review screenshots. If pages are blank or blocked, re-run runner with login or different settings. |
Phase: SCOPE
| Step 8 | tagger | Tag Elements |
|---|---|---|
| What | Map CSS selectors to tag names. This is WHERE you click. | |
| CLI | ./start.sh tagger jobs/example_com/ | |
| Outputs | jobs/{site_name}/tags.json | |
| { } | NOCULARS: USE HERE After tagging, paste sample HTML into JSON-noculars to verify each tag contains the expected data (links, images, text). |
| Step 9 | scopes | Scope & Clean |
|---|---|---|
| What | Write scope.json with inner_selector, strip_selectors, regex cleanup. This IS the noculars workflow. | |
| CLI | ./start.sh scopes jobs/example_com/ | |
| Outputs | jobs/{site_name}/scopes.json | |
| { } | NOCULARS: PRIMARY Starts local server on port 9860, opens noculars in browser. Paste HTML, X-ray, click, name, Add to Scope, Download scopes.json. |
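A minimal scopes.json sketch, assuming the inner_selector / strip_selectors / regex keys described above. The selectors, the regex key name, and the per-tag layout are illustrative only — your actual file comes from the noculars Download button:

```json
{
  "nav": {
    "inner_selector": "nav.main-menu",
    "strip_selectors": [".cookie-banner", "script"]
  },
  "content": {
    "inner_selector": "article.entry-content",
    "strip_selectors": [".share-buttons", ".related-posts"],
    "regex": [{ "find": "\\s{2,}", "replace": " " }]
  }
}
```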
Phase: EXTRACT
| Step 10 | extract | Pull Structured Content |
|---|---|---|
| What | Apply tags + scopes to every page. Extract text, html, links, images per tag. | |
| CLI | ./start.sh extract jobs/example_com/ --clean | |
| Outputs | jobs/{site_name}/extracted.json | |
| { } | NOCULARS: VERIFY Load extracted.json in JSON mode. Verify nav has links, content has clean text, gallery has images. If wrong, adjust scopes. |
| Step 11 | validate --extract | Extraction QA |
|---|---|---|
| What | Automated checks: missing tags, empty fields, [object Object] values, broken image URLs. | |
| CLI | ./start.sh validate jobs/example_com/ --extract | |
| Outputs | jobs/{site_name}/extraction_report.json | |
| GATE | If >10% of pages have issues, fix scopes (step 9) and re-extract. | |
| { } | NOCULARS: DEBUG When issues flagged, load problem page data in JSON mode to diagnose. Load raw HTML in HTML mode to see why scope missed it. |
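A hedged sketch of one such check — scanning extracted data for literal [object Object] strings. The function name and traversal are illustrative, not the pipeline's actual implementation:

```python
def find_object_object(data, path="$"):
    """Recursively collect JSON paths whose value is the literal
    string '[object Object]' — a sign a nested object was stringified."""
    issues = []
    if isinstance(data, dict):
        for key, value in data.items():
            issues += find_object_object(value, f"{path}.{key}")
    elif isinstance(data, list):
        for i, value in enumerate(data):
            issues += find_object_object(value, f"{path}[{i}]")
    elif data == "[object Object]":
        issues.append(path)
    return issues

# Example: one bad value hiding in a page's content field
report = find_object_object({"pages": [{"content": "[object Object]", "title": "Home"}]})
# report == ["$.pages[0].content"]
```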
Phase: CONVERT
| Step 12 | convert | Map Data to Modules |
|---|---|---|
| What | Generate htmlbuilder_import.json + tag_groups.json. Flatten nested data. Assemble pages. | |
| CLI | ./start.sh convert jobs/example_com/ --tag-groups | |
| Outputs | jobs/{site_name}/htmlbuilder_import.json jobs/{site_name}/tag_groups.json jobs/{site_name}/accid_dropper.json | |
| { } | NOCULARS: VERIFY Load htmlbuilder_import.json in JSON mode. Verify modules have correct content, no [object Object]. |
| Step 13 | validate --convert | Conversion QA |
|---|---|---|
| What | Round-trip check: does converted data match extraction? Flags lost text, links, images. | |
| CLI | ./start.sh validate jobs/example_com/ --convert | |
| Outputs | jobs/{site_name}/conversion_report.json | |
| GATE | If content was lost in conversion, fix mapping rules and re-run. |
Phase: DELIVER
| Step 14 | import | Load into Builder |
|---|---|---|
| What | Copy accid_dropper.json + tag_groups.json into htmlbuilder_local/. | |
| CLI | ./start.sh import jobs/example_com/ --to=htmlbuilder_local/ | |
| Outputs | htmlbuilder_local/accid_dropper.json |
| Step 15 | export | Build & Ship |
|---|---|---|
| What | Arrange layout in builder, export as static HTML. ZIP download or FTP upload. | |
| CLI | (In builder UI) | |
| Outputs | Final static HTML site |
JSON-noculars
Visual HTML and JSON data inspection tool. Think of it as X-ray goggles for web data. It lets you see what data lives inside HTML elements or JSON structures before writing extraction rules.
Two Modes
HTML Mode: Paste raw HTML. It renders in a preview pane. X-ray mode shows data badges on hover — how many links, images, and characters of text each element contains. Click to select. Name the tag. Add to scope. Build up your entire scope.json visually.
JSON Mode: Paste JSON data (database exports, extracted.json, API responses). Renders as an interactive tree with type analysis. Every node shows counts: strings, numbers, URLs, images, arrays, objects. Click any node to select it for scope output.
Scope Library
Eight built-in patterns cover common extraction targets: Navigation Links, Hero Section, Gallery Images, Article Content, Footer Links, Product Card, Social Links, Meta/SEO. Click "Use" to apply any pattern to your current scope.
Your own patterns persist in browser localStorage. When you find a scope pattern that works well, save it with a name and description. Next time you scrape a similar site, it's one click to reuse it.
How It Integrates
The scopes command (./start.sh scopes jobs/example_com/) starts a local Python HTTP server on port 9860, finds a sample HTML file from the job folder, and opens JSON-noculars in your default browser with the file pre-loaded. You can also drag and drop files onto the window.
Workflow
- Hover elements to see data badges (links, images, text length)
- Click to select — orange border marks the selection
- Type a tag name (nav, hero, content, gallery, footer)
- Click "+ Scope" to add to the accumulator
- Repeat for all sections of the page
- Click "Download" to save scopes.json
- Move scopes.json into your job folder
Converter: Tag to Module Mapping
Step 12 (convert) is the bridge between scraper and builder. These are the rules for converting extracted tags into HTML Builder modules.
| Tag Names | Module Type | Data Shape | Layout | Order |
|---|---|---|---|---|
| nav, header | NavigationModule | {items: [{label, url}]} | card-full | 0 |
| hero, banner | HeroModule | {headline, subtext, backgroundImage} | hero-full | 10 |
| content, text, article | TextModule | {content: html_string} | card-full | 20 |
| gallery, images | GalleryModule | {images: [{src, alt, caption}]} | card-wide | 30 |
| cards, grid | CardsModule | {items: […]} | card-wide | 25 |
| footer | TextModule | {content: html_string} | card-full | 100 |
| meta, seo | MetaModule | {title, description} | hidden | -1 |
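The Order column drives page assembly: modules are sorted ascending, so MetaModule (-1) leads and the footer (100) closes the page. A minimal sketch with Order values taken from the table above (the role key is illustrative):

```python
modules = [
    {"type": "TextModule", "role": "content", "order": 20},
    {"type": "GalleryModule", "role": "gallery", "order": 30},
    {"type": "MetaModule", "role": "meta", "order": -1},
    {"type": "NavigationModule", "role": "nav", "order": 0},
    {"type": "HeroModule", "role": "hero", "order": 10},
]

# Sort ascending by order to get the final page layout
page = sorted(modules, key=lambda m: m["order"])
# Assembled order: Meta, Navigation, Hero, Text, Gallery
```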
The [object Object] Killer
The number one bug in the pipeline. Extracted values are often nested objects like {text, html, links, images} instead of plain strings. If you pass these directly to a module, it renders as [object Object] in the browser.
The fix: Always resolve nested objects to their appropriate primitive field. TextModule gets .html (preferred) or .text. NavigationModule gets .links mapped to {label, url}. GalleryModule gets .images array directly. Never pass the whole object.
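A sketch of that flattening rule in Python. The {text, href} shape inside links is an assumption, and the function is illustrative rather than the converter's actual code:

```python
def resolve_value(tag_value, module_type):
    """Reduce a nested extracted value ({text, html, links, images})
    to the primitive the target module expects."""
    if not isinstance(tag_value, dict):
        return tag_value  # already a plain string
    if module_type == "TextModule":
        # Prefer rendered HTML, fall back to plain text
        return tag_value.get("html") or tag_value.get("text", "")
    if module_type == "NavigationModule":
        # Assumed per-link shape: {text, href} -> {label, url}
        return [{"label": link.get("text", ""), "url": link.get("href", "")}
                for link in tag_value.get("links", [])]
    if module_type == "GalleryModule":
        return tag_value.get("images", [])
    # Fallback: never hand the raw dict to a module
    return tag_value.get("text", "")
```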
Page Assembly Order
Each generated page gets modules in this order: MetaModule (hidden) → NavigationModule (showOnAllPages, card-full) → HeroModule (hero-full) → TextModule/content (card-full) → GalleryModule (card-wide) → TextModule/footer (showOnAllPages, card-full).
Known Gotchas
[object Object]: Nested objects passed to modules render as literal text. Flatten in converter step.
SPA Sites: React/Vue/Angular apps may serve empty HTML. Runner uses Playwright for JS execution, but some apps need extra wait time. Audit detects this early.
CDN Images: Images on Cloudflare, imgix, wp.com may block hotlinking or use expiring URLs. Download locally during runner, rewrite URLs in convert.
Relative URLs: Links like /images/hero.jpg need the source site's base URL prepended during conversion.
Cookie Banners: Captured as page content. Use action sets that click dismiss, or add banner selectors to strip_selectors in scopes.
Every Site Is Different: There is no one-size-fits-all extraction. The main content might live in .post-body on one site and article.entry-content on another; every target needs its own tags and scopes.
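For the Relative URLs gotcha above, Python's standard library already resolves paths against a base URL correctly — a sketch of the rewrite:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post-1"  # the source page's URL

urljoin(base, "/images/hero.jpg")   # -> "https://example.com/images/hero.jpg"
urljoin(base, "gallery/photo.png")  # -> "https://example.com/blog/gallery/photo.png"
urljoin(base, "https://cdn.example.com/a.jpg")  # absolute URLs pass through unchanged
```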
Expected Success Rates
| Task | Rate | Notes |
|---|---|---|
| Content extraction from sites | 85-90% | Remaining 10-15% is JS-rendered content, auth walls, CDN blocking |
| Data to module mapping | 60-65% | Standard content types work. Complex layouts need manual adjustment. |
| Building new sites from scraped data | 75-80% | The real use case. Not cloning, but building fresh from extracted content. |
| Visual clone of original site | 30-40% | Not the goal. By design. We liberate content, not clone designs. |
Quick Reference: All Commands
| Command | What It Does |
|---|---|
| ./start.sh setup | Create new project |
| ./start.sh discover URL --max-pages=50 | Find all URLs |
| ./start.sh audit jobs/site/ | NOPE detector (GATE) |
| ./start.sh login jobs/site/ | Capture auth session (optional) |
| ./start.sh curate jobs/site/ | Classify URLs by type |
| ./start.sh runner jobs/site/ --all | Fetch all pages |
| ./start.sh preview jobs/site/ --sample=5 | Visual QA (GATE) |
| ./start.sh tagger jobs/site/ | Map selectors to tags |
| ./start.sh scopes jobs/site/ | Build scope.json with noculars |
| ./start.sh extract jobs/site/ --clean | Extract structured content |
| ./start.sh validate jobs/site/ --extract | Extraction QA (GATE) |
| ./start.sh convert jobs/site/ --tag-groups | Generate builder modules |
| ./start.sh validate jobs/site/ --convert | Conversion QA (GATE) |
| ./start.sh import jobs/site/ --to=htmlbuilder_local/ | Load into builder |
| ./start.sh status jobs/site/ | Check pipeline progress |
| ./start.sh vault-export jobs/site/ | WordPress CMS export |
