ACCID Pipeline

Test Checklist:

For a WordPress site:

bash

./start.sh discover https://some-wordpress-site.com --full-structure
# Should detect platform, pull categories/tags via REST API

For Dragon Lady (non-WordPress):

bash

./start.sh discover https://dragonladysf.com
./start.sh curate jobs/dragonladysf_com/
./start.sh tagger jobs/dragonladysf_com/
./start.sh preview jobs/dragonladysf_com/ --sample=3
./start.sh runner jobs/dragonladysf_com/ --all
./start.sh extract jobs/dragonladysf_com/ --clean
./start.sh convert jobs/dragonladysf_com/ --tag-groups

Key things to verify:

Resume works (run runner twice, second should skip completed)
Preview Excel shows screenshots + ✓/✗ for tags
Tag groups output has context_aware: true/false correctly
Failures don't kill the whole job

Show less

Site content extraction using Playwright.

Extract content from websites → organize by page type → output for HTML Builder.

Quick Start
Pipeline Overview
Step 1: Discover
Step 2: Curate
Step 3: Tagger
Step 4: Preview
Step 5: Runner
Step 6: Extract
Step 7: Convert
Action Reference
File Formats
Troubleshooting

Quick Start

chmod +x start.sh
./start.sh discover https://example.com
./start.sh curate jobs/example_com/
./start.sh tagger jobs/example_com/
./start.sh preview jobs/example_com/
./start.sh runner jobs/example_com/ --all
./start.sh extract jobs/example_com/
./start.sh convert jobs/example_com/

First run auto-installs: Playwright, Chromium, beautifulsoup4, openpyxl, pillow.

Pipeline Overview

DISCOVER → CURATE → TAGGER → PREVIEW → RUNNER → EXTRACT → CONVERT
   │         │         │        │         │         │         │
 Find      Organize  Define    Test     Fetch    Apply     Map to
 URLs      by type  selectors  first    pages    tags     modules

Step 1: Discover

Find all URLs on a site, optionally with full platform structure.

./start.sh discover URL [options]

Option	Default	What it does
`--max-pages=N`	200	Limit crawl
`--sitemap-only`	–	Only parse sitemaps
`--crawl-only`	–	Only follow links
`--full-structure`	–	Extract categories, tags, posts (WordPress)
`--output=FILE`	`jobs/{domain}/urls.json`	Custom output

Platform Detection

Automatically detects: WordPress, Shopify, Squarespace, Wix, Webflow, Ghost, Drupal, Astro, Next.js

Full Structure Mode (WordPress)

./start.sh discover https://wordpress-site.com --full-structure

Uses WordPress REST API to extract:

Categories with hierarchy
Tags
Authors
Posts mapped to taxonomies
Pages with parent/child relationships

This enables Vault to regenerate category pages, tag pages, archives – all the “dynamic” pages as static files.

Output with –full-structure:

{
  "site": {"platform": "wordpress"},
  "urls": [...],
  "structure": {
    "categories": {"tech": {"name": "Technology", "count": 45}},
    "tags": {"python": {"name": "Python", "count": 28}},
    "posts_by_category": {"tech": [{"title": "...", "slug": "..."}]},
    "posts_by_tag": {"python": [...]},
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}}
  }
}

Examples:

./start.sh discover https://example.com
./start.sh discover https://example.com --max-pages=50
./start.sh discover https://wordpress-site.com --sitemap-only

Step 2: Curate

Organize URLs and assign page types.

./start.sh curate JOB_DIR [options]

Option	What it does
`--list`	Show all URLs
`--auto`	Auto-detect page types
`--add-url=URL`	Add URL manually
`--skip-pattern=PATTERN`	Skip matching URLs
`--set-type=TYPE --pattern=PATTERN`	Set type for matches

Interactive commands:

Command	What it does
`list`	Show all URLs
`list TYPE`	Show URLs of type
`add URL [type]`	Add URL
`skip N`	Skip URL #N
`unskip N`	Unskip URL #N
`type N TYPE`	Set type for URL #N
`skip-pattern PATTERN`	Skip all matching
`type-pattern PATTERN TYPE`	Set type for all matching
`auto`	Auto-detect types
`summary`	Show type counts
`save`	Save
`quit`	Save and exit

Example session:

./start.sh curate jobs/example_com/

curator> auto
curator> type-pattern /gallery/ gallery
curator> skip-pattern /old-blog/
curator> summary
curator> save
curator> quit

Step 3: Tagger

Define CSS selectors for content.

./start.sh tagger JOB_DIR [options]

Option	What it does
`--show`	Show current tags
`--edit`	Open tags.json in editor
`--add NAME SELECTOR`	Add global tag
`--add-to TYPE NAME SELECTOR`	Add page-type tag
`--remove NAME [TYPE]`	Remove tag

Interactive commands:

Command	What it does
`add NAME SELECTOR`	Add global tag
`add NAME SELECTOR TYPE`	Add to page type
`remove NAME [TYPE]`	Remove tag
`show`	Display all tags
`edit`	Open in editor
`quit`	Exit

Example:

./start.sh tagger jobs/example_com/

tagger> add nav .main-navigation
tagger> add hero .hero-section homepage
tagger> add content .entry-content article
tagger> add gallery_grid .gallery-container gallery
tagger> show
tagger> quit

Selector quick reference:

Pattern	Matches
`#myId`	ID
`.myClass`	Class
`.class1.class2`	Multiple classes
`.parent .child`	Nested
`.parent > .child`	Direct child
`nav, .navigation`	Either (fallback)
`[data-type="hero"]`	Attribute

Step 4: Preview

Visual QA before full run.

./start.sh preview JOB_DIR [options]

Option	Default	What it does
`--sample=N`	5	Pages to preview
`--type=TYPE`	all	Only this type
`--urls=URL1,URL2`	–	Specific URLs

Output: preview/preview.xlsx with screenshots + extraction results per tag.

What to check:

✓ = selector found content
✗ = selector found nothing (fix selector)
Too much text = selector too broad
Missing content = selector too narrow

Step 5: Runner

Fetch pages with Playwright.

./start.sh runner JOB_DIR [options]

Option	Default	What it does
`--action-set=NAME`	basic	Which actions to run
`--type=TYPE`	–	Only this page type
`--all`	–	All non-skipped URLs
`--test-url=URL`	–	Single URL test
`--retry-failed`	–	Retry previously failed URLs
`--reset`	–	Clear fetched.json, start fresh

Resume support: Runner automatically skips already-completed URLs. Just run again to continue where you left off.

Built-in action sets:

Name	What it does
`basic`	Load + dismiss cookies
`homepage`	Load + hover nav
`article`	Load + click “read more”
`scroll_load`	Scroll 3x (lazy loading)
`age_gate`	Handle age verification
`gallery_modal`	Click through gallery images
`nav_hover`	Extract dropdown menus
`click_modal`	Click and capture modals
`pixel_click`	Click at coordinates

Example workflow:

# Different action sets per page type
./start.sh runner jobs/example_com/ --action-set=homepage --type=homepage
./start.sh runner jobs/example_com/ --action-set=article --type=article  
./start.sh runner jobs/example_com/ --action-set=gallery_modal --type=gallery

# Or just basic for everything
./start.sh runner jobs/example_com/ --all

Step 6: Extract

Apply tags to saved HTML.

./start.sh extract JOB_DIR [options]

Option	What it does
`--clean`	Remove scripts, tracking, hidden elements

For each tag, extracts:

found – true/false
html – Raw HTML
text – Text only
links – Array of {text, href}
images – Array of {src, alt}
videos – Array of {src, type}

Step 7: Convert

Map tags to HTML Builder modules or generate semantic tag groups.

./start.sh convert JOB_DIR [options]

Option	What it does
`--interactive`	Prompt for each tag
`--mapping=TAG:MOD,...`	Set from command line
`--tag-groups`	Generate semantic tag groups (new format)

Output Formats

Default: Module-based (htmlbuilder_import.json)

{
  "pages": [
    {
      "name": "About",
      "modules": [
        {"type": "navigation", "data": {...}},
        {"type": "text", "data": {...}}
      ]
    }
  ]
}

Tag Groups (--tag-groups → tag_groups.json)

{
  "tag_groups": {
    "sitename-nav": {
      "context_aware": false,
      "value": {"items": [...]}
    },
    "sitename-h1s": {
      "context_aware": true,
      "by_page": {
        "home": {"value": "Welcome"},
        "about": {"value": "About Us"}
      }
    }
  }
}

When to Use Each

Format	Use When
Module-based	Building static pages, each page independent
Tag Groups	Building dynamic pages, content varies by context

Tag Groups: Context Awareness

Global tags (same on all pages):

nav, navigation, footer, sidebar, header

Page-specific tags (vary per page):

hero, content, page_header, title, gallery items

The converter auto-detects context-awareness by checking if values differ across pages.

Module types:

Type	For
`navigation`	Nav menus
`hero`	Hero sections
`text`	Content blocks
`image`	Single images
`gallery`	Image galleries
`cards`	Card grids
`video`	Video embeds
`footer`	Footers
`html`	Raw HTML fallback
`skip`	Exclude

Action Reference

Actions are JSON objects with "do": "action_name". Used in action sets.

Action	Example	What it does
`goto`	`{"do": "goto", "wait": "networkidle"}`	Load page
`goto`	`{"do": "goto", "url": "https://...", "wait": "load"}`	Load specific URL
`wait`	`{"do": "wait", "seconds": 2}`	Wait N seconds

Wait conditions: load, domcontentloaded, networkidle

Mouse Actions

Action	Example	What it does
`click`	`{"do": "click", "selector": ".btn"}`	Click element
`click`	`{"do": "click", "selector": ".maybe", "optional": true}`	Click if exists
`hover`	`{"do": "hover", "selector": "nav"}`	Hover element
`click_xy`	`{"do": "click_xy", "x": 960, "y": 540}`	Click coordinates
`hover_xy`	`{"do": "hover_xy", "x": 100, "y": 50}`	Hover coordinates

Input

Action	Example	What it does
`type`	`{"do": "type", "selector": "#search", "text": "query"}`	Type text
`press`	`{"do": "press", "key": "Enter"}`	Press key

Scrolling

Action	Example	What it does
`scroll`	`{"do": "scroll", "to": "bottom"}`	Scroll to bottom
`scroll`	`{"do": "scroll", "to": "top"}`	Scroll to top
`scroll`	`{"do": "scroll", "to": 500}`	Scroll to pixel

Content Extraction

Action	What it does
`extract`	Save page HTML
`hover_extract`	Hover, capture what appears
`click_extract`	Click, capture modal/overlay
`extract_gallery`	Click through gallery, get all images

hover_extract:

{
  "do": "hover_extract",
  "selector": ".nav-item",
  "wait": 0.5,
  "extract": ".dropdown-menu",
  "name": "nav_dropdown"
}

click_extract:

{
  "do": "click_extract",
  "selector": ".thumbnail",
  "wait": 1,
  "extract": ".modal",
  "close": ".modal-close",
  "name": "modal_content"
}

extract_gallery:

{
  "do": "extract_gallery",
  "config": {
    "first_item": ".gallery-item",
    "modal": "#imageModal",
    "next_btn": ".modal-next",
    "close_btn": ".modal-close",
    "image": "#modalImage",
    "caption": "#modalCaption",
    "counter": "#modalCounter",
    "max_items": 100
  }
}

Debug

Action	Example	What it does
`screenshot`	`{"do": "screenshot", "name": "step1"}`	Save screenshot
`eval`	`{"do": "eval", "js": "..."}`	Run JavaScript

Viewport Config

Set in action set (for pixel-accurate clicking):

{
  "name": "my_action_set",
  "viewport_width": 1920,
  "viewport_height": 1080,
  "device_scale": 1,
  "actions": [...]
}

Custom Action Set Example

Save as jobs/{domain}/action-sets/my_custom.json:

{
  "name": "wordpress_elementor",
  "description": "Handle Elementor sites",
  "actions": [
    {"do": "goto", "wait": "networkidle"},
    {"do": "click", "selector": ".elementor-popup .close", "optional": true},
    {"do": "wait", "seconds": 1},
    {"do": "scroll", "to": "bottom"},
    {"do": "wait", "seconds": 2},
    {"do": "extract"}
  ]
}

File Formats

urls.json

{
  "site": {"url": "...", "domain": "..."},
  "urls": [
    {"url": "...", "title": "...", "page_type": "article", "skip": false}
  ]
}

tags.json

{
  "global": {
    "nav": ".selector"
  },
  "page_types": {
    "article": {
      "content": ".selector"
    }
  }
}

Action Set

{
  "name": "...",
  "viewport_width": 1400,
  "viewport_height": 900,
  "actions": [
    {"do": "goto"},
    {"do": "extract"}
  ]
}

Troubleshooting

Problem	Fix
“No module named playwright”	`rm -rf venv` then run again
Selector finds nothing	Test in browser: `document.querySelector('.x')`
Gallery loops forever	Set `max_items` in config
Pixel clicks miss	Match viewport to browser exactly
Astro/React breaks navigation	Use `wait` between actions
Permission denied	`chmod +x start.sh`
Job stopped partway	Just run again – auto-resumes
Want to re-fetch everything	`./start.sh runner JOB --reset --all`
Some pages failed	`./start.sh runner JOB --retry-failed`

Full Example: Dragon Lady Site

# 1. Find URLs
./start.sh discover https://dragonladysf.com

# 2. Organize
./start.sh curate jobs/dragonladysf_com/
# type-pattern /gallery/ gallery
# type-pattern /explore/ gallery
# type 0 homepage

# 3. Define selectors
./start.sh tagger jobs/dragonladysf_com/
# add nav .vertical-nav
# add hero .hero lowerlevel
# add content .full-content lowerlevel
# add gallery_grid .gallery-container gallery
# add modal #imageModal gallery

# 4. Preview
./start.sh preview jobs/dragonladysf_com/ --sample=3

# 5. Fetch (different action sets per type)
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=homepage
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=lowerlevel
./start.sh runner jobs/dragonladysf_com/ --action-set=gallery_modal --type=gallery

# 6. Extract
./start.sh extract jobs/dragonladysf_com/ --clean

# 7. Convert
./start.sh convert jobs/dragonladysf_com/ --interactive

Files in Package

accid/
├── start.sh           # Entry point (handles venv + deps)
├── README.md          # This file
├── discover.py        # URL discovery + platform detection
├── platforms.py       # Platform-specific structure extraction
├── curate.py          # URL organization
├── tagger.py          # Selector management
├── preview.py         # Visual QA
├── runner.py          # Page fetching
├── extract.py         # Content extraction
├── convert.py         # Module mapping + tag groups
├── vault_export.py    # Vault-ready export generation
├── example_tags.json  # Sample tags file
├── example_tag_groups.json  # Sample tag groups output
└── action-sets/       # Built-in action sets
    ├── gallery_modal.json
    ├── nav_hover.json
    ├── click_modal.json
    └── pixel_click.json

Vault Export

Generate everything Vault needs to rebuild “dynamic” pages as static files.

./start.sh vault-export jobs/example_com/

Requirements

Run discover with --full-structure:

./start.sh discover https://site.com --full-structure

Run convert with --tag-groups:

./start.sh convert jobs/site_com/ --tag-groups

Generate Vault export:
```
python vault_export.py jobs/site_com/
```

What It Generates

Content	Description
`tag_groups`	Context-aware content bindings
`taxonomies`	Categories, tags, authors
`relationships`	Posts mapped to categories/tags/authors
`generated_pages`	Category archives, tag archives, date archives
`navigation`	Main nav, category nav, footer
`search_index`	Full-text search data

Generated Archive Pages

For a WordPress site with 5 categories and 10 tags, Vault export creates:

5 category archive pages (/category/tech/, /category/news/, etc.)
10 tag archive pages (/tag/python/, /tag/javascript/, etc.)
Author archive pages (/author/john-doe/)
Year archives (/archive/2024/)
Month archives (/archive/2024/01/)

Each archive page includes:

Page metadata (title, description, post count)
List of posts in that archive
Template hint for rendering

How Vault Uses This

// In Vault: Load the export
const vaultData = await fetch('/vault_export.json').then(r => r.json());

// Get all posts in a category
const techPosts = vaultData.relationships.posts_by_category['tech'];

// Generate static category page
const categoryPage = {
  title: vaultData.taxonomies.categories['tech'].name,
  posts: techPosts,
  template: 'archive'
};

// Context-aware content
const pageH1 = vaultData.tag_groups['site-h1s'].by_page[currentPage];

The Key Insight

No database needed. All relationships are pre-computed:

Post → Categories mapping: posts_by_category
Post → Tags mapping: posts_by_tag
Category hierarchy: category_hierarchy
Search index: search_index

Vault just reads JSON and renders. “Dynamic” pages are actually static files generated from this data.

Test Checklist:

Table of Contents

Quick Start

Pipeline Overview

Step 1: Discover

Platform Detection

Full Structure Mode (WordPress)

Step 2: Curate

Step 3: Tagger

Step 4: Preview

Step 5: Runner

Step 6: Extract

Step 7: Convert

Output Formats

When to Use Each

Tag Groups: Context Awareness

Action Reference

Navigation & Timing

Mouse Actions

Input

Scrolling

Content Extraction

Debug

Viewport Config

Custom Action Set Example

File Formats

urls.json

tags.json

Action Set

Troubleshooting

Full Example: Dragon Lady Site

Files in Package

Vault Export

Requirements

What It Generates

Generated Archive Pages

How Vault Uses This

The Key Insight

Test Checklist:

Table of Contents

Quick Start

Pipeline Overview

Step 1: Discover

Platform Detection

Full Structure Mode (WordPress)

Step 2: Curate

Step 3: Tagger

Step 4: Preview

Step 5: Runner

Step 6: Extract

Step 7: Convert

Output Formats

When to Use Each

Tag Groups: Context Awareness

Action Reference

Navigation & Timing

Mouse Actions

Input

Scrolling

Content Extraction

Debug

Viewport Config

Custom Action Set Example

File Formats

urls.json

tags.json

Action Set

Troubleshooting

Full Example: Dragon Lady Site

Files in Package

Vault Export

Requirements

What It Generates

Generated Archive Pages

How Vault Uses This

The Key Insight

Related Posts