ACCID Pipeline

Site content extraction using Playwright.

Extract content from websites → organize by page type → output for HTML Builder.


Table of Contents

  1. Quick Start
  2. Pipeline Overview
  3. Step 1: Discover
  4. Step 2: Curate
  5. Step 3: Tagger
  6. Step 4: Preview
  7. Step 5: Runner
  8. Step 6: Extract
  9. Step 7: Convert
  10. Action Reference
  11. File Formats
  12. Troubleshooting
  13. Full Example: Dragon Lady Site
  14. Files in Package
  15. Vault Export

Quick Start

chmod +x start.sh
./start.sh discover https://example.com
./start.sh curate jobs/example_com/
./start.sh tagger jobs/example_com/
./start.sh preview jobs/example_com/
./start.sh runner jobs/example_com/ --all
./start.sh extract jobs/example_com/
./start.sh convert jobs/example_com/

First run auto-installs: Playwright, Chromium, beautifulsoup4, openpyxl, pillow.
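
The job directories used in the commands above appear to be named after the site's domain with dots replaced by underscores (example.com becomes jobs/example_com/). A minimal sketch of that naming, inferred only from the examples in this README; `job_dir_for` is a hypothetical helper, not part of the package, and how subdomains such as www. are handled is not documented:

```python
from pathlib import Path
from urllib.parse import urlparse


def job_dir_for(url: str, root: str = "jobs") -> Path:
    """Guess the job directory for a site URL (naming inferred from the README examples)."""
    host = urlparse(url).hostname or ""
    # Assumption: dots become underscores, nothing else is rewritten.
    return Path(root) / host.replace(".", "_")


print(job_dir_for("https://example.com").as_posix())       # jobs/example_com
print(job_dir_for("https://dragonladysf.com").as_posix())  # jobs/dragonladysf_com
```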


Pipeline Overview

DISCOVER → CURATE → TAGGER → PREVIEW → RUNNER → EXTRACT → CONVERT
   │         │         │        │         │         │         │
 Find      Organize  Define    Test     Fetch    Apply     Map to
 URLs      by type  selectors  first    pages    tags     modules

Step 1: Discover

Find all URLs on a site, optionally with full platform structure.

./start.sh discover URL [options]
Option            Default                  What it does
--max-pages=N     200                      Limit crawl
--sitemap-only    -                        Only parse sitemaps
--crawl-only      -                        Only follow links
--full-structure  -                        Extract categories, tags, posts (WordPress)
--output=FILE     jobs/{domain}/urls.json  Custom output

Platform Detection

Automatically detects: WordPress, Shopify, Squarespace, Wix, Webflow, Ghost, Drupal, Astro, Next.js

Full Structure Mode (WordPress)

./start.sh discover https://wordpress-site.com --full-structure

Uses WordPress REST API to extract:

  • Categories with hierarchy
  • Tags
  • Authors
  • Posts mapped to taxonomies
  • Pages with parent/child relationships

This enables Vault to regenerate category pages, tag pages, archives – all the “dynamic” pages as static files.

Output with --full-structure:

{
  "site": {"platform": "wordpress"},
  "urls": [...],
  "structure": {
    "categories": {"tech": {"name": "Technology", "count": 45}},
    "tags": {"python": {"name": "Python", "count": 28}},
    "posts_by_category": {"tech": [{"title": "...", "slug": "..."}]},
    "posts_by_tag": {"python": [...]},
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}}
  }
}
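
To show how the structure block can be consumed, here is a small sketch that walks the categories and their children. The sample data mirrors the output shape above with illustrative values; `category_summary` is a hypothetical helper, not a function shipped with the pipeline.

```python
import json

# Sample shaped like the --full-structure output above (values are illustrative).
sample = json.loads("""
{
  "site": {"platform": "wordpress"},
  "urls": [],
  "structure": {
    "categories": {"tech": {"name": "Technology", "count": 45}},
    "posts_by_category": {"tech": [{"title": "Hello", "slug": "hello"}]},
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}}
  }
}
""")


def category_summary(data: dict) -> list[tuple[str, int, list[str]]]:
    """Return (display name, post count, child slugs) for each category."""
    structure = data.get("structure", {})
    hierarchy = structure.get("category_hierarchy", {})
    rows = []
    for slug, info in structure.get("categories", {}).items():
        children = hierarchy.get(slug, {}).get("children", [])
        rows.append((info["name"], info["count"], children))
    return rows


for name, count, children in category_summary(sample):
    print(f"{name}: {count} posts, children: {', '.join(children) or 'none'}")
```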

Examples:

./start.sh discover https://example.com
./start.sh discover https://example.com --max-pages=50
./start.sh discover https://wordpress-site.com --sitemap-only

Step 2: Curate

Organize URLs and assign page types.

./start.sh curate JOB_DIR [options]
Option                             What it does
--list                             Show all URLs
--auto                             Auto-detect page types
--add-url=URL                      Add URL manually
--skip-pattern=PATTERN             Skip matching URLs
--set-type=TYPE --pattern=PATTERN  Set type for matches

Interactive commands:

Command                    What it does
list                       Show all URLs
list TYPE                  Show URLs of type
add URL [type]             Add URL
skip N                     Skip URL #N
unskip N                   Unskip URL #N
type N TYPE                Set type for URL #N
skip-pattern PATTERN       Skip all matching
type-pattern PATTERN TYPE  Set type for all matching
auto                       Auto-detect types
summary                    Show type counts
save                       Save
quit                       Save and exit

Example session:

./start.sh curate jobs/example_com/

curator> auto
curator> type-pattern /gallery/ gallery
curator> skip-pattern /old-blog/
curator> summary
curator> save
curator> quit
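
What the session above does to urls.json can be sketched roughly as follows. This assumes patterns are plain substring matches and that summary counts only non-skipped URLs; neither is confirmed by this README, and the three functions are illustrative stand-ins, not the code in curate.py.

```python
from collections import Counter


def type_pattern(urls: list[dict], pattern: str, page_type: str) -> int:
    """Set page_type on every entry whose URL contains `pattern`; return the hit count."""
    hits = [entry for entry in urls if pattern in entry["url"]]
    for entry in hits:
        entry["page_type"] = page_type
    return len(hits)


def skip_pattern(urls: list[dict], pattern: str) -> int:
    """Mark every entry whose URL contains `pattern` as skipped; return the hit count."""
    hits = [entry for entry in urls if pattern in entry["url"]]
    for entry in hits:
        entry["skip"] = True
    return len(hits)


def summary(urls: list[dict]) -> Counter:
    """Count non-skipped URLs per page type."""
    return Counter(e["page_type"] for e in urls if not e["skip"])


# Hypothetical entries in the urls.json shape; "page" is a placeholder type.
urls = [
    {"url": "https://example.com/", "title": "Home", "page_type": "page", "skip": False},
    {"url": "https://example.com/gallery/summer", "title": "Summer", "page_type": "page", "skip": False},
    {"url": "https://example.com/old-blog/2012", "title": "Old", "page_type": "page", "skip": False},
]

type_pattern(urls, "/gallery/", "gallery")
skip_pattern(urls, "/old-blog/")
print(dict(summary(urls)))  # {'page': 1, 'gallery': 1}
```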

Step 3: Tagger

Define CSS selectors for content.

./start.sh tagger JOB_DIR [options]
Option                       What it does
--show                       Show current tags
--edit                       Open tags.json in editor
--add NAME SELECTOR          Add global tag
--add-to TYPE NAME SELECTOR  Add page-type tag
--remove NAME [TYPE]         Remove tag

Interactive commands:

Command                 What it does
add NAME SELECTOR       Add global tag
add NAME SELECTOR TYPE  Add to page type
remove NAME [TYPE]      Remove tag
show                    Display all tags
edit                    Open in editor
quit                    Exit

Example:

./start.sh tagger jobs/example_com/

tagger> add nav .main-navigation
tagger> add hero .hero-section homepage
tagger> add content .entry-content article
tagger> add gallery_grid .gallery-container gallery
tagger> show
tagger> quit

Selector quick reference:

Pattern             Matches
#myId               ID
.myClass            Class
.class1.class2      Multiple classes
.parent .child      Nested
.parent > .child    Direct child
nav, .navigation    Either (fallback)
[data-type="hero"]  Attribute

Step 4: Preview

Visual QA before full run.

./start.sh preview JOB_DIR [options]
Option            Default  What it does
--sample=N        5        Pages to preview
--type=TYPE       all      Only this type
--urls=URL1,URL2  -        Specific URLs

Output: preview/preview.xlsx with screenshots + extraction results per tag.

What to check:

  • ✓ = selector found content
  • ✗ = selector found nothing (fix selector)
  • Too much text = selector too broad
  • Missing content = selector too narrow

Step 5: Runner

Fetch pages with Playwright.

./start.sh runner JOB_DIR [options]
Option             Default  What it does
--action-set=NAME  basic    Which actions to run
--type=TYPE        -        Only this page type
--all              -        All non-skipped URLs
--test-url=URL     -        Single URL test
--retry-failed     -        Retry previously failed URLs
--reset            -        Clear fetched.json, start fresh

Resume support: Runner automatically skips already-completed URLs. Just run again to continue where you left off.

Built-in action sets:

Name           What it does
basic          Load + dismiss cookies
homepage       Load + hover nav
article        Load + click “read more”
scroll_load    Scroll 3x (lazy loading)
age_gate       Handle age verification
gallery_modal  Click through gallery images
nav_hover      Extract dropdown menus
click_modal    Click and capture modals
pixel_click    Click at coordinates

Example workflow:

# Different action sets per page type
./start.sh runner jobs/example_com/ --action-set=homepage --type=homepage
./start.sh runner jobs/example_com/ --action-set=article --type=article  
./start.sh runner jobs/example_com/ --action-set=gallery_modal --type=gallery

# Or just basic for everything
./start.sh runner jobs/example_com/ --all

Step 6: Extract

Apply tags to saved HTML.

./start.sh extract JOB_DIR [options]
Option   What it does
--clean  Remove scripts, tracking, hidden elements

For each tag, extracts:

  • found – true/false
  • html – Raw HTML
  • text – Text only
  • links – Array of {text, href}
  • images – Array of {src, alt}
  • videos – Array of {src, type}
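
To make the per-tag record concrete, here is a rough sketch that derives text, links, and images from one HTML fragment using only the standard library. It is a stand-in for illustration: the pipeline itself installs beautifulsoup4, the class and function names are hypothetical, and videos are left out for brevity.

```python
from html.parser import HTMLParser


class FragmentFields(HTMLParser):
    """Collect text, links, and images from one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self.images = []
        self._open_link = None  # link dict currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._open_link = {"text": "", "href": attrs["href"]}
            self.links.append(self._open_link)
        elif tag == "img" and attrs.get("src"):
            self.images.append({"src": attrs["src"], "alt": attrs.get("alt") or ""})

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_link = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        self.text_parts.append(chunk)
        if self._open_link is not None:
            self._open_link["text"] += chunk


def fields_for(html: str) -> dict:
    """Build a record with the found/html/text/links/images fields for a fragment."""
    parser = FragmentFields()
    parser.feed(html)
    parser.close()
    return {
        "found": bool(html),
        "html": html,
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "images": parser.images,
    }


record = fields_for('<p>See <a href="/about">About us</a></p><img src="/a.jpg" alt="A">')
print(record["text"], record["links"], record["images"])
```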

Step 7: Convert

Map tags to HTML Builder modules or generate semantic tag groups.

./start.sh convert JOB_DIR [options]
Option                 What it does
--interactive          Prompt for each tag
--mapping=TAG:MOD,...  Set the tag-to-module mapping on the command line
--tag-groups           Generate semantic tag groups (new format)
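
The --mapping value is a comma-separated list of TAG:MOD pairs. A sketch of how such a string could be parsed and checked against the module types listed in this section; `parse_mapping` is hypothetical, and whether convert.py validates module names this way is an assumption.

```python
# Module types as listed under "Module types" in this README.
MODULE_TYPES = {
    "navigation", "hero", "text", "image", "gallery",
    "cards", "video", "footer", "html", "skip",
}


def parse_mapping(arg: str) -> dict[str, str]:
    """Turn 'nav:navigation,content:text' into {'nav': 'navigation', 'content': 'text'}."""
    mapping = {}
    for pair in arg.split(","):
        if not pair.strip():
            continue  # tolerate a trailing comma
        tag, sep, module = pair.partition(":")
        tag, module = tag.strip(), module.strip()
        if not sep or not tag or module not in MODULE_TYPES:
            raise ValueError(f"bad mapping entry: {pair!r}")
        mapping[tag] = module
    return mapping


print(parse_mapping("nav:navigation,content:text,sidebar:skip"))
```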

Output Formats

Default: Module-based (htmlbuilder_import.json)

{
  "pages": [
    {
      "name": "About",
      "modules": [
        {"type": "navigation", "data": {...}},
        {"type": "text", "data": {...}}
      ]
    }
  ]
}

Tag Groups (--tag-groups → tag_groups.json)

{
  "tag_groups": {
    "sitename-nav": {
      "context_aware": false,
      "value": {"items": [...]}
    },
    "sitename-h1s": {
      "context_aware": true,
      "by_page": {
        "home": {"value": "Welcome"},
        "about": {"value": "About Us"}
      }
    }
  }
}

When to Use Each

Format        Use when
Module-based  Building static pages, each page independent
Tag Groups    Building dynamic pages, content varies by context

Tag Groups: Context Awareness

Global tags (same on all pages):

  • nav, navigation, footer, sidebar, header

Page-specific tags (vary per page):

  • hero, content, page_header, title, gallery items

The converter auto-detects context-awareness by checking if values differ across pages.
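
One way to picture that check: collect a tag's value on every page and see whether more than one distinct value turns up. The sketch below emits entries in the tag_groups.json shape shown earlier; `build_tag_group` is a hypothetical name, and the real converter may compare values differently.

```python
import json


def build_tag_group(values_by_page: dict) -> dict:
    """Return a tag group entry, context-aware only if values differ across pages."""
    # Serialize values so lists and dicts can be compared in a set.
    distinct = {json.dumps(v, sort_keys=True) for v in values_by_page.values()}
    if len(distinct) <= 1:
        value = next(iter(values_by_page.values()), None)
        return {"context_aware": False, "value": value}
    return {
        "context_aware": True,
        "by_page": {page: {"value": v} for page, v in values_by_page.items()},
    }


nav = {"items": ["Home", "About"]}
print(build_tag_group({"home": nav, "about": nav}))
print(build_tag_group({"home": "Welcome", "about": "About Us"}))
```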

Module types:

Type        For
navigation  Nav menus
hero        Hero sections
text        Content blocks
image       Single images
gallery     Image galleries
cards       Card grids
video       Video embeds
footer      Footers
html        Raw HTML fallback
skip        Exclude

Action Reference

Actions are JSON objects with "do": "action_name". Used in action sets.

Action  Example                                               What it does
goto    {"do": "goto", "wait": "networkidle"}                 Load page
goto    {"do": "goto", "url": "https://...", "wait": "load"}  Load specific URL
wait    {"do": "wait", "seconds": 2}                          Wait N seconds

Wait conditions: load, domcontentloaded, networkidle

Mouse Actions

Action    Example                                                  What it does
click     {"do": "click", "selector": ".btn"}                      Click element
click     {"do": "click", "selector": ".maybe", "optional": true}  Click if exists
hover     {"do": "hover", "selector": "nav"}                       Hover element
click_xy  {"do": "click_xy", "x": 960, "y": 540}                   Click coordinates
hover_xy  {"do": "hover_xy", "x": 100, "y": 50}                    Hover coordinates

Input

Action  Example                                                 What it does
type    {"do": "type", "selector": "#search", "text": "query"}  Type text
press   {"do": "press", "key": "Enter"}                         Press key

Scrolling

Action  Example                           What it does
scroll  {"do": "scroll", "to": "bottom"}  Scroll to bottom
scroll  {"do": "scroll", "to": "top"}     Scroll to top
scroll  {"do": "scroll", "to": 500}       Scroll to pixel

Content Extraction

Action           What it does
extract          Save page HTML
hover_extract    Hover, capture what appears
click_extract    Click, capture modal/overlay
extract_gallery  Click through gallery, get all images

hover_extract:

{
  "do": "hover_extract",
  "selector": ".nav-item",
  "wait": 0.5,
  "extract": ".dropdown-menu",
  "name": "nav_dropdown"
}

click_extract:

{
  "do": "click_extract",
  "selector": ".thumbnail",
  "wait": 1,
  "extract": ".modal",
  "close": ".modal-close",
  "name": "modal_content"
}

extract_gallery:

{
  "do": "extract_gallery",
  "config": {
    "first_item": ".gallery-item",
    "modal": "#imageModal",
    "next_btn": ".modal-next",
    "close_btn": ".modal-close",
    "image": "#modalImage",
    "caption": "#modalCaption",
    "counter": "#modalCounter",
    "max_items": 100
  }
}

Debug

Action      Example                                What it does
screenshot  {"do": "screenshot", "name": "step1"}  Save screenshot
eval        {"do": "eval", "js": "..."}            Run JavaScript

Viewport Config

Set in action set (for pixel-accurate clicking):

{
  "name": "my_action_set",
  "viewport_width": 1920,
  "viewport_height": 1080,
  "device_scale": 1,
  "actions": [...]
}

Custom Action Set Example

Save as jobs/{domain}/action-sets/my_custom.json:

{
  "name": "wordpress_elementor",
  "description": "Handle Elementor sites",
  "actions": [
    {"do": "goto", "wait": "networkidle"},
    {"do": "click", "selector": ".elementor-popup .close", "optional": true},
    {"do": "wait", "seconds": 1},
    {"do": "scroll", "to": "bottom"},
    {"do": "wait", "seconds": 2},
    {"do": "extract"}
  ]
}
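
Before running a custom action set, it can help to sanity-check the JSON. The sketch below is a hypothetical pre-flight check, not part of the runner: the required keys per action are inferred from the examples in this README, and only the documented wait conditions are accepted for goto.

```python
import json

WAIT_CONDITIONS = {"load", "domcontentloaded", "networkidle"}

# Keys each action appears to need, inferred from the examples in this README.
REQUIRED_KEYS = {
    "click": {"selector"}, "hover": {"selector"},
    "click_xy": {"x", "y"}, "hover_xy": {"x", "y"},
    "type": {"selector", "text"}, "press": {"key"},
    "scroll": {"to"}, "wait": {"seconds"},
}


def check_action_set(action_set: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    actions = action_set.get("actions")
    if not isinstance(actions, list) or not actions:
        return ["'actions' must be a non-empty list"]
    problems = []
    for i, action in enumerate(actions):
        name = action.get("do")
        if not name:
            problems.append(f"action {i}: missing 'do'")
            continue
        missing = REQUIRED_KEYS.get(name, set()) - set(action)
        if missing:
            problems.append(f"action {i} ({name}): missing {sorted(missing)}")
        if name == "goto" and "wait" in action and action["wait"] not in WAIT_CONDITIONS:
            problems.append(f"action {i} (goto): unknown wait condition {action['wait']!r}")
    return problems


elementor = json.loads("""
{
  "name": "wordpress_elementor",
  "actions": [
    {"do": "goto", "wait": "networkidle"},
    {"do": "click", "selector": ".elementor-popup .close", "optional": true},
    {"do": "wait", "seconds": 1},
    {"do": "scroll", "to": "bottom"},
    {"do": "extract"}
  ]
}
""")
print(check_action_set(elementor))  # []
```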

File Formats

urls.json

{
  "site": {"url": "...", "domain": "..."},
  "urls": [
    {"url": "...", "title": "...", "page_type": "article", "skip": false}
  ]
}

tags.json

{
  "global": {
    "nav": ".selector"
  },
  "page_types": {
    "article": {
      "content": ".selector"
    }
  }
}
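
A sketch of how the selectors for one page type might be resolved from this file: start with the global tags, then layer the page-type tags on top. That a page-type tag wins over a global tag of the same name is an assumption, and `selectors_for` is a hypothetical helper.

```python
def selectors_for(tags: dict, page_type: str) -> dict[str, str]:
    """Merge global selectors with those of one page type."""
    merged = dict(tags.get("global", {}))
    # Assumption: page-type tags override global tags with the same name.
    merged.update(tags.get("page_types", {}).get(page_type, {}))
    return merged


tags = {
    "global": {"nav": ".main-navigation"},
    "page_types": {
        "article": {"content": ".entry-content"},
        "gallery": {"gallery_grid": ".gallery-container"},
    },
}

print(selectors_for(tags, "article"))  # {'nav': '.main-navigation', 'content': '.entry-content'}
print(selectors_for(tags, "unknown"))  # {'nav': '.main-navigation'}
```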

Action Set

{
  "name": "...",
  "viewport_width": 1400,
  "viewport_height": 900,
  "actions": [
    {"do": "goto"},
    {"do": "extract"}
  ]
}

Troubleshooting

Problem                        Fix
“No module named playwright”   Delete the venv (rm -rf venv), then run again
Selector finds nothing         Test in browser: document.querySelector('.x')
Gallery loops forever          Set max_items in config
Pixel clicks miss              Match viewport to browser exactly
Astro/React breaks navigation  Use wait between actions
Permission denied              chmod +x start.sh
Job stopped partway            Run again; the runner auto-resumes
Want to re-fetch everything    ./start.sh runner JOB --reset --all
Some pages failed              ./start.sh runner JOB --retry-failed

Full Example: Dragon Lady Site

# 1. Find URLs
./start.sh discover https://dragonladysf.com

# 2. Organize
./start.sh curate jobs/dragonladysf_com/
# type-pattern /gallery/ gallery
# type-pattern /explore/ gallery
# type 0 homepage

# 3. Define selectors
./start.sh tagger jobs/dragonladysf_com/
# add nav .vertical-nav
# add hero .hero lowerlevel
# add content .full-content lowerlevel
# add gallery_grid .gallery-container gallery
# add modal #imageModal gallery

# 4. Preview
./start.sh preview jobs/dragonladysf_com/ --sample=3

# 5. Fetch (different action sets per type)
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=homepage
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=lowerlevel
./start.sh runner jobs/dragonladysf_com/ --action-set=gallery_modal --type=gallery

# 6. Extract
./start.sh extract jobs/dragonladysf_com/ --clean

# 7. Convert
./start.sh convert jobs/dragonladysf_com/ --interactive

Files in Package

accid/
├── start.sh           # Entry point (handles venv + deps)
├── README.md          # This file
├── discover.py        # URL discovery + platform detection
├── platforms.py       # Platform-specific structure extraction
├── curate.py          # URL organization
├── tagger.py          # Selector management
├── preview.py         # Visual QA
├── runner.py          # Page fetching
├── extract.py         # Content extraction
├── convert.py         # Module mapping + tag groups
├── vault_export.py    # Vault-ready export generation
├── example_tags.json  # Sample tags file
├── example_tag_groups.json  # Sample tag groups output
└── action-sets/       # Built-in action sets
    ├── gallery_modal.json
    ├── nav_hover.json
    ├── click_modal.json
    └── pixel_click.json

Vault Export

Generate everything Vault needs to rebuild “dynamic” pages as static files.

./start.sh vault-export jobs/example_com/

Requirements

  1. Run discover with --full-structure:

    ./start.sh discover https://site.com --full-structure
    
  2. Run convert with --tag-groups:

    ./start.sh convert jobs/site_com/ --tag-groups
    
  3. Generate Vault export:

    python vault_export.py jobs/site_com/
    

What It Generates

Content          Description
tag_groups       Context-aware content bindings
taxonomies       Categories, tags, authors
relationships    Posts mapped to categories/tags/authors
generated_pages  Category archives, tag archives, date archives
navigation       Main nav, category nav, footer
search_index     Full-text search data

Generated Archive Pages

For a WordPress site with 5 categories and 10 tags, Vault export creates:

  • 5 category archive pages (/category/tech/, /category/news/, etc.)
  • 10 tag archive pages (/tag/python/, /tag/javascript/, etc.)
  • Author archive pages (/author/john-doe/)
  • Year archives (/archive/2024/)
  • Month archives (/archive/2024/01/)

Each archive page includes:

  • Page metadata (title, description, post count)
  • List of posts in that archive
  • Template hint for rendering
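
The archive paths above can be derived from post metadata alone. The following sketch groups post slugs under each archive path; the post record fields (categories, tags, author, ISO date) are hypothetical, and vault_export.py may organize its data differently.

```python
from collections import defaultdict


def archive_pages(posts: list[dict]) -> dict[str, list[str]]:
    """Map each archive path to the slugs of the posts it should list."""
    pages = defaultdict(list)
    for post in posts:
        slug = post["slug"]
        for category in post.get("categories", []):
            pages[f"/category/{category}/"].append(slug)
        for tag in post.get("tags", []):
            pages[f"/tag/{tag}/"].append(slug)
        if post.get("author"):
            pages[f"/author/{post['author']}/"].append(slug)
        # Assumes an ISO date string such as "2024-01-15".
        year, month = post["date"][:4], post["date"][5:7]
        pages[f"/archive/{year}/"].append(slug)
        pages[f"/archive/{year}/{month}/"].append(slug)
    return dict(pages)


posts = [
    {"slug": "intro", "categories": ["tech"], "tags": ["python"],
     "author": "john-doe", "date": "2024-01-15"},
    {"slug": "update", "categories": ["tech", "news"], "tags": [],
     "author": "john-doe", "date": "2024-02-03"},
]

for path, slugs in sorted(archive_pages(posts).items()):
    print(path, slugs)
```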

How Vault Uses This

// In Vault: Load the export
const vaultData = await fetch('/vault_export.json').then(r => r.json());

// Get all posts in a category
const techPosts = vaultData.relationships.posts_by_category['tech'];

// Generate static category page
const categoryPage = {
  title: vaultData.taxonomies.categories['tech'].name,
  posts: techPosts,
  template: 'archive'
};

// Context-aware content
const pageH1 = vaultData.tag_groups['site-h1s'].by_page[currentPage];

The Key Insight

No database needed. All relationships are pre-computed:

  • Post → Categories mapping: posts_by_category
  • Post → Tags mapping: posts_by_tag
  • Category hierarchy: category_hierarchy
  • Search index: search_index

Vault just reads JSON and renders. “Dynamic” pages are actually static files generated from this data.