ACCID Pipeline

Site content extraction using Playwright.

Extract content from websites → organize by page type → output for HTML Builder.


Table of Contents

  1. Quick Start
  2. Pipeline Overview
  3. Step 1: Discover
  4. Step 2: Curate
  5. Step 3: Tagger
  6. Step 4: Preview
  7. Step 5: Runner
  8. Step 6: Extract
  9. Step 7: Convert
  10. Action Reference
  11. File Formats
  12. Troubleshooting

Quick Start

chmod +x start.sh
./start.sh discover https://example.com
./start.sh curate jobs/example_com/
./start.sh tagger jobs/example_com/
./start.sh preview jobs/example_com/
./start.sh runner jobs/example_com/ --all
./start.sh extract jobs/example_com/
./start.sh convert jobs/example_com/

First run auto-installs: Playwright, Chromium, beautifulsoup4, openpyxl, pillow.


Pipeline Overview

DISCOVER → CURATE → TAGGER → PREVIEW → RUNNER → EXTRACT → CONVERT
   │         │         │        │         │         │         │
 Find      Organize  Define    Test     Fetch    Apply     Map to
 URLs      by type  selectors  first    pages    tags     modules

Step 1: Discover

Find all URLs on a site, optionally with full platform structure.

./start.sh discover URL [options]

Option            Default                  What it does
--max-pages=N     200                      Limit crawl
--sitemap-only    -                        Only parse sitemaps
--crawl-only      -                        Only follow links
--full-structure  -                        Extract categories, tags, posts (WordPress)
--output=FILE     jobs/{domain}/urls.json  Custom output
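
The default output path and the job directories used throughout this README (jobs/example_com/, jobs/dragonladysf_com/) suggest that the job directory is the site's host name with dots replaced by underscores. A minimal sketch of that naming, assuming nothing else is rewritten (job_dir_for is a hypothetical helper, not a function from discover.py; how the real tool treats "www." prefixes, ports, or hyphens is not documented here):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def job_dir_for(url: str) -> PurePosixPath:
    """Hypothetical helper: guess the job directory for a site URL.

    Assumption: the directory is the host name with "." replaced by "_".
    """
    host = urlparse(url).hostname or ""
    return PurePosixPath("jobs") / host.replace(".", "_")


example_dir = job_dir_for("https://example.com")
urls_file = example_dir / "urls.json"
```

Under this assumption, urls_file is jobs/example_com/urls.json, which matches the default shown in the table.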

Platform Detection

Automatically detects: WordPress, Shopify, Squarespace, Wix, Webflow, Ghost, Drupal, Astro, Next.js

Full Structure Mode (WordPress)

./start.sh discover https://wordpress-site.com --full-structure

Uses WordPress REST API to extract:

  • Categories with hierarchy
  • Tags
  • Authors
  • Posts mapped to taxonomies
  • Pages with parent/child relationships

This enables Vault to regenerate category pages, tag pages, archives – all the “dynamic” pages as static files.

Output with --full-structure:

{
  "site": {"platform": "wordpress"},
  "urls": [...],
  "structure": {
    "categories": {"tech": {"name": "Technology", "count": 45}},
    "tags": {"python": {"name": "Python", "count": 28}},
    "posts_by_category": {"tech": [{"title": "...", "slug": "..."}]},
    "posts_by_tag": {"python": [...]},
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}}
  }
}
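
The structure block above can be read without touching WordPress again. The sketch below collects the posts of a category together with the posts of its child categories. It assumes that posts_by_category lists only the posts assigned directly to each category, which this README does not confirm; posts_in_category and the sample data are illustrative only.

```python
import json


def posts_in_category(data, slug, _seen=None):
    """Return posts for a category plus its children (hypothetical helper).

    Assumption: "posts_by_category" holds only directly assigned posts,
    so child categories have to be walked via "category_hierarchy".
    """
    seen = _seen if _seen is not None else set()
    if slug in seen:  # guard against a cyclic hierarchy
        return []
    seen.add(slug)
    structure = data.get("structure", {})
    posts = list(structure.get("posts_by_category", {}).get(slug, []))
    children = structure.get("category_hierarchy", {}).get(slug, {}).get("children", [])
    for child in children:
        posts.extend(posts_in_category(data, child, seen))
    return posts


# Sample data in the shape shown above (titles and slugs are made up).
sample = json.loads("""
{
  "structure": {
    "posts_by_category": {
      "tech": [{"title": "Tech roundup", "slug": "tech-roundup"}],
      "ai": [{"title": "AI notes", "slug": "ai-notes"}]
    },
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}}
  }
}
""")

titles = [post["title"] for post in posts_in_category(sample, "tech")]
```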

Examples:

./start.sh discover https://example.com
./start.sh discover https://example.com --max-pages=50
./start.sh discover https://wordpress-site.com --sitemap-only

Step 2: Curate

Organize URLs and assign page types.

./start.sh curate JOB_DIR [options]

Option                             What it does
--list                             Show all URLs
--auto                             Auto-detect page types
--add-url=URL                      Add URL manually
--skip-pattern=PATTERN             Skip matching URLs
--set-type=TYPE --pattern=PATTERN  Set type for matches

Interactive commands:

Command                    What it does
list                       Show all URLs
list TYPE                  Show URLs of type
add URL [type]             Add URL
skip N                     Skip URL #N
unskip N                   Unskip URL #N
type N TYPE                Set type for URL #N
skip-pattern PATTERN       Skip all matching
type-pattern PATTERN TYPE  Set type for all matching
auto                       Auto-detect types
summary                    Show type counts
save                       Save
quit                       Save and exit

Example session:

./start.sh curate jobs/example_com/

curator> auto
curator> type-pattern /gallery/ gallery
curator> skip-pattern /old-blog/
curator> summary
curator> save
curator> quit
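
To make the effect of the session above concrete, here is a rough sketch of what type-pattern, skip-pattern, and summary could do to the entries of urls.json. It assumes that PATTERN is matched as a plain substring of the URL; the real curator may use globbing or regular expressions instead. The function names mirror the commands but are not taken from curate.py.

```python
from collections import Counter


def type_pattern(urls, pattern, page_type):
    """Set page_type on every entry whose URL contains pattern (assumed substring match)."""
    changed = 0
    for entry in urls:
        if pattern in entry["url"]:
            entry["page_type"] = page_type
            changed += 1
    return changed


def skip_pattern(urls, pattern):
    """Mark every entry whose URL contains pattern as skipped."""
    changed = 0
    for entry in urls:
        if pattern in entry["url"]:
            entry["skip"] = True
            changed += 1
    return changed


def summary(urls):
    """Count non-skipped URLs per page type."""
    return dict(Counter(e["page_type"] for e in urls if not e.get("skip")))


urls = [
    {"url": "https://example.com/", "page_type": "page", "skip": False},
    {"url": "https://example.com/gallery/summer", "page_type": "page", "skip": False},
    {"url": "https://example.com/old-blog/2019", "page_type": "page", "skip": False},
]
typed = type_pattern(urls, "/gallery/", "gallery")
skipped = skip_pattern(urls, "/old-blog/")
counts = summary(urls)
```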

Step 3: Tagger

Define CSS selectors for content.

./start.sh tagger JOB_DIR [options]

Option                       What it does
--show                       Show current tags
--edit                       Open tags.json in editor
--add NAME SELECTOR          Add global tag
--add-to TYPE NAME SELECTOR  Add page-type tag
--remove NAME [TYPE]         Remove tag

Interactive commands:

Command                 What it does
add NAME SELECTOR       Add global tag
add NAME SELECTOR TYPE  Add to page type
remove NAME [TYPE]      Remove tag
show                    Display all tags
edit                    Open in editor
quit                    Exit

Example:

./start.sh tagger jobs/example_com/

tagger> add nav .main-navigation
tagger> add hero .hero-section homepage
tagger> add content .entry-content article
tagger> add gallery_grid .gallery-container gallery
tagger> show
tagger> quit

Selector quick reference:

Pattern             Matches
#myId               ID
.myClass            Class
.class1.class2      Multiple classes
.parent .child      Nested
.parent > .child    Direct child
nav, .navigation    Either (fallback)
[data-type="hero"]  Attribute

Step 4: Preview

Visual QA before full run.

./start.sh preview JOB_DIR [options]

Option            Default  What it does
--sample=N        5        Pages to preview
--type=TYPE       all      Only this type
--urls=URL1,URL2  -        Specific URLs

Output: preview/preview.xlsx with screenshots + extraction results per tag.

What to check:

  • ✓ = selector found content
  • ✗ = selector found nothing (fix selector)
  • Too much text = selector too broad
  • Missing content = selector too narrow

Step 5: Runner

Fetch pages with Playwright.

./start.sh runner JOB_DIR [options]

Option             Default  What it does
--action-set=NAME  basic    Which actions to run
--type=TYPE        -        Only this page type
--all              -        All non-skipped URLs
--test-url=URL     -        Single URL test
--retry-failed     -        Retry previously failed URLs
--reset            -        Clear fetched.json, start fresh

Resume support: Runner automatically skips already-completed URLs. Just run again to continue where you left off.
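
The resume behaviour can be pictured as a filter over urls.json. The sketch below is only an illustration: the layout of fetched.json is not documented in this README, so it assumes a mapping from URL to a record with a "status" of "ok" or "failed", and it assumes that failed URLs are left alone unless --retry-failed is given. pending_urls is a hypothetical name.

```python
def pending_urls(urls, fetched, retry_failed=False):
    """Return the URLs a run would still fetch.

    Assumptions (not confirmed by the README):
    - fetched maps url -> {"status": "ok" | "failed"}
    - failed URLs are only retried when retry_failed is set
    """
    todo = []
    for entry in urls:
        if entry.get("skip"):
            continue  # curated out, never fetched
        record = fetched.get(entry["url"])
        if record is None:
            todo.append(entry["url"])  # not attempted yet
        elif retry_failed and record.get("status") == "failed":
            todo.append(entry["url"])
    return todo


urls = [
    {"url": "https://example.com/", "skip": False},
    {"url": "https://example.com/about", "skip": False},
    {"url": "https://example.com/contact", "skip": False},
    {"url": "https://example.com/old", "skip": True},
]
fetched = {
    "https://example.com/": {"status": "ok"},
    "https://example.com/about": {"status": "failed"},
}
resume = pending_urls(urls, fetched)
retry = pending_urls(urls, fetched, retry_failed=True)
```

With this model, --reset corresponds to starting again with an empty fetched mapping.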

Built-in action sets:

Name           What it does
basic          Load + dismiss cookies
homepage       Load + hover nav
article        Load + click “read more”
scroll_load    Scroll 3x (lazy loading)
age_gate       Handle age verification
gallery_modal  Click through gallery images
nav_hover      Extract dropdown menus
click_modal    Click and capture modals
pixel_click    Click at coordinates

Example workflow:

# Different action sets per page type
./start.sh runner jobs/example_com/ --action-set=homepage --type=homepage
./start.sh runner jobs/example_com/ --action-set=article --type=article  
./start.sh runner jobs/example_com/ --action-set=gallery_modal --type=gallery

# Or just basic for everything
./start.sh runner jobs/example_com/ --all

Step 6: Extract

Apply tags to saved HTML.

./start.sh extract JOB_DIR [options]

Option   What it does
--clean  Remove scripts, tracking, hidden elements

For each tag, extracts:

  • found – true/false
  • html – Raw HTML
  • text – Text only
  • links – Array of {text, href}
  • images – Array of {src, alt}
  • videos – Array of {src, type}

Step 7: Convert

Map tags to HTML Builder modules or generate semantic tag groups.

./start.sh convert JOB_DIR [options]

Option                 What it does
--interactive          Prompt for each tag
--mapping=TAG:MOD,...  Set from command line
--tag-groups           Generate semantic tag groups (new format)
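
The --mapping value is a comma-separated list of TAG:MOD pairs. A minimal sketch of how such a string could be parsed and checked against the module types listed later in this section (parse_mapping is a hypothetical helper, and the real converter may accept or reject input differently):

```python
# Module types as listed in this README.
MODULE_TYPES = {
    "navigation", "hero", "text", "image", "gallery",
    "cards", "video", "footer", "html", "skip",
}


def parse_mapping(spec):
    """Turn "nav:navigation,content:text" into {"nav": "navigation", "content": "text"}."""
    mapping = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        tag, sep, module = pair.partition(":")
        if not sep or not tag or module not in MODULE_TYPES:
            raise ValueError(f"bad mapping entry: {pair!r}")
        mapping[tag] = module
    return mapping


mapping = parse_mapping("nav:navigation,hero:hero,content:text")
```

A call such as ./start.sh convert jobs/example_com/ --mapping=nav:navigation,hero:hero,content:text would then correspond to the mapping dictionary built above.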

Output Formats

Default: Module-based (htmlbuilder_import.json)

{
  "pages": [
    {
      "name": "About",
      "modules": [
        {"type": "navigation", "data": {...}},
        {"type": "text", "data": {...}}
      ]
    }
  ]
}

Tag Groups (--tag-groups, output: tag_groups.json)

{
  "tag_groups": {
    "sitename-nav": {
      "context_aware": false,
      "value": {"items": [...]}
    },
    "sitename-h1s": {
      "context_aware": true,
      "by_page": {
        "home": {"value": "Welcome"},
        "about": {"value": "About Us"}
      }
    }
  }
}

When to Use Each

Format        Use When
Module-based  Building static pages, each page independent
Tag Groups    Building dynamic pages, content varies by context

Tag Groups: Context Awareness

Global tags (same on all pages):

  • nav, navigation, footer, sidebar, header

Page-specific tags (vary per page):

  • hero, content, page_header, title, gallery items

The converter auto-detects context-awareness by checking if values differ across pages.
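
The detection rule can be sketched in a few lines: if every page yields the same value for a tag, emit a single global value; otherwise emit one entry per page. This is an illustration in the shape of the tag_groups.json example, not the code from convert.py, and build_tag_group is a hypothetical name.

```python
import json


def build_tag_group(values_by_page):
    """Build one tag group from {page_name: extracted_value}."""
    # Serialise values so that dicts and lists can be compared and de-duplicated.
    distinct = {json.dumps(v, sort_keys=True) for v in values_by_page.values()}
    if len(distinct) <= 1:
        value = next(iter(values_by_page.values()), None)
        return {"context_aware": False, "value": value}
    return {
        "context_aware": True,
        "by_page": {page: {"value": v} for page, v in values_by_page.items()},
    }


nav_group = build_tag_group({
    "home": {"items": ["Home", "About"]},
    "about": {"items": ["Home", "About"]},
})
h1_group = build_tag_group({"home": "Welcome", "about": "About Us"})
```

Here nav_group comes out global because both pages share the same menu, while h1_group becomes context-aware with one value per page.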

Module types:

Type        For
navigation  Nav menus
hero        Hero sections
text        Content blocks
image       Single images
gallery     Image galleries
cards       Card grids
video       Video embeds
footer      Footers
html        Raw HTML fallback
skip        Exclude

Action Reference

Actions are JSON objects with "do": "action_name". Used in action sets.

Action  Example                                               What it does
goto    {"do": "goto", "wait": "networkidle"}                 Load page
goto    {"do": "goto", "url": "https://...", "wait": "load"}  Load specific URL
wait    {"do": "wait", "seconds": 2}                          Wait N seconds

Wait conditions: load, domcontentloaded, networkidle

Mouse Actions

Action    Example                                                  What it does
click     {"do": "click", "selector": ".btn"}                      Click element
click     {"do": "click", "selector": ".maybe", "optional": true}  Click if exists
hover     {"do": "hover", "selector": "nav"}                       Hover element
click_xy  {"do": "click_xy", "x": 960, "y": 540}                   Click coordinates
hover_xy  {"do": "hover_xy", "x": 100, "y": 50}                    Hover coordinates

Input

Action  Example                                                 What it does
type    {"do": "type", "selector": "#search", "text": "query"}  Type text
press   {"do": "press", "key": "Enter"}                         Press key

Scrolling

Action  Example                           What it does
scroll  {"do": "scroll", "to": "bottom"}  Scroll to bottom
scroll  {"do": "scroll", "to": "top"}     Scroll to top
scroll  {"do": "scroll", "to": 500}       Scroll to pixel

Content Extraction

Action           What it does
extract          Save page HTML
hover_extract    Hover, capture what appears
click_extract    Click, capture modal/overlay
extract_gallery  Click through gallery, get all images

hover_extract:

{
  "do": "hover_extract",
  "selector": ".nav-item",
  "wait": 0.5,
  "extract": ".dropdown-menu",
  "name": "nav_dropdown"
}

click_extract:

{
  "do": "click_extract",
  "selector": ".thumbnail",
  "wait": 1,
  "extract": ".modal",
  "close": ".modal-close",
  "name": "modal_content"
}

extract_gallery:

{
  "do": "extract_gallery",
  "config": {
    "first_item": ".gallery-item",
    "modal": "#imageModal",
    "next_btn": ".modal-next",
    "close_btn": ".modal-close",
    "image": "#modalImage",
    "caption": "#modalCaption",
    "counter": "#modalCounter",
    "max_items": 100
  }
}

Debug

Action      Example                                What it does
screenshot  {"do": "screenshot", "name": "step1"}  Save screenshot
eval        {"do": "eval", "js": "..."}            Run JavaScript

Viewport Config

Set in action set (for pixel-accurate clicking):

{
  "name": "my_action_set",
  "viewport_width": 1920,
  "viewport_height": 1080,
  "device_scale": 1,
  "actions": [...]
}

Custom Action Set Example

Save as jobs/{domain}/action-sets/my_custom.json:

{
  "name": "wordpress_elementor",
  "description": "Handle Elementor sites",
  "actions": [
    {"do": "goto", "wait": "networkidle"},
    {"do": "click", "selector": ".elementor-popup .close", "optional": true},
    {"do": "wait", "seconds": 1},
    {"do": "scroll", "to": "bottom"},
    {"do": "wait", "seconds": 2},
    {"do": "extract"}
  ]
}
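
Before running a custom action set, it can help to check it for typos. The sketch below validates an action set against the action names documented in this reference. The required keys are inferred from the examples above and are an assumption, since the runner may treat some of them as optional; check_action_set is a hypothetical helper.

```python
# Action name -> keys assumed to be required (inferred from the examples in this README).
REQUIRED_KEYS = {
    "goto": [], "wait": ["seconds"],
    "click": ["selector"], "hover": ["selector"],
    "click_xy": ["x", "y"], "hover_xy": ["x", "y"],
    "type": ["selector", "text"], "press": ["key"],
    "scroll": ["to"], "extract": [],
    "hover_extract": ["selector", "extract"],
    "click_extract": ["selector", "extract"],
    "extract_gallery": ["config"],
    "screenshot": ["name"], "eval": ["js"],
}


def check_action_set(action_set):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for i, action in enumerate(action_set.get("actions", [])):
        name = action.get("do")
        if name not in REQUIRED_KEYS:
            problems.append(f"action {i}: unknown action {name!r}")
            continue
        missing = [key for key in REQUIRED_KEYS[name] if key not in action]
        if missing:
            problems.append(f"action {i} ({name}): missing {', '.join(missing)}")
    return problems


elementor = {
    "name": "wordpress_elementor",
    "actions": [
        {"do": "goto", "wait": "networkidle"},
        {"do": "click", "selector": ".elementor-popup .close", "optional": True},
        {"do": "wait", "seconds": 1},
        {"do": "scroll", "to": "bottom"},
        {"do": "wait", "seconds": 2},
        {"do": "extract"},
    ],
}
broken = {"name": "broken", "actions": [{"do": "clik", "selector": ".btn"}, {"do": "click_xy", "x": 10}]}

ok_problems = check_action_set(elementor)
bad_problems = check_action_set(broken)
```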

File Formats

urls.json

{
  "site": {"url": "...", "domain": "..."},
  "urls": [
    {"url": "...", "title": "...", "page_type": "article", "skip": false}
  ]
}

tags.json

{
  "global": {
    "nav": ".selector"
  },
  "page_types": {
    "article": {
      "content": ".selector"
    }
  }
}
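
One plausible reading of this file is that a page gets every global tag plus the tags of its own page type. The sketch below merges the two, assuming that a page-type tag wins when both levels define the same name; that precedence is a guess, and tags_for is a hypothetical helper.

```python
def tags_for(tags, page_type):
    """Return {tag_name: selector} for one page type.

    Assumption: page-type tags override global tags of the same name.
    """
    merged = dict(tags.get("global", {}))
    merged.update(tags.get("page_types", {}).get(page_type, {}))
    return merged


tags = {
    "global": {"nav": ".main-navigation"},
    "page_types": {
        "article": {"content": ".entry-content"},
        "homepage": {"hero": ".hero-section"},
    },
}
article_tags = tags_for(tags, "article")
unknown_tags = tags_for(tags, "landing")
```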

Action Set

{
  "name": "...",
  "viewport_width": 1400,
  "viewport_height": 900,
  "actions": [
    {"do": "goto"},
    {"do": "extract"}
  ]
}

Troubleshooting

Problem                        Fix
“No module named playwright”   rm -rf venv then run again
Selector finds nothing         Test in browser: document.querySelector('.x')
Gallery loops forever          Set max_items in config
Pixel clicks miss              Match viewport to browser exactly
Astro/React breaks navigation  Use wait between actions
Permission denied              chmod +x start.sh
Job stopped partway            Just run again – auto-resumes
Want to re-fetch everything    ./start.sh runner JOB --reset --all
Some pages failed              ./start.sh runner JOB --retry-failed

Full Example: Dragon Lady Site

# 1. Find URLs
./start.sh discover https://dragonladysf.com

# 2. Organize
./start.sh curate jobs/dragonladysf_com/
# type-pattern /gallery/ gallery
# type-pattern /explore/ gallery
# type 0 homepage

# 3. Define selectors
./start.sh tagger jobs/dragonladysf_com/
# add nav .vertical-nav
# add hero .hero lowerlevel
# add content .full-content lowerlevel
# add gallery_grid .gallery-container gallery
# add modal #imageModal gallery

# 4. Preview
./start.sh preview jobs/dragonladysf_com/ --sample=3

# 5. Fetch (different action sets per type)
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=homepage
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=lowerlevel
./start.sh runner jobs/dragonladysf_com/ --action-set=gallery_modal --type=gallery

# 6. Extract
./start.sh extract jobs/dragonladysf_com/ --clean

# 7. Convert
./start.sh convert jobs/dragonladysf_com/ --interactive

Files in Package

accid/
├── start.sh           # Entry point (handles venv + deps)
├── README.md          # This file
├── discover.py        # URL discovery + platform detection
├── platforms.py       # Platform-specific structure extraction
├── curate.py          # URL organization
├── tagger.py          # Selector management
├── preview.py         # Visual QA
├── runner.py          # Page fetching
├── extract.py         # Content extraction
├── convert.py         # Module mapping + tag groups
├── vault_export.py    # Vault-ready export generation
├── example_tags.json  # Sample tags file
├── example_tag_groups.json  # Sample tag groups output
└── action-sets/       # Built-in action sets
    ├── gallery_modal.json
    ├── nav_hover.json
    ├── click_modal.json
    └── pixel_click.json

Vault Export

Generate everything Vault needs to rebuild “dynamic” pages as static files.

./start.sh vault-export jobs/example_com/

Requirements

  1. Run discover with --full-structure:

    ./start.sh discover https://site.com --full-structure
    
  2. Run convert with --tag-groups:

    ./start.sh convert jobs/site_com/ --tag-groups
    
  3. Generate Vault export:

    python vault_export.py jobs/site_com/
    

What It Generates

Content          Description
tag_groups       Context-aware content bindings
taxonomies       Categories, tags, authors
relationships    Posts mapped to categories/tags/authors
generated_pages  Category archives, tag archives, date archives
navigation       Main nav, category nav, footer
search_index     Full-text search data

Generated Archive Pages

For a WordPress site with 5 categories and 10 tags, Vault export creates:

  • 5 category archive pages (/category/tech/, /category/news/, etc.)
  • 10 tag archive pages (/tag/python/, /tag/javascript/, etc.)
  • Author archive pages (/author/john-doe/)
  • Year archives (/archive/2024/)
  • Month archives (/archive/2024/01/)

Each archive page includes:

  • Page metadata (title, description, post count)
  • List of posts in that archive
  • Template hint for rendering
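
The list of generated archive pages follows directly from the taxonomy slugs and post dates. A rough sketch of that enumeration, using the URL patterns shown above (archive_paths is a hypothetical helper, and the assumption that post dates are available as ISO strings such as "2024-01-15" is not confirmed by this README):

```python
def archive_paths(categories, tags, authors, post_dates):
    """List archive page paths for the given slugs and ISO post dates (YYYY-MM-DD)."""
    paths = [f"/category/{slug}/" for slug in categories]
    paths += [f"/tag/{slug}/" for slug in tags]
    paths += [f"/author/{slug}/" for slug in authors]
    years = sorted({date[:4] for date in post_dates})
    months = sorted({(date[:4], date[5:7]) for date in post_dates})
    paths += [f"/archive/{year}/" for year in years]
    paths += [f"/archive/{year}/{month}/" for year, month in months]
    return paths


paths = archive_paths(
    categories=["tech", "news"],
    tags=["python"],
    authors=["john-doe"],
    post_dates=["2024-01-15", "2024-01-20", "2024-03-02"],
)
```

For this input the result contains two category pages, one tag page, one author page, one year archive, and two month archives.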

How Vault Uses This

// In Vault: Load the export
const vaultData = await fetch('/vault_export.json').then(r => r.json());

// Get all posts in a category
const techPosts = vaultData.relationships.posts_by_category['tech'];

// Generate static category page
const categoryPage = {
  title: vaultData.taxonomies.categories['tech'].name,
  posts: techPosts,
  template: 'archive'
};

// Context-aware content (currentPage is the key of the page being rendered, e.g. 'about')
const pageH1 = vaultData.tag_groups['site-h1s'].by_page[currentPage];

The Key Insight

No database needed. All relationships are pre-computed:

  • Post → Categories mapping: posts_by_category
  • Post → Tags mapping: posts_by_tag
  • Category hierarchy: category_hierarchy
  • Search index: search_index

Vault just reads JSON and renders. “Dynamic” pages are actually static files generated from this data.