Site content extraction using Playwright.
Extract content from websites → organize by page type → output for HTML Builder.
Table of Contents
- Quick Start
- Pipeline Overview
- Step 1: Discover
- Step 2: Curate
- Step 3: Tagger
- Step 4: Preview
- Step 5: Runner
- Step 6: Extract
- Step 7: Convert
- Action Reference
- File Formats
- Troubleshooting
Quick Start
chmod +x start.sh
./start.sh discover https://example.com
./start.sh curate jobs/example_com/
./start.sh tagger jobs/example_com/
./start.sh preview jobs/example_com/
./start.sh runner jobs/example_com/ --all
./start.sh extract jobs/example_com/
./start.sh convert jobs/example_com/
First run auto-installs: Playwright, Chromium, beautifulsoup4, openpyxl, pillow.
Pipeline Overview
DISCOVER → CURATE → TAGGER → PREVIEW → RUNNER → EXTRACT → CONVERT
   │         │        │         │        │         │         │
  Find   Organize  Define     Test     Fetch     Apply    Map to
  URLs    by type selectors   first    pages     tags     modules
Step 1: Discover
Find all URLs on a site, optionally with full platform structure.
./start.sh discover URL [options]
| Option | Default | What it does |
|---|---|---|
| `--max-pages=N` | 200 | Limit crawl |
| `--sitemap-only` | – | Only parse sitemaps |
| `--crawl-only` | – | Only follow links |
| `--full-structure` | – | Extract categories, tags, posts (WordPress) |
| `--output=FILE` | `jobs/{domain}/urls.json` | Custom output |
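As a rough illustration of what `--sitemap-only` combined with `--max-pages` implies, the sketch below pulls `<loc>` entries out of a sitemap document using only Python's standard library and stops at a page limit. The function name `urls_from_sitemap` and the sample XML are hypothetical; discover.py may be implemented quite differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, max_pages: int = 200) -> list[str]:
    """Return up to max_pages page URLs found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if len(urls) >= max_pages:
            break
        if loc.text:
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap used only for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(urls_from_sitemap(SAMPLE, max_pages=2))
```

A real crawl would also need to fetch the sitemap over HTTP and follow sitemap index files, which this sketch leaves out.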
Platform Detection
Automatically detects: WordPress, Shopify, Squarespace, Wix, Webflow, Ghost, Drupal, Astro, Next.js
Full Structure Mode (WordPress)
./start.sh discover https://wordpress-site.com --full-structure
Uses WordPress REST API to extract:
- Categories with hierarchy
- Tags
- Authors
- Posts mapped to taxonomies
- Pages with parent/child relationships
This enables Vault to regenerate category pages, tag pages, archives – all the “dynamic” pages as static files.
Output with `--full-structure`:
{
"site": {"platform": "wordpress"},
"urls": [...],
"structure": {
"categories": {"tech": {"name": "Technology", "count": 45}},
"tags": {"python": {"name": "Python", "count": 28}},
"posts_by_category": {"tech": [{"title": "...", "slug": "..."}]},
"posts_by_tag": {"python": [...]},
"category_hierarchy": {"tech": {"children": ["ai", "web"]}}
}
}
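To show how the `structure` block can be consumed, here is a minimal sketch that walks `category_hierarchy` and gathers the posts filed under a category and all of its descendants. The helper names and the sample data are hypothetical; only the key names come from the output shown above.

```python
def category_with_descendants(structure: dict, slug: str) -> list[str]:
    """Return slug followed by every category nested beneath it."""
    hierarchy = structure.get("category_hierarchy", {})
    ordered: list[str] = []
    stack = [slug]
    while stack:
        current = stack.pop()
        if current in ordered:
            continue  # guard against cycles in malformed data
        ordered.append(current)
        children = hierarchy.get(current, {}).get("children", [])
        stack.extend(reversed(children))  # keep the declared child order
    return ordered


def posts_under(structure: dict, slug: str) -> list[dict]:
    """Collect posts from a category and all of its descendants."""
    by_category = structure.get("posts_by_category", {})
    posts: list[dict] = []
    for category in category_with_descendants(structure, slug):
        posts.extend(by_category.get(category, []))
    return posts


# Hypothetical data in the shape of the "structure" block.
SAMPLE_STRUCTURE = {
    "category_hierarchy": {"tech": {"children": ["ai", "web"]}},
    "posts_by_category": {
        "tech": [{"title": "Intro", "slug": "intro"}],
        "ai": [{"title": "Models", "slug": "models"}],
    },
}

if __name__ == "__main__":
    print([p["slug"] for p in posts_under(SAMPLE_STRUCTURE, "tech")])
```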
Examples:
./start.sh discover https://example.com
./start.sh discover https://example.com --max-pages=50
./start.sh discover https://wordpress-site.com --sitemap-only
Step 2: Curate
Organize URLs and assign page types.
./start.sh curate JOB_DIR [options]
| Option | What it does |
|---|---|
| `--list` | Show all URLs |
| `--auto` | Auto-detect page types |
| `--add-url=URL` | Add URL manually |
| `--skip-pattern=PATTERN` | Skip matching URLs |
| `--set-type=TYPE --pattern=PATTERN` | Set type for matches |
Interactive commands:
| Command | What it does |
|---|---|
| `list` | Show all URLs |
| `list TYPE` | Show URLs of type |
| `add URL [type]` | Add URL |
| `skip N` | Skip URL #N |
| `unskip N` | Unskip URL #N |
| `type N TYPE` | Set type for URL #N |
| `skip-pattern PATTERN` | Skip all matching |
| `type-pattern PATTERN TYPE` | Set type for all matching |
| `auto` | Auto-detect types |
| `summary` | Show type counts |
| `save` | Save |
| `quit` | Save and exit |
Example session:
./start.sh curate jobs/example_com/
curator> auto
curator> type-pattern /gallery/ gallery
curator> skip-pattern /old-blog/
curator> summary
curator> save
curator> quit
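The effect of the pattern commands in that session can be approximated as follows. This is a sketch under the assumption that a pattern is a plain substring of the URL; curate.py may use globs or regular expressions instead, and the function names here are hypothetical.

```python
from collections import Counter


def type_pattern(entries: list[dict], pattern: str, page_type: str) -> int:
    """Set page_type on every entry whose URL contains pattern."""
    matched = [e for e in entries if pattern in e["url"]]
    for entry in matched:
        entry["page_type"] = page_type
    return len(matched)


def skip_pattern(entries: list[dict], pattern: str) -> int:
    """Mark every entry whose URL contains pattern as skipped."""
    matched = [e for e in entries if pattern in e["url"]]
    for entry in matched:
        entry["skip"] = True
    return len(matched)


def summary(entries: list[dict]) -> Counter:
    """Count non-skipped entries per page type."""
    return Counter(
        e.get("page_type") or "untyped" for e in entries if not e.get("skip")
    )


if __name__ == "__main__":
    # Hypothetical entries in the urls.json shape.
    urls = [
        {"url": "https://example.com/gallery/a", "page_type": None, "skip": False},
        {"url": "https://example.com/old-blog/x", "page_type": None, "skip": False},
        {"url": "https://example.com/about", "page_type": "page", "skip": False},
    ]
    type_pattern(urls, "/gallery/", "gallery")
    skip_pattern(urls, "/old-blog/")
    print(dict(summary(urls)))
```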
Step 3: Tagger
Define CSS selectors for content.
./start.sh tagger JOB_DIR [options]
| Option | What it does |
|---|---|
| `--show` | Show current tags |
| `--edit` | Open tags.json in editor |
| `--add NAME SELECTOR` | Add global tag |
| `--add-to TYPE NAME SELECTOR` | Add page-type tag |
| `--remove NAME [TYPE]` | Remove tag |
Interactive commands:
| Command | What it does |
|---|---|
| `add NAME SELECTOR` | Add global tag |
| `add NAME SELECTOR TYPE` | Add to page type |
| `remove NAME [TYPE]` | Remove tag |
| `show` | Display all tags |
| `edit` | Open in editor |
| `quit` | Exit |
Example:
./start.sh tagger jobs/example_com/
tagger> add nav .main-navigation
tagger> add hero .hero-section homepage
tagger> add content .entry-content article
tagger> add gallery_grid .gallery-container gallery
tagger> show
tagger> quit
Selector quick reference:
| Pattern | Matches |
|---|---|
| `#myId` | ID |
| `.myClass` | Class |
| `.class1.class2` | Multiple classes |
| `.parent .child` | Nested |
| `.parent > .child` | Direct child |
| `nav, .navigation` | Either (fallback) |
| `[data-type="hero"]` | Attribute |
Step 4: Preview
Visual QA before full run.
./start.sh preview JOB_DIR [options]
| Option | Default | What it does |
|---|---|---|
| `--sample=N` | 5 | Pages to preview |
| `--type=TYPE` | all | Only this type |
| `--urls=URL1,URL2` | – | Specific URLs |
Output: preview/preview.xlsx with screenshots + extraction results per tag.
What to check:
- ✓ = selector found content
- ✗ = selector found nothing (fix selector)
- Too much text = selector too broad
- Missing content = selector too narrow
Step 5: Runner
Fetch pages with Playwright.
./start.sh runner JOB_DIR [options]
| Option | Default | What it does |
|---|---|---|
| `--action-set=NAME` | basic | Which actions to run |
| `--type=TYPE` | – | Only this page type |
| `--all` | – | All non-skipped URLs |
| `--test-url=URL` | – | Single URL test |
| `--retry-failed` | – | Retry previously failed URLs |
| `--reset` | – | Clear fetched.json, start fresh |
Resume support: Runner automatically skips already-completed URLs. Just run again to continue where you left off.
Built-in action sets:
| Name | What it does |
|---|---|
| `basic` | Load + dismiss cookies |
| `homepage` | Load + hover nav |
| `article` | Load + click “read more” |
| `scroll_load` | Scroll 3x (lazy loading) |
| `age_gate` | Handle age verification |
| `gallery_modal` | Click through gallery images |
| `nav_hover` | Extract dropdown menus |
| `click_modal` | Click and capture modals |
| `pixel_click` | Click at coordinates |
Example workflow:
# Different action sets per page type
./start.sh runner jobs/example_com/ --action-set=homepage --type=homepage
./start.sh runner jobs/example_com/ --action-set=article --type=article
./start.sh runner jobs/example_com/ --action-set=gallery_modal --type=gallery
# Or just basic for everything
./start.sh runner jobs/example_com/ --all
Step 6: Extract
Apply tags to saved HTML.
./start.sh extract JOB_DIR [options]
| Option | What it does |
|---|---|
| `--clean` | Remove scripts, tracking, hidden elements |
For each tag, extracts:
- `found` – true/false
- `html` – Raw HTML
- `text` – Text only
- `links` – Array of {text, href}
- `images` – Array of {src, alt}
- `videos` – Array of {src, type}
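To make the per-tag record more concrete, the sketch below builds a similar dictionary from an HTML fragment. It deliberately uses the standard library's `html.parser` rather than the BeautifulSoup package that start.sh installs, it leaves out `videos`, and the class and function names are hypothetical, so treat it as an illustration of the output shape rather than of extract.py itself.

```python
from html.parser import HTMLParser


class FragmentSummary(HTMLParser):
    """Collect text, links and images from an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.links: list[dict] = []
        self.images: list[dict] = []
        self._open_link = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and "href" in attr:
            self._open_link = {"text": "", "href": attr["href"]}
        elif tag == "img":
            self.images.append(
                {"src": attr.get("src") or "", "alt": attr.get("alt") or ""}
            )

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self._open_link["text"] = self._open_link["text"].strip()
            self.links.append(self._open_link)
            self._open_link = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._open_link is not None:
            self._open_link["text"] += data


def summarize_fragment(html: str) -> dict:
    """Return a record shaped like the per-tag fields listed above."""
    parser = FragmentSummary()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())  # collapse whitespace
    return {
        "found": bool(html.strip()),
        "html": html,
        "text": text,
        "links": parser.links,
        "images": parser.images,
    }


if __name__ == "__main__":
    sample = '<div><p>Hello <a href="/about">About us</a></p><img src="/a.png" alt="Logo"></div>'
    print(summarize_fragment(sample))
```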
Step 7: Convert
Map tags to HTML Builder modules or generate semantic tag groups.
./start.sh convert JOB_DIR [options]
| Option | What it does |
|---|---|
| `--interactive` | Prompt for each tag |
| `--mapping=TAG:MOD,...` | Set from command line |
| `--tag-groups` | Generate semantic tag groups (new format) |
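The `TAG:MOD,...` syntax of `--mapping` could be parsed along these lines. This is a sketch: `parse_mapping` is a hypothetical helper, and rejecting unknown module names is an assumption about how convert.py behaves. The module names themselves are the ones listed under "Module types" in this README.

```python
# Module names as documented in this README.
MODULE_TYPES = {
    "navigation", "hero", "text", "image", "gallery",
    "cards", "video", "footer", "html", "skip",
}


def parse_mapping(spec: str) -> dict[str, str]:
    """Turn 'nav:navigation,content:text' into {'nav': 'navigation', ...}."""
    mapping: dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        tag, sep, module = pair.partition(":")
        tag, module = tag.strip(), module.strip()
        if not sep or not tag or not module:
            raise ValueError(f"expected TAG:MOD, got {pair!r}")
        if module not in MODULE_TYPES:
            raise ValueError(f"unknown module type {module!r} for tag {tag!r}")
        mapping[tag] = module
    return mapping


if __name__ == "__main__":
    print(parse_mapping("nav:navigation,content:text,sidebar:skip"))
```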
Output Formats
Default: Module-based (htmlbuilder_import.json)
{
"pages": [
{
"name": "About",
"modules": [
{"type": "navigation", "data": {...}},
{"type": "text", "data": {...}}
]
}
]
}
Tag Groups (--tag-groups → tag_groups.json)
{
"tag_groups": {
"sitename-nav": {
"context_aware": false,
"value": {"items": [...]}
},
"sitename-h1s": {
"context_aware": true,
"by_page": {
"home": {"value": "Welcome"},
"about": {"value": "About Us"}
}
}
}
}
When to Use Each
| Format | Use When |
|---|---|
| Module-based | Building static pages, each page independent |
| Tag Groups | Building dynamic pages, content varies by context |
Tag Groups: Context Awareness
Global tags (same on all pages):
- nav, navigation, footer, sidebar, header
Page-specific tags (vary per page):
- hero, content, page_header, title, gallery items
The converter auto-detects context-awareness by checking if values differ across pages.
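One way to picture that check: collect a tag's value on every page and see whether more than one distinct value turns up. The sketch below does this and emits records in the tag-group shape shown earlier; `build_tag_group` is a hypothetical name and the comparison via JSON serialization is an assumption, not necessarily what convert.py does.

```python
import json


def build_tag_group(values_by_page: dict) -> dict:
    """Return a tag-group record, context-aware only if values differ."""
    # Serialize so that dicts and lists can be compared and deduplicated.
    distinct = {json.dumps(v, sort_keys=True) for v in values_by_page.values()}
    if len(distinct) <= 1:
        shared = next(iter(values_by_page.values()), None)
        return {"context_aware": False, "value": shared}
    return {
        "context_aware": True,
        "by_page": {page: {"value": v} for page, v in values_by_page.items()},
    }


if __name__ == "__main__":
    nav = {"home": {"items": ["Home", "About"]}, "about": {"items": ["Home", "About"]}}
    h1 = {"home": "Welcome", "about": "About Us"}
    print(build_tag_group(nav))
    print(build_tag_group(h1))
```

With this approach a tag such as `nav` collapses to a single shared value, while a tag such as an H1 keeps one value per page.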
Module types:
| Type | For |
|---|---|
| `navigation` | Nav menus |
| `hero` | Hero sections |
| `text` | Content blocks |
| `image` | Single images |
| `gallery` | Image galleries |
| `cards` | Card grids |
| `video` | Video embeds |
| `footer` | Footers |
| `html` | Raw HTML fallback |
| `skip` | Exclude |
Action Reference
Actions are JSON objects with "do": "action_name". Used in action sets.
Navigation & Timing
| Action | Example | What it does |
|---|---|---|
| `goto` | `{"do": "goto", "wait": "networkidle"}` | Load page |
| `goto` | `{"do": "goto", "url": "https://...", "wait": "load"}` | Load specific URL |
| `wait` | `{"do": "wait", "seconds": 2}` | Wait N seconds |
Wait conditions: load, domcontentloaded, networkidle
Mouse Actions
| Action | Example | What it does |
|---|---|---|
| `click` | `{"do": "click", "selector": ".btn"}` | Click element |
| `click` | `{"do": "click", "selector": ".maybe", "optional": true}` | Click if exists |
| `hover` | `{"do": "hover", "selector": "nav"}` | Hover element |
| `click_xy` | `{"do": "click_xy", "x": 960, "y": 540}` | Click coordinates |
| `hover_xy` | `{"do": "hover_xy", "x": 100, "y": 50}` | Hover coordinates |
Input
| Action | Example | What it does |
|---|---|---|
| `type` | `{"do": "type", "selector": "#search", "text": "query"}` | Type text |
| `press` | `{"do": "press", "key": "Enter"}` | Press key |
Scrolling
| Action | Example | What it does |
|---|---|---|
| `scroll` | `{"do": "scroll", "to": "bottom"}` | Scroll to bottom |
| `scroll` | `{"do": "scroll", "to": "top"}` | Scroll to top |
| `scroll` | `{"do": "scroll", "to": 500}` | Scroll to pixel |
Content Extraction
| Action | What it does |
|---|---|
| `extract` | Save page HTML |
| `hover_extract` | Hover, capture what appears |
| `click_extract` | Click, capture modal/overlay |
| `extract_gallery` | Click through gallery, get all images |
hover_extract:
{
"do": "hover_extract",
"selector": ".nav-item",
"wait": 0.5,
"extract": ".dropdown-menu",
"name": "nav_dropdown"
}
click_extract:
{
"do": "click_extract",
"selector": ".thumbnail",
"wait": 1,
"extract": ".modal",
"close": ".modal-close",
"name": "modal_content"
}
extract_gallery:
{
"do": "extract_gallery",
"config": {
"first_item": ".gallery-item",
"modal": "#imageModal",
"next_btn": ".modal-next",
"close_btn": ".modal-close",
"image": "#modalImage",
"caption": "#modalCaption",
"counter": "#modalCounter",
"max_items": 100
}
}
Debug
| Action | Example | What it does |
|---|---|---|
| `screenshot` | `{"do": "screenshot", "name": "step1"}` | Save screenshot |
| `eval` | `{"do": "eval", "js": "..."}` | Run JavaScript |
Viewport Config
Set in action set (for pixel-accurate clicking):
{
"name": "my_action_set",
"viewport_width": 1920,
"viewport_height": 1080,
"device_scale": 1,
"actions": [...]
}
Custom Action Set Example
Save as jobs/{domain}/action-sets/my_custom.json:
{
"name": "wordpress_elementor",
"description": "Handle Elementor sites",
"actions": [
{"do": "goto", "wait": "networkidle"},
{"do": "click", "selector": ".elementor-popup .close", "optional": true},
{"do": "wait", "seconds": 1},
{"do": "scroll", "to": "bottom"},
{"do": "wait", "seconds": 2},
{"do": "extract"}
]
}
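Before handing a custom action set to the runner, it can help to sanity-check it. The sketch below is one possible checker: the action names come from the Action Reference tables, but the required keys per action are inferred from the examples in this README and are an assumption, as is the helper name `validate_action_set`.

```python
# Action names listed in the Action Reference tables.
KNOWN_ACTIONS = {
    "goto", "wait", "click", "hover", "click_xy", "hover_xy", "type", "press",
    "scroll", "extract", "hover_extract", "click_extract", "extract_gallery",
    "screenshot", "eval",
}

# Keys each action appears to need, inferred from the examples (assumption).
REQUIRED_KEYS = {
    "wait": ("seconds",),
    "click": ("selector",),
    "hover": ("selector",),
    "click_xy": ("x", "y"),
    "hover_xy": ("x", "y"),
    "type": ("selector", "text"),
    "press": ("key",),
    "scroll": ("to",),
}


def validate_action_set(action_set: dict) -> list[str]:
    """Return a list of human-readable problems; empty means it looks fine."""
    actions = action_set.get("actions")
    if not isinstance(actions, list) or not actions:
        return ["'actions' must be a non-empty list"]
    problems: list[str] = []
    for index, action in enumerate(actions):
        name = action.get("do") if isinstance(action, dict) else None
        if name not in KNOWN_ACTIONS:
            problems.append(f"action {index}: unknown or missing 'do' value {name!r}")
            continue
        for key in REQUIRED_KEYS.get(name, ()):
            if key not in action:
                problems.append(f"action {index} ({name}): missing '{key}'")
    return problems


# The custom action set from this section, as a Python dict.
ELEMENTOR = {
    "name": "wordpress_elementor",
    "description": "Handle Elementor sites",
    "actions": [
        {"do": "goto", "wait": "networkidle"},
        {"do": "click", "selector": ".elementor-popup .close", "optional": True},
        {"do": "wait", "seconds": 1},
        {"do": "scroll", "to": "bottom"},
        {"do": "wait", "seconds": 2},
        {"do": "extract"},
    ],
}

if __name__ == "__main__":
    print(validate_action_set(ELEMENTOR))
```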
File Formats
urls.json
{
"site": {"url": "...", "domain": "..."},
"urls": [
{"url": "...", "title": "...", "page_type": "article", "skip": false}
]
}
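Reading this file back is straightforward. The sketch below loads `urls.json` from a job directory and selects the URLs a run with `--all` or `--type=TYPE` would presumably cover, that is, every entry not marked `skip`, optionally narrowed to one page type. The function names are hypothetical and the selection rule is inferred from the option descriptions.

```python
import json
from pathlib import Path


def load_urls(job_dir: str) -> dict:
    """Read urls.json from a job directory such as jobs/example_com/."""
    return json.loads((Path(job_dir) / "urls.json").read_text(encoding="utf-8"))


def select_urls(data: dict, page_type=None) -> list[str]:
    """Return non-skipped URLs, optionally limited to one page type."""
    return [
        entry["url"]
        for entry in data.get("urls", [])
        if not entry.get("skip", False)
        and (page_type is None or entry.get("page_type") == page_type)
    ]


if __name__ == "__main__":
    # Hypothetical data in the urls.json shape.
    sample = {
        "site": {"url": "https://example.com", "domain": "example.com"},
        "urls": [
            {"url": "https://example.com/", "title": "Home", "page_type": "homepage", "skip": False},
            {"url": "https://example.com/post", "title": "Post", "page_type": "article", "skip": False},
            {"url": "https://example.com/old", "title": "Old", "page_type": "article", "skip": True},
        ],
    }
    print(select_urls(sample, page_type="article"))
```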
tags.json
{
"global": {
"nav": ".selector"
},
"page_types": {
"article": {
"content": ".selector"
}
}
}
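For a given page type, the effective selectors are presumably the global tags plus that type's own tags. The sketch below merges the two, assuming a page-type tag overrides a global tag of the same name; that precedence and the helper name `tags_for` are assumptions, not documented behavior.

```python
def tags_for(tags: dict, page_type: str) -> dict[str, str]:
    """Merge global selectors with those defined for one page type."""
    merged = dict(tags.get("global", {}))
    # Assumption: page-type tags win over global tags with the same name.
    merged.update(tags.get("page_types", {}).get(page_type, {}))
    return merged


if __name__ == "__main__":
    # Hypothetical tags.json content.
    sample = {
        "global": {"nav": ".main-navigation"},
        "page_types": {
            "article": {"content": ".entry-content"},
            "gallery": {"gallery_grid": ".gallery-container"},
        },
    }
    print(tags_for(sample, "article"))
```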
Action Set
{
"name": "...",
"viewport_width": 1400,
"viewport_height": 900,
"actions": [
{"do": "goto"},
{"do": "extract"}
]
}
Troubleshooting
| Problem | Fix |
|---|---|
| “No module named playwright” | rm -rf venv then run again |
| Selector finds nothing | Test in browser: document.querySelector('.x') |
| Gallery loops forever | Set max_items in config |
| Pixel clicks miss | Match viewport to browser exactly |
| Astro/React breaks navigation | Use wait between actions |
| Permission denied | chmod +x start.sh |
| Job stopped partway | Just run again – auto-resumes |
| Want to re-fetch everything | ./start.sh runner JOB --reset --all |
| Some pages failed | ./start.sh runner JOB --retry-failed |
Full Example: Dragon Lady Site
# 1. Find URLs
./start.sh discover https://dragonladysf.com
# 2. Organize
./start.sh curate jobs/dragonladysf_com/
# type-pattern /gallery/ gallery
# type-pattern /explore/ gallery
# type 0 homepage
# 3. Define selectors
./start.sh tagger jobs/dragonladysf_com/
# add nav .vertical-nav
# add hero .hero lowerlevel
# add content .full-content lowerlevel
# add gallery_grid .gallery-container gallery
# add modal #imageModal gallery
# 4. Preview
./start.sh preview jobs/dragonladysf_com/ --sample=3
# 5. Fetch (different action sets per type)
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=homepage
./start.sh runner jobs/dragonladysf_com/ --action-set=basic --type=lowerlevel
./start.sh runner jobs/dragonladysf_com/ --action-set=gallery_modal --type=gallery
# 6. Extract
./start.sh extract jobs/dragonladysf_com/ --clean
# 7. Convert
./start.sh convert jobs/dragonladysf_com/ --interactive
Files in Package
accid/
├── start.sh # Entry point (handles venv + deps)
├── README.md # This file
├── discover.py # URL discovery + platform detection
├── platforms.py # Platform-specific structure extraction
├── curate.py # URL organization
├── tagger.py # Selector management
├── preview.py # Visual QA
├── runner.py # Page fetching
├── extract.py # Content extraction
├── convert.py # Module mapping + tag groups
├── vault_export.py # Vault-ready export generation
├── example_tags.json # Sample tags file
├── example_tag_groups.json # Sample tag groups output
└── action-sets/ # Built-in action sets
├── gallery_modal.json
├── nav_hover.json
├── click_modal.json
└── pixel_click.json
Vault Export
Generate everything Vault needs to rebuild “dynamic” pages as static files.
./start.sh vault-export jobs/example_com/
Requirements
- Run discover with `--full-structure`: `./start.sh discover https://site.com --full-structure`
- Run convert with `--tag-groups`: `./start.sh convert jobs/site_com/ --tag-groups`
- Generate Vault export: `python vault_export.py jobs/site_com/`
What It Generates
| Content | Description |
|---|---|
| `tag_groups` | Context-aware content bindings |
| `taxonomies` | Categories, tags, authors |
| `relationships` | Posts mapped to categories/tags/authors |
| `generated_pages` | Category archives, tag archives, date archives |
| `navigation` | Main nav, category nav, footer |
| `search_index` | Full-text search data |
Generated Archive Pages
For a WordPress site with 5 categories and 10 tags, Vault export creates:
- 5 category archive pages (`/category/tech/`, `/category/news/`, etc.)
- 10 tag archive pages (`/tag/python/`, `/tag/javascript/`, etc.)
- Author archive pages (`/author/john-doe/`)
- Year archives (`/archive/2024/`)
- Month archives (`/archive/2024/01/`)
Each archive page includes:
- Page metadata (title, description, post count)
- List of posts in that archive
- Template hint for rendering
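As an illustration of how such archive records might be assembled, the sketch below builds one record per category from the `structure` data. The `/category/{slug}/` path and the `archive` template hint mirror values shown in this README, while the record field names and the function name are hypothetical and may not match what vault_export.py writes.

```python
def category_archives(structure: dict) -> list[dict]:
    """Build one archive page record per category."""
    posts_by_category = structure.get("posts_by_category", {})
    pages = []
    for slug, info in structure.get("categories", {}).items():
        posts = posts_by_category.get(slug, [])
        pages.append(
            {
                "path": f"/category/{slug}/",
                "title": info.get("name", slug),
                "post_count": len(posts),
                "posts": posts,
                "template": "archive",  # template hint for rendering
            }
        )
    return pages


if __name__ == "__main__":
    # Hypothetical structure data.
    sample = {
        "categories": {"tech": {"name": "Technology", "count": 2}},
        "posts_by_category": {
            "tech": [
                {"title": "First", "slug": "first"},
                {"title": "Second", "slug": "second"},
            ]
        },
    }
    for page in category_archives(sample):
        print(page["path"], page["title"], page["post_count"])
```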
How Vault Uses This
// In Vault: Load the export
const vaultData = await fetch('/vault_export.json').then(r => r.json());
// Get all posts in a category
const techPosts = vaultData.relationships.posts_by_category['tech'];
// Generate static category page
const categoryPage = {
title: vaultData.taxonomies.categories['tech'].name,
posts: techPosts,
template: 'archive'
};
// Context-aware content
const pageH1 = vaultData.tag_groups['site-h1s'].by_page[currentPage];
The Key Insight
No database needed. All relationships are pre-computed:
- Post → Categories mapping: `posts_by_category`
- Post → Tags mapping: `posts_by_tag`
- Category hierarchy: `category_hierarchy`
- Search index: `search_index`
Vault just reads JSON and renders. “Dynamic” pages are actually static files generated from this data.
