CLI
Install via npm:
npm install -g contextractor
Extract content from a URL:
contextractor https://example.com
Command reference
contextractor [OPTIONS] [URLS...]
Crawl settings
| Option | Description |
|---|---|
| --config, -c | Path to JSON config file (optional) |
| --output-dir, -o | Output directory |
| --format, -f | Output format: txt, markdown, json, jsonl, xml, xmltei |
| --max-pages | Max pages to crawl (0 = unlimited) |
| --crawl-depth | Max link depth from start URLs (0 = start only) |
| --headless / --no-headless | Browser headless mode (default: headless) |
| --max-concurrency | Max parallel requests (default: 50) |
| --max-retries | Max request retries (default: 3) |
| --max-results | Max results per crawl (0 = unlimited) |
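
As an illustration, several of the crawl options in the table above can be combined into one bounded crawl. The sketch below only assembles and prints the argument list; the helper name `build_crawl_command` and the chosen values are hypothetical, and it assumes each option takes its value as the next space-separated argument.

```python
def build_crawl_command(urls, output_dir="./output", fmt="markdown",
                        max_pages=20, crawl_depth=1):
    """Return an argv list that uses only flags from the crawl settings table."""
    return [
        "contextractor",
        "--output-dir", output_dir,
        "--format", fmt,
        "--max-pages", str(max_pages),      # 0 would mean unlimited
        "--crawl-depth", str(crawl_depth),  # 0 would mean start URLs only
        *urls,
    ]


cmd = build_crawl_command(["https://example.com"])
print(" ".join(cmd))
```

To actually start the crawl, the list could be passed to `subprocess.run(cmd, check=True)` once the CLI is installed.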
Proxy
| Option | Description |
|---|---|
| --proxy-urls | Comma-separated proxy URLs (http://user:pass@host:port) |
| --proxy-rotation | Rotation: recommended, per_request, until_failure |
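
Because `--proxy-urls` is a comma-separated list of `http://user:pass@host:port` URLs, credentials containing `@`, `:` or `,` could be misread. A minimal sketch of one way to build the value is shown below; the helper `proxy_url` and the host names are hypothetical, and it assumes the tool accepts percent-encoded credentials, which is standard URL practice but not confirmed here.

```python
from urllib.parse import quote


def proxy_url(host, port, user=None, password=None, scheme="http"):
    """Format one proxy URL as scheme://user:pass@host:port."""
    auth = ""
    if user is not None:
        # Percent-encode credentials so ':', '@' or ',' cannot break the URL or the list.
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


proxies = [
    proxy_url("proxy1.example.com", 8000, "user", "p@ss"),
    proxy_url("proxy2.example.com", 8000),
]
proxy_args = ["--proxy-urls", ",".join(proxies), "--proxy-rotation", "per_request"]
print(proxy_args)
```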
Browser
| Option | Description |
|---|---|
| --launcher | Browser engine: chromium, firefox (default: chromium) |
| --wait-until | Page load event: load, networkidle, domcontentloaded (default: load) |
| --page-load-timeout | Timeout in seconds (default: 60) |
| --ignore-cors | Disable CORS/CSP restrictions |
| --close-cookie-modals | Auto-dismiss cookie banners |
| --max-scroll-height | Max scroll height in pixels (default: 5000) |
| --ignore-ssl-errors | Skip SSL certificate verification |
| --user-agent | Custom User-Agent string |
Crawl filtering
| Option | Description |
|---|---|
| --globs | Comma-separated glob patterns to include |
| --excludes | Comma-separated glob patterns to exclude |
| --link-selector | CSS selector for links to follow |
| --keep-url-fragments | Preserve URL fragments |
| --respect-robots-txt | Honor robots.txt |
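
To give a feel for how include and exclude patterns interact, here is a rough sketch using Python's `fnmatch`. It is not the tool's matcher: the rule that excludes win over includes, and the treatment of `**` as "anything", are assumptions made for illustration, and the function `should_follow` is hypothetical.

```python
from fnmatch import fnmatchcase


def should_follow(url, globs=(), excludes=()):
    """Assumed rule: excluded patterns win; with no include globs, everything else passes."""
    if any(fnmatchcase(url, pattern) for pattern in excludes):
        return False
    if not globs:
        return True
    return any(fnmatchcase(url, pattern) for pattern in globs)


globs = ["https://example.com/blog/**"]
excludes = ["https://example.com/blog/archive/**"]

print(should_follow("https://example.com/blog/post-1", globs, excludes))        # True
print(should_follow("https://example.com/blog/archive/2019", globs, excludes))  # False
print(should_follow("https://example.com/about", globs, excludes))              # False

# On the command line both lists are passed comma-separated.
filter_args = ["--globs", ",".join(globs), "--excludes", ",".join(excludes)]
```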
Cookies & headers
| Option | Description |
|---|---|
| --cookies | JSON array of cookie objects |
| --headers | JSON object of custom HTTP headers |
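
Both options take JSON as a single argument, which is easy to get wrong with shell quoting. One hedged way to avoid hand-escaping is to serialize the values with `json.dumps` and let `shlex.join` produce a shell-safe line. The cookie and header values below are the placeholder ones from the config example in this document.

```python
import json
import shlex

cookies = [{"name": "session", "value": "abc", "domain": ".example.com"}]
headers = {"Authorization": "Bearer token"}

args = [
    "contextractor",
    "--cookies", json.dumps(cookies),
    "--headers", json.dumps(headers),
    "https://example.com",
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```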
Output toggles
| Option | Description |
|---|---|
| --save-raw-html | Save raw HTML to output |
| --save-text | Save extracted text |
| --save-json | Save extracted JSON |
| --save-xml | Save extracted XML |
| --save-xml-tei | Save extracted XML-TEI |
Content extraction
| Option | Description |
|---|---|
| --precision | High precision mode (less noise) |
| --recall | High recall mode (more content) |
| --fast | Fast extraction mode (less thorough) |
| --no-links | Exclude links from output |
| --no-comments | Exclude comments from output |
| --include-tables / --no-tables | Include tables (default: include) |
| --include-images | Include image descriptions |
| --include-formatting / --no-formatting | Preserve formatting (default: preserve) |
| --deduplicate | Deduplicate extracted content |
| --target-language | Filter by language (e.g. "en") |
| --with-metadata / --no-metadata | Extract metadata (default: with) |
| --prune-xpath | XPath patterns to remove from content |
Diagnostics
| Option | Description |
|---|---|
| --verbose, -v | Enable verbose logging |
Config file
Instead of passing all options on the command line, you can use a JSON config file:
contextractor --config config.json
{
  "urls": ["https://example.com", "https://docs.example.com"],
  "outputFormat": "markdown",
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "launcher": "chromium",
  "waitUntil": "load",
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxConcurrency": 50,
  "maxRetries": 3,
  "cookies": [{"name": "session", "value": "abc", "domain": ".example.com"}],
  "headers": {"Authorization": "Bearer token"},
  "saveRawHtml": false,
  "saveText": false,
  "extraction": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}
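
A config file like this can also be generated from a script, for example when the URL list comes from elsewhere. The sketch below writes a small config and reads it back to confirm it is valid JSON; the temporary directory and the chosen fields are assumptions for the example.

```python
import json
import tempfile
from pathlib import Path

config = {
    "urls": ["https://example.com"],
    "outputFormat": "markdown",
    "outputDir": "./output",
    "crawlDepth": 1,
    "extraction": {"favorPrecision": True, "includeTables": True},
}

# Written to a temporary directory here; a real project would keep config.json alongside its code.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Reading it back confirms the file parses before it is handed to the CLI.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
cmd = ["contextractor", "--config", str(config_path)]
print(" ".join(cmd))
```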
Crawl settings
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | [] | URLs to extract |
| maxPages | int | 0 | Max pages (0 = unlimited) |
| outputFormat | string | "markdown" | Output format |
| outputDir | string | "./output" | Output directory |
| crawlDepth | int | 0 | Link follow depth |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel requests |
| maxRetries | int | 3 | Max request retries |
| maxResults | int | 0 | Max results (0 = unlimited) |
Proxy
| Field | Type | Default | Description |
|---|---|---|---|
| proxy.urls | array | [] | Proxy URLs |
| proxy.rotation | string | "recommended" | recommended, per_request, until_failure |
| proxy.tiered | array | [] | Tiered proxy escalation (config-file only) |
Browser
| Field | Type | Default | Description |
|---|---|---|---|
| launcher | string | "chromium" | chromium or firefox |
| waitUntil | string | "load" | load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP |
| closeCookieModals | bool | false | Auto-dismiss cookie banners |
| maxScrollHeight | int | 5000 | Max scroll pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL verification |
| userAgent | string | "" | Custom User-Agent string |
Crawl filtering
| Field | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns to include |
| excludes | array | [] | Glob patterns to exclude |
| linkSelector | string | "" | CSS selector for links |
| keepUrlFragments | bool | false | Preserve URL fragments |
| respectRobotsTxt | bool | false | Honor robots.txt |
Cookies & headers
| Field | Type | Default | Description |
|---|---|---|---|
| cookies | array | [] | Initial cookies |
| headers | object | {} | Custom HTTP headers |
Output toggles
| Field | Type | Default | Description |
|---|---|---|---|
| saveRawHtml | bool | false | Save raw HTML |
| saveText | bool | false | Save plain text |
| saveJson | bool | false | Save JSON |
| saveXml | bool | false | Save XML |
| saveXmlTei | bool | false | Save XML-TEI |
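
Since the field names are camelCase, a typo such as `outputformat` is easy to make. The following sketch compares a config's top-level keys against the fields listed in the tables above, plus the nested `extraction` object. Whether the tool itself rejects or silently ignores unknown keys is not stated here, so treat this as an optional pre-flight check; the function name `unknown_fields` is hypothetical.

```python
import json

# Top-level keys taken from the field tables above, plus the nested "extraction" object.
KNOWN_FIELDS = {
    "urls", "maxPages", "outputFormat", "outputDir", "crawlDepth", "headless",
    "maxConcurrency", "maxRetries", "maxResults", "proxy", "launcher",
    "waitUntil", "pageLoadTimeout", "ignoreCors", "closeCookieModals",
    "maxScrollHeight", "ignoreSslErrors", "userAgent", "globs", "excludes",
    "linkSelector", "keepUrlFragments", "respectRobotsTxt", "cookies",
    "headers", "saveRawHtml", "saveText", "saveJson", "saveXml", "saveXmlTei",
    "extraction",
}


def unknown_fields(config_text):
    """Return top-level keys that do not appear in the documented tables."""
    config = json.loads(config_text)
    return sorted(set(config) - KNOWN_FIELDS)


sample = '{"urls": ["https://example.com"], "outputformat": "markdown", "maxPages": 10}'
print(unknown_fields(sample))  # ['outputformat']
```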
Extraction options
These fields go under the "extraction" key in the config file. Each has an equivalent CLI flag, listed under Content extraction in the command reference.
| Field | Type | Default | Description |
|---|---|---|---|
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| fast | bool | false | Fast mode (less thorough) |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata |
| targetLanguage | string | null | Filter by language |
| pruneXpath | array | null | XPath patterns to remove |
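
The correspondence between these fields and the CLI flags can be written down as a small lookup. The pairing below is inferred from matching names in the two tables rather than confirmed by the tool, and the function `extraction_to_flags` is hypothetical. `pruneXpath` is left out because the value format of `--prune-xpath` is not specified here.

```python
# Assumed pairing of extraction fields with CLI flags: (flag when true, flag when false).
BOOL_FLAGS = {
    "favorPrecision": ("--precision", None),
    "favorRecall": ("--recall", None),
    "fast": ("--fast", None),
    "includeComments": (None, "--no-comments"),
    "includeTables": ("--include-tables", "--no-tables"),
    "includeImages": ("--include-images", None),
    "includeFormatting": ("--include-formatting", "--no-formatting"),
    "includeLinks": (None, "--no-links"),
    "deduplicate": ("--deduplicate", None),
    "withMetadata": ("--with-metadata", "--no-metadata"),
}


def extraction_to_flags(extraction):
    """Translate an extraction object into the flags that would express it."""
    flags = []
    for field, value in extraction.items():
        if field in BOOL_FLAGS:
            on_flag, off_flag = BOOL_FLAGS[field]
            flag = on_flag if value else off_flag
            if flag:
                flags.append(flag)
        elif field == "targetLanguage" and value:
            flags += ["--target-language", value]
    return flags


print(extraction_to_flags({"favorPrecision": True, "includeLinks": False, "targetLanguage": "en"}))
# ['--precision', '--no-links', '--target-language', 'en']
```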
CLI flags override config file settings, which in turn override the built-in defaults. Merge order: defaults → config file → CLI args.
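
The merge order can be pictured as successive dictionary updates where later sources win. This is a simplified sketch, not the tool's code: it merges top-level keys only (how nested objects such as `proxy` or `extraction` are merged is not described here), and it assumes that CLI options the user did not pass are represented as `None`.

```python
DEFAULTS = {"outputFormat": "markdown", "crawlDepth": 0, "maxRetries": 3}


def merge_settings(defaults, config_file, cli_args):
    """Later sources win: defaults, then config file, then CLI arguments."""
    merged = dict(defaults)
    merged.update(config_file)
    # Only CLI options the user actually passed should override anything.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


config_file = {"crawlDepth": 1, "outputFormat": "json"}
cli_args = {"outputFormat": "txt", "crawlDepth": None}  # --format txt, no --crawl-depth

print(merge_settings(DEFAULTS, config_file, cli_args))
# {'outputFormat': 'txt', 'crawlDepth': 1, 'maxRetries': 3}
```

Here `outputFormat` comes from the CLI, `crawlDepth` from the config file, and `maxRetries` from the defaults.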
Updated: March 26, 2026