npm CLI
contextractor is the standalone command-line tool for extracting clean content from websites. It is built on the Rust port of Trafilatura (extraction) and Crawlee (TypeScript crawler driving Playwright).
To embed extraction in your own Node.js code instead, see the npm library page.
Install
npm install contextractor
npx playwright install chromium
Requires Node.js 22 or later. Playwright's Chromium build is needed for browser-based crawling.
Quick start
Extract a page, then export the stored content to a folder of human-named files:
npx contextractor extract https://example.com
npx contextractor export --output-dir ./contextractor-output
extract saves each page to local storage (the default key-value store, indexed by a dataset record). export reads that index and writes one file per saved format per successful page, plus a manifest.json.
Subcommands
The CLI has three subcommands: extract, export, and purge.
extract
contextractor extract [URLS...]
Extract content from one or more URLs and save to storage. A dataset record is pushed for every successful page (status: 'success'); failed requests are pushed with status: 'failed', and skipped URLs can be recorded with --store-skipped-urls. The CLI exits with code 2 when at least one request fails after retries.
contextractor extract https://example.com --mode precision --save json
contextractor extract https://a.com https://b.com --save txt
contextractor extract --input-file urls.txt --dataset my-archive
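A wrapper script can branch on the exit code described above. The following is a minimal sketch, not part of the CLI: `nextStepForExit` and the `NextStep` labels are hypothetical names. Only exit code 2 is documented here; treating 0 as full success and any other value as fatal is an assumption.

```typescript
// Hypothetical helper for a wrapper script around `contextractor extract`.
// Only exit code 2 is documented; the meaning of 0 and of other codes is assumed.
type NextStep = "export" | "export-and-report-failures" | "abort";

function nextStepForExit(status: number | null): NextStep {
  if (status === 0) return "export"; // assumed: every request succeeded
  if (status === 2) return "export-and-report-failures"; // documented: at least one request failed after retries
  return "abort"; // any other code, or null when the process was killed by a signal
}
```

Exporting after exit code 2 is still worthwhile in this sketch because successful pages remain in storage and failed requests are recorded with status: 'failed'.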
export
contextractor export
Export stored extraction content to a user-facing output directory. Reads the dataset record index and, for every success record, writes one file per saved format — using the inline content or fetching the key-value-store blob by key. File names are derived from the record title (then its URL, then page), and a manifest.json listing every record (including failed and skipped) is written alongside the files.
contextractor export # → ./contextractor-output
contextractor export --output-dir ./out --dataset my-archive
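The title, then URL, then page fallback for file names can be pictured with the sketch below. It is illustrative only: the `RecordLike` shape, the `slugify` rule, and the reading of page as a literal last-resort name are assumptions, and the exporter's real sanitization rules may differ.

```typescript
// Hypothetical record shape; the real dataset record has more fields.
interface RecordLike {
  title?: string;
  url?: string;
}

// Assumed slug rule: lowercase, non-alphanumeric runs become one dash.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of other characters into a dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
}

// Try the title first, then the URL, then fall back to "page".
function baseName(record: RecordLike): string {
  for (const candidate of [record.title, record.url]) {
    const slug = candidate ? slugify(candidate) : "";
    if (slug) return slug;
  }
  return "page";
}
```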
purge
contextractor purge
Delete stored extraction data. Without flags, purge clears the default dataset and key-value store; --all clears every dataset and key-value store.
contextractor purge # purge default dataset and key-value store
contextractor purge --all # purge all datasets and key-value stores
Command reference
Storage
| Option | Description |
|---|---|
| --dataset <name> | Route output to a named dataset (default: default) |
| --key-value-store <name> | Route content blobs to a named key-value store (default: default) |
| --request-queue <name> | Route pending URLs to a named request queue |
| --save-destination <dest> | Where to save, repeatable: key-value-store (default) or dataset |
| --storage-dir <path> | Override Crawlee storage directory |
| --input-file <file> | Read URLs (one per line) from a file |
| --config, -c <path> | Path to JSON config file |
| --clean | Purge default storage before extracting |
The export command accepts --output-dir <path> (default ./contextractor-output), --dataset <name> (default default), --key-value-store <name> (default default), and --storage-dir <path>. There is no --request-queue on export — queues hold pending URLs, not content.
Crawl settings
| Option | Description |
|---|---|
| --crawler-type <type> | Crawler engine: adaptive (default), firefox, chromium, cheerio |
| --rendering-type-detection <ratio> | Rendering type detection ratio 0–1 (adaptive only, e.g. 0.1) |
| --max-requests-per-crawl <n> | Max requests to handle (0 = unlimited) |
| --max-crawl-depth <n> | Max link depth from start URLs (0 = start only) |
| --max-results <n> | Max results per crawl (0 = unlimited) |
| --initial-concurrency <n> | Initial parallel requests (0 = Crawlee default) |
| --max-concurrency <n> | Max parallel requests |
| --max-retries <n> | Max request retries |
Crawl filtering
| Option | Description |
|---|---|
| --globs <pattern> | Glob pattern to include (repeatable) |
| --exclude <pattern> | Glob pattern to exclude (repeatable) |
| --selector <css> | CSS selector for links to follow |
| --keep-url-fragment | Preserve URL fragments |
| --use-sitemaps | Discover and enqueue URLs from sitemap.xml at each start URL domain root |
| --respect-robots-txt | Honor robots.txt |
| --deduplication <level> | Deduplication level: none, url (default), or content-hash |
Browser
| Option | Description |
|---|---|
| --headless / --no-headless | Browser headless mode (default: headless) |
| --wait-until <event> | Page load event: load, domcontentloaded, networkidle, commit |
| --navigation-timeout <secs> | Page load timeout in seconds |
| --wait-for-dynamic-content <secs> | Seconds to wait for network idle after navigation (0 = disabled) |
| --wait-for-selector <css> | CSS selector to wait for before extracting (fails on timeout) |
| --soft-wait-for-selector <css> | CSS selector to wait for before extracting (continues on timeout) |
| --block-media / --no-block-media | Block images, stylesheets, fonts, PDFs, and ZIPs (default: off) |
| --ignore-cors-and-csp | Disable CORS/CSP restrictions |
| --ignore-https-errors | Skip SSL certificate verification |
| --close-cookie-modals | Auto-dismiss cookie banners |
| --max-scroll-height <px> | Max scroll height in pixels |
| --user-agent <ua> | Custom User-Agent string |
Proxy & sessions
| Option | Description |
|---|---|
| --proxy <url> | Proxy URL (repeatable) |
| --proxy-rotation <strategy> | Rotation: recommended, per-request, until-failure |
| --session-pool-name <name> | Named session pool for cross-run session sharing |
| --max-session-rotations <n> | Max session rotations per request on block detection |
Cookies & headers
| Option | Description |
|---|---|
| --cookies <json> | JSON array of cookie objects |
| --headers <json> | JSON object of custom HTTP headers |
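Because both options take JSON as a single argument, it can help to serialize the values in code rather than hand-quote them in a shell. The sketch below is illustrative: `buildRequestArgs` and `CookieLike` are hypothetical names, and the cookie fields (name, value, domain, path) follow the common browser cookie shape, which is an assumption about what the CLI accepts.

```typescript
// Assumed cookie shape; the fields the CLI actually requires are not documented here.
interface CookieLike {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Build the argument pairs for --cookies (JSON array) and --headers (JSON object).
function buildRequestArgs(cookies: CookieLike[], headers: Record<string, string>): string[] {
  return ["--cookies", JSON.stringify(cookies), "--headers", JSON.stringify(headers)];
}

const requestArgs = buildRequestArgs(
  [{ name: "session", value: "abc123", domain: "example.com", path: "/" }],
  { "Accept-Language": "en" },
);
```

Passing `requestArgs` as an argument array to a process launcher avoids shell quoting problems with the embedded double quotes.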
Output & extraction
| Option | Description |
|---|---|
| --save <format> | Output format, repeatable: markdown (default), txt, json, html, original, all |
| --mode <mode> | Extraction mode: precision (less noise), balanced (default), recall (more content) |
| --language <lang> | Filter by language (e.g. en) |
| --no-links | Exclude links from output |
| --no-comments | Exclude comments from output |
| --no-tables | Exclude tables from output |
| --images / --no-images | Include image alt text and captions (default: off) |
| --store-skipped-urls | Push skipped URL records to the dataset after crawl |
| --verbose, -v | Enable verbose logging |
JSON config
Pass --config path/to/config.json. Keys use the same camelCase shape as the Apify input schema. Orchestration flags (--proxy, --clean) are CLI-only and must be set on the command line.
contextractor extract --config config.json --max-requests-per-crawl 10
{
  "startUrls": [{ "url": "https://example.com" }],
  "headless": false,
  "maxRequestsPerCrawl": 10,
  "mode": "recall",
  "includeImages": true,
  "save": ["txt"],
  "saveDestination": ["dataset"],
  "datasetName": "my-archive"
}
Config merge order: schema defaults → config file → explicit CLI args. Unknown keys are stripped.
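The merge order can be sketched as a layered object merge. This is a simplified illustration, not the CLI's implementation: `SCHEMA_DEFAULTS` and `mergeConfig` are hypothetical names, the schema is cut down to three keys, and the default for maxRequestsPerCrawl is assumed (the mode and headless defaults come from the option tables).

```typescript
type Config = Record<string, unknown>;

// Assumed, heavily reduced schema. Keys not listed here count as unknown.
const SCHEMA_DEFAULTS: Config = {
  mode: "balanced",
  headless: true,
  maxRequestsPerCrawl: 0, // assumed default
};

// Later layers win: schema defaults, then config file, then explicit CLI args.
function mergeConfig(fileConfig: Config, cliArgs: Config): Config {
  const merged: Config = { ...SCHEMA_DEFAULTS };
  for (const layer of [fileConfig, cliArgs]) {
    for (const [key, value] of Object.entries(layer)) {
      // Strip unknown keys; skip undefined so unset CLI flags do not override.
      if (key in SCHEMA_DEFAULTS && value !== undefined) merged[key] = value;
    }
  }
  return merged;
}
```

With a file layer of `{ mode: "recall", maxRequestsPerCrawl: 50, bogus: 1 }` and a CLI layer of `{ maxRequestsPerCrawl: 10 }`, this sketch keeps mode from the file, takes maxRequestsPerCrawl from the CLI, and drops bogus.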
Storage directory resolution
The storage directory is resolved in this order (first match wins):
1. --storage-dir CLI flag
2. CONTEXTRACTOR_STORAGE_DIR env var
3. CRAWLEE_STORAGE_DIR env var (Crawlee native compatibility)
4. ./storage if .actor/ or ./storage/ exists in the current working directory
5. ${XDG_DATA_HOME:-~/.local/share}/contextractor/storage (XDG fallback)
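The first-match-wins resolution above can be expressed as a small pure function. This is a sketch under assumptions, not the CLI's code: `resolveStorageDir` and its input shape are hypothetical, the filesystem check is passed in as a boolean so the function stays testable, and paths are joined with plain string concatenation.

```typescript
interface StorageEnv {
  CONTEXTRACTOR_STORAGE_DIR?: string;
  CRAWLEE_STORAGE_DIR?: string;
  XDG_DATA_HOME?: string;
}

interface ResolveInput {
  storageDirFlag?: string; // value of --storage-dir, if given
  env: StorageEnv;
  cwdHasActorOrStorage: boolean; // true if .actor/ or ./storage/ exists in the cwd
  homeDir: string; // stands in for ~
}

function resolveStorageDir(input: ResolveInput): string {
  const { storageDirFlag, env, cwdHasActorOrStorage, homeDir } = input;
  if (storageDirFlag) return storageDirFlag;
  if (env.CONTEXTRACTOR_STORAGE_DIR) return env.CONTEXTRACTOR_STORAGE_DIR;
  if (env.CRAWLEE_STORAGE_DIR) return env.CRAWLEE_STORAGE_DIR;
  if (cwdHasActorOrStorage) return "./storage";
  // Mirrors ${XDG_DATA_HOME:-~/.local/share}: unset or empty falls back to the default.
  const dataHome = env.XDG_DATA_HOME || `${homeDir}/.local/share`;
  return `${dataHome}/contextractor/storage`;
}
```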
Where to go next
- npm library — embed extraction in your own Node.js application.
- Apify Actor — run extraction at scale on the Apify platform.
- Playground — paste HTML and preview extraction results in your browser.
Updated: June 3, 2026