Contextractor npm CLI

contextractor is the standalone command-line tool for extracting clean content from websites.

To embed extraction in your own Node.js code instead, see the npm library page.

Install

npm install contextractor
npx playwright install chromium

Requires Node 22+. Playwright Chromium is needed for browser-based crawling; for the firefox crawler type, run npx playwright install firefox instead (a separate browser binary).

Quick start

Extract a page, then export the stored content to a folder of human-named files:

npx contextractor extract https://en.wikipedia.org/wiki/Web_scraping
npx contextractor export

extract saves each page to local storage (the default key-value store, indexed by a dataset record). export reads that index and writes one file per saved format per successful page, plus a manifest.json. For a single page with no storage involved, pipe extract-one straight to stdout:

npx contextractor extract-one https://en.wikipedia.org/wiki/Web_scraping | less

Subcommands

The CLI has four subcommands: extract, extract-one, export, and purge. The reference examples below invoke contextractor directly; with the local install above, run them via npx contextractor … (or install globally with npm install -g contextractor).

`extract`

contextractor extract [URLS...]

Extract content from one or more URLs and save to storage. A dataset record is pushed for every successful page (status: 'success'); failed requests are pushed with status: 'failed', and skipped URLs can be recorded with --store-skipped-urls. The CLI exits with code 2 when at least one request fails after retries.

contextractor extract https://example.com --mode precision --save json-kvs
contextractor extract https://a.com https://b.com --save txt-kvs
contextractor extract --start-urls-file urls.txt --storage ./my-archive

`extract-one`

contextractor extract-one <url>

Extract a single URL (no link-following) and write the content to file(s) and/or stdout. No storage is involved, so nothing is persisted beyond the files you request. With no --save it prints markdown to stdout (markdown-stdout); all logs and progress go to stderr, so stdout stays clean and pipeable.

contextractor extract-one https://example.com/ | less
contextractor extract-one https://example.com/ --save txt-stdout > body.txt

# → report.md
contextractor extract-one https://example.com/ \
  --save markdown-file -o report

# → out/page.md + out/page.json
contextractor extract-one https://example.com/ \
  --save markdown-file --save json-file -o out/page

--save <token> — repeatable format-destination token; format txt|markdown|json|html|original, destination file|stdout (default: markdown-stdout). At most one format may target stdout
-o, --output <path> — file path for -file tokens: a literal path for one format, a base prefix for several, or a directory (trailing slash or an existing dir) for URL-slug names

All single-page flags from extract (--crawler-type, --proxy, --mode, --wait-for-selector, --cookies, …) work here too; the crawl and storage flags (--globs, --max-crawl-depth, --storage, --session-pool-name, …) belong to extract only. Exits 0 on success, 1 on failure, and 2 when the page was extracted but a requested format yielded no content.
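
The --save token rules above can be sketched in TypeScript as a small validator. This is a minimal illustration, not the CLI's actual parser: the function name parseSaveTokens and the SaveTarget shape are hypothetical. Only the format list, the file|stdout destinations, the markdown-stdout default, and the limit of one format on stdout come from the text.

```typescript
// Hypothetical helper, not part of the contextractor package.
const FORMATS = ["txt", "markdown", "json", "html", "original"] as const;
const DESTINATIONS = ["file", "stdout"] as const;

type Format = (typeof FORMATS)[number];
type Destination = (typeof DESTINATIONS)[number];

interface SaveTarget {
  format: Format;
  destination: Destination;
}

function parseSaveTokens(tokens: string[]): SaveTarget[] {
  // With no --save, the documented default is markdown-stdout.
  const input = tokens.length > 0 ? tokens : ["markdown-stdout"];

  const targets = input.map((token): SaveTarget => {
    // Split on the last dash: "<format>-<destination>".
    const dash = token.lastIndexOf("-");
    const format = token.slice(0, dash);
    const destination = token.slice(dash + 1);
    if (
      dash < 0 ||
      (FORMATS as readonly string[]).indexOf(format) === -1 ||
      (DESTINATIONS as readonly string[]).indexOf(destination) === -1
    ) {
      throw new Error(`Invalid --save token: ${token}`);
    }
    return { format: format as Format, destination: destination as Destination };
  });

  // The text above allows at most one format on stdout.
  const stdoutFormats = targets
    .filter((t) => t.destination === "stdout")
    .map((t) => t.format)
    .filter((f, i, all) => all.indexOf(f) === i);
  if (stdoutFormats.length > 1) {
    throw new Error("At most one format may target stdout");
  }
  return targets;
}
```

For example, parseSaveTokens(["markdown-file", "json-file"]) returns two file targets, while parseSaveTokens(["txt-stdout", "json-stdout"]) throws. Swapping the destination list for dataset and kvs would give the equivalent check for the extract subcommand's tokens.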

`export`

contextractor export

Export stored extraction content to a user-facing output directory. Reads the dataset record index and, for every success record, writes one file per saved format, using the inline content or fetching the key-value-store blob by key. File names are derived from the record title, falling back to its URL and then to page. A manifest.json listing every record (including failed and skipped) is written alongside the files.

contextractor export                 # → ./contextractor-output
contextractor export --output-dir ./out --storage ./my-archive
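
The file-naming fallback described for export (title, then URL, then page) might look roughly like the following TypeScript sketch. The slug rule, the ExportRecord shape, and the reading of page as a literal fallback name are all assumptions made for illustration; the CLI's real file names may differ.

```typescript
// Hypothetical record shape; the real dataset record has more fields.
interface ExportRecord {
  title?: string;
  url?: string;
}

// Assumed slug rule: lower-case, runs of other characters become "-".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Try the title first, then the URL, then fall back to "page".
function exportBaseName(record: ExportRecord): string {
  const candidates = [record.title, record.url];
  for (const candidate of candidates) {
    const slug = candidate ? slugify(candidate) : "";
    if (slug) return slug;
  }
  return "page";
}
```

Under these assumptions, a record with a usable title never consults its URL, and a record with neither a title nor a URL still gets a file name.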

`purge`

contextractor purge

Clears the storage at --storage: the datasets/, key_value_stores/, and request_queues/ dirs. This has the same effect as passing --purge to extract, which clears storage before the crawl starts. The deletion is permanent, and there is no confirmation prompt.

contextractor purge                          # purge the resolved storage dir
contextractor purge --storage ./my-archive   # purge a specific storage dir

Command reference

Storage

Option	Description
`--storage <path>`	Storage directory holding the datasets/key_value_stores/request_queues (default: `./storage` or the XDG data dir). One `--storage` path fully identifies a run's storage
`--purge`	Purge the storage at `--storage` before extracting (datasets, KVS, request queues)
`--start-urls-file <path>`	Read start URLs (one per line) from a file
`--config-file <path>`, `-c <path>`	Path to JSON config file

The export command accepts --output-dir <path> (default ./contextractor-output) and --storage <path>.

Crawl settings

Option	Description
`--crawler-type <type>`	Crawler engine: `adaptive` (default), `firefox`, `chromium`, `cheerio`
`--rendering-type-detection <ratio>`	Rendering type detection ratio 0–1 (adaptive only, e.g. `0.1`)
`--max-requests-per-crawl <n>`	Max requests to handle (0 = unlimited)
`--max-crawl-depth <n>`	Max link depth from start URLs (0 = unlimited)
`--max-results <n>`	Max results per crawl (0 = unlimited)
`--initial-concurrency <n>`	Initial parallel requests (0 = Crawlee default)
`--max-concurrency <n>`	Max parallel requests (default: 3)
`--max-retries <n>`	Max request retries (default: 3)

Crawl filtering

Option	Description
`--globs <pattern>`	Glob pattern to include (repeatable)
`--exclude <pattern>`	Glob pattern to exclude (repeatable)
`--selector <css>`	CSS selector for links to follow
`--keep-url-fragment`	Preserve URL fragments
`--use-sitemaps`	Discover and enqueue URLs from sitemap.xml at each start URL domain root
`--respect-robots-txt`	Honor robots.txt
`--deduplication <level>`	Deduplication level: `minimal`, `standard` (default), or `aggressive`

Browser

Option	Description
`--headless` / `--no-headless`	Browser headless mode (default: headless)
`--wait-until <event>`	Page load event: `load`, `domcontentloaded`, `networkidle`, `commit`
`--navigation-timeout <secs>`	Navigation timeout in seconds (default: 60)
`--wait-for-dynamic-content <secs>`	Maximum seconds to wait for dynamic content after navigation; the crawler continues as soon as the network is idle or this timeout elapses, whichever comes first (0 = disabled)
`--wait-for-selector <css>`	CSS selector to wait for before extracting (fails on timeout)
`--soft-wait-for-selector <css>`	CSS selector to wait for before extracting (continues on timeout)
`--block-media` / `--no-block-media`	Block images, stylesheets, fonts, PDFs, and ZIPs (default: on)
`--ignore-cors-and-csp`	Disable CORS/CSP restrictions
`--ignore-https-errors`	Skip SSL certificate verification
`--close-cookie-modals` / `--no-close-cookie-modals`	Auto-dismiss cookie banners (default: on)
`--max-scroll-height <px>`	Max scroll height in pixels
`--user-agent <ua>`	Custom User-Agent string

Proxy & sessions

Option	Description
`--proxy <url>`	Proxy URL (repeatable)
`--proxy-rotation <strategy>`	Rotation: `recommended`, `per-request`, `until-failure`
`--session-pool-name <name>`	Named session pool for cross-run session sharing
`--max-session-rotations <n>`	Max session rotations per request on block detection (default: 10)

Cookies & headers

Option	Description
`--cookies <json>`	JSON array of cookie objects
`--headers <json>`	JSON object of custom HTTP headers

Output & extraction

Option	Description
`--save <token>`	Format-destination token, repeatable: `{txt,markdown,json,html,original}-{dataset,kvs}` (default `markdown-kvs`). To save a format to both destinations, pass it once per destination (for example `--save json-dataset --save json-kvs`). Saving `original`/`html` to the dataset risks running out of memory on large pages
`--mode <mode>`	Extraction mode: `precision` (less noise), `balanced` (default), `recall` (more content)
`--language <lang>`	Filter by language (e.g. `en`)
`--no-links`	Exclude links from output
`--no-comments`	Exclude comments from output
`--no-tables`	Exclude tables from output
`--images` / `--no-images`	Include image alt text and captions (default: off)
`--store-skipped-urls`	Push skipped URL records to the dataset after crawl
`--verbose`, `-v`	Enable verbose logging

JSON config

Pass --config-file path/to/config.json. Keys use the same camelCase shape as the Apify Actor input. Orchestration flags (--proxy, --purge, --storage) are CLI-only and must be set on the command line.

contextractor extract --config-file config.json --max-requests-per-crawl 10

{
  "startUrls": [{ "url": "https://example.com" }],
  "headless": false,
  "maxRequestsPerCrawl": 10,
  "mode": "recall",
  "includeImages": true,
  "save": ["txt-dataset"]
}

Config merge order: schema defaults → config file → explicit CLI args, with each later source overriding the earlier ones. Unknown keys are stripped. The datasetName, keyValueStoreName, and requestQueueName fields apply only to the Apify Actor; the CLI parses but ignores them and always uses the default storage buckets under --storage.
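
The merge order can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the CLI's implementation: the function name mergeConfig is hypothetical, the known-key list is limited to the keys in the example config above, and the defaults shown are the ones the option tables state (headless on, balanced mode, images off, markdown-kvs).

```typescript
type Config = Record<string, unknown>;

// Keys taken from the example config above; the real schema has more.
const KNOWN_KEYS = [
  "startUrls",
  "headless",
  "maxRequestsPerCrawl",
  "mode",
  "includeImages",
  "save",
];

// Defaults as stated in the option tables.
const SCHEMA_DEFAULTS: Config = {
  headless: true,
  mode: "balanced",
  includeImages: false,
  save: ["markdown-kvs"],
};

// cliArgs should hold only the flags the user passed explicitly.
function mergeConfig(fileConfig: Config, cliArgs: Config): Config {
  const merged: Config = {};
  // Later sources override earlier ones: defaults, then file, then CLI.
  for (const source of [SCHEMA_DEFAULTS, fileConfig, cliArgs]) {
    for (const key of Object.keys(source)) {
      // Unknown keys are stripped rather than passed through.
      if (KNOWN_KEYS.indexOf(key) !== -1 && source[key] !== undefined) {
        merged[key] = source[key];
      }
    }
  }
  return merged;
}

// mode comes from the file, maxRequestsPerCrawl from the CLI,
// headless from the defaults, and notARealKey is dropped.
const merged = mergeConfig(
  { mode: "recall", maxRequestsPerCrawl: 10, notARealKey: 1 },
  { maxRequestsPerCrawl: 5 },
);
console.log(merged);
```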

Storage directory resolution

The storage directory is resolved in this order (first match wins):

1. --storage CLI flag
2. CONTEXTRACTOR_STORAGE_DIR env var
3. CRAWLEE_STORAGE_DIR env var (Crawlee native compatibility)
4. ./storage if .actor/ or ./storage/ exists in the current working directory
5. ${XDG_DATA_HOME:-~/.local/share}/contextractor/storage (XDG fallback)
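
The lookup order above can be sketched in TypeScript as a pure function. This is a sketch under assumptions, not the CLI's code: the name resolveStorageDir and the ResolveInput shape are hypothetical, and the environment, working directory, home directory, and existence check are injected so the logic can be read and tested without touching the file system.

```typescript
interface ResolveInput {
  storageFlag?: string; // value of --storage, if passed
  env: Record<string, string | undefined>; // e.g. process.env
  cwd: string; // current working directory
  homeDir: string; // e.g. os.homedir()
  exists: (path: string) => boolean; // e.g. fs.existsSync
}

function resolveStorageDir(input: ResolveInput): string {
  const { storageFlag, env, cwd, homeDir, exists } = input;

  // 1. An explicit --storage flag always wins.
  if (storageFlag) return storageFlag;
  // 2. and 3. Environment variables, Contextractor's own first.
  if (env.CONTEXTRACTOR_STORAGE_DIR) return env.CONTEXTRACTOR_STORAGE_DIR;
  if (env.CRAWLEE_STORAGE_DIR) return env.CRAWLEE_STORAGE_DIR;
  // 4. A project-local ./storage when .actor/ or ./storage/ exists.
  //    (Returned as an absolute path here; the text writes it as ./storage.)
  if (exists(`${cwd}/.actor`) || exists(`${cwd}/storage`)) {
    return `${cwd}/storage`;
  }
  // 5. XDG fallback; an unset or empty XDG_DATA_HOME means ~/.local/share.
  const dataHome = env.XDG_DATA_HOME || `${homeDir}/.local/share`;
  return `${dataHome}/contextractor/storage`;
}
```

In a real script the inputs would typically come from process.env, process.cwd(), os.homedir(), and fs.existsSync. Passing them in keeps the first-match-wins order easy to verify.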

Where to go next

Help hub — the index of every Contextractor help page.
Playground — enter a URL and preview extraction results in your browser.
Apify Actor — run extraction at scale on the Apify platform.
npm library — embed extraction in your own Node.js application.
PyPI package — extract from Python with the contextractor PyPI wrapper.

Updated: July 5, 2026