npm CLI

contextractor is the standalone command-line tool for extracting clean content from websites. It is built on the Rust port of Trafilatura for extraction and on Crawlee, a TypeScript crawler that drives Playwright.

To embed extraction in your own Node.js code instead, see the npm library page.

Install

npm install contextractor
npx playwright install chromium

Requires Node.js 22 or later. Playwright's Chromium build is needed for browser-based crawling.

Quick start

Extract a page, then export the stored content to a folder of human-named files:

npx contextractor extract https://example.com
npx contextractor export --output-dir ./contextractor-output

extract saves each page to local storage (the default key-value store, indexed by a dataset record). export reads that index and writes one file per saved format per successful page, plus a manifest.json.

Subcommands

The CLI has three subcommands: extract, export, and purge.

extract

contextractor extract [URLS...]

Extract content from one or more URLs and save to storage. A dataset record is pushed for every successful page (status: 'success'); failed requests are pushed with status: 'failed', and skipped URLs can be recorded with --store-skipped-urls. The CLI exits with code 2 when at least one request fails after retries.
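A wrapper script can branch on the exit status described above. The following is a minimal sketch, not part of the CLI: the helper name and outcome labels are hypothetical, only exit code 2 is documented on this page, and treating 0 as success is an assumption based on the usual process convention.

```typescript
// Hypothetical helper: classify the exit status of `contextractor extract`.
// Only code 2 is documented; 0 = success is the usual convention (assumption).
type ExtractOutcome = "ok" | "some-requests-failed" | "other-error";

function classifyExit(code: number | null): ExtractOutcome {
  if (code === 0) return "ok";
  // Documented: at least one request failed after retries.
  if (code === 2) return "some-requests-failed";
  // Any other status, or null when the process was ended by a signal.
  return "other-error";
}

console.log(classifyExit(2)); // some-requests-failed
```

A script that should keep going after partial failures can treat "some-requests-failed" as a warning and still run export, since failed requests are recorded in the dataset.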

contextractor extract https://example.com --mode precision --save json
contextractor extract https://a.com https://b.com --save txt
contextractor extract --input-file urls.txt --dataset my-archive

export

contextractor export

Export stored extraction content to a user-facing output directory. The command reads the dataset record index and, for every success record, writes one file per saved format, using either the inline content or the key-value-store blob fetched by key. File names are derived from the record title, falling back to its URL and then to page. A manifest.json listing every record (including failed and skipped ones) is written alongside the files.
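The file-naming fallback can be pictured with a short sketch. Everything here is an assumption for illustration: the record field names (title, url), the slug rule, and the reading of "page" as a literal last-resort name. The CLI's actual sanitization rules are not documented on this page.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface ExportRecord {
  title?: string;
  url?: string;
}

// Illustrative slug rule (not the CLI's own): lowercase, then collapse every
// run of non-alphanumeric characters into a single "-".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Fallback chain: title first, then URL, then the literal name "page".
function baseName(record: ExportRecord): string {
  for (const candidate of [record.title, record.url]) {
    const slug = candidate ? slugify(candidate) : "";
    if (slug) return slug;
  }
  return "page";
}

console.log(baseName({ title: "Example Domain", url: "https://example.com" })); // example-domain
console.log(baseName({ url: "https://example.com/docs" })); // https-example-com-docs
console.log(baseName({})); // page
```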

contextractor export                                  # → ./contextractor-output
contextractor export --output-dir ./out --dataset my-archive

purge

Delete stored datasets and key-value stores, either the default ones or all of them.

contextractor purge        # purge default dataset and key-value store
contextractor purge --all  # purge all datasets and key-value stores

Command reference

Storage

Option                     Description
--dataset <name>           Route output to a named dataset (default: default)
--key-value-store <name>   Route content blobs to a named key-value store (default: default)
--request-queue <name>     Route pending URLs to a named request queue
--save-destination <dest>  Where to save, repeatable: key-value-store (default) or dataset
--storage-dir <path>       Override Crawlee storage directory
--input-file <file>        Read URLs (one per line) from a file
--config, -c <path>        Path to JSON config file
--clean                    Purge default storage before extracting

The export command accepts --output-dir <path> (default ./contextractor-output), --dataset <name> (default default), --key-value-store <name> (default default), and --storage-dir <path>. It has no --request-queue option, because queues hold pending URLs, not content.

Crawl settings

Option                              Description
--crawler-type <type>               Crawler engine: adaptive (default), firefox, chromium, cheerio
--rendering-type-detection <ratio>  Rendering type detection ratio 0–1 (adaptive only, e.g. 0.1)
--max-requests-per-crawl <n>        Max requests to handle (0 = unlimited)
--max-crawl-depth <n>               Max link depth from start URLs (0 = start only)
--max-results <n>                   Max results per crawl (0 = unlimited)
--initial-concurrency <n>           Initial parallel requests (0 = Crawlee default)
--max-concurrency <n>               Max parallel requests
--max-retries <n>                   Max request retries

Crawl filtering

Option                   Description
--globs <pattern>        Glob pattern to include (repeatable)
--exclude <pattern>      Glob pattern to exclude (repeatable)
--selector <css>         CSS selector for links to follow
--keep-url-fragment      Preserve URL fragments
--use-sitemaps           Discover and enqueue URLs from sitemap.xml at each start URL domain root
--respect-robots-txt     Honor robots.txt
--deduplication <level>  Deduplication level: none, url (default), or content-hash
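The three --deduplication levels can be read as three choices of deduplication key. The sketch below shows one plausible interpretation, not the CLI's implementation: the function names are hypothetical, and the FNV-1a hash is used only to keep the example dependency-free (the hash behind content-hash is not documented here).

```typescript
type DedupLevel = "none" | "url" | "content-hash";

// Small non-cryptographic hash (32-bit FNV-1a over UTF-16 code units),
// chosen only so the sketch needs no imports.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Returns a checker that reports whether a page was already seen
// at the configured level.
function createDeduplicator(level: DedupLevel = "url") {
  const seen = new Set<string>();
  return function isDuplicate(url: string, content: string): boolean {
    if (level === "none") return false; // never deduplicate
    const key = level === "url" ? url : fnv1a(content);
    if (seen.has(key)) return true;
    seen.add(key);
    return false;
  };
}

const byUrl = createDeduplicator("url");
console.log(byUrl("https://a.com", "hello"));   // false: first visit
console.log(byUrl("https://a.com", "changed")); // true: same URL again
```

Under this reading, content-hash would also catch the same page served under two different URLs, at the cost of hashing every page body.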

Browser

Option                             Description
--headless / --no-headless         Browser headless mode (default: headless)
--wait-until <event>               Page load event: load, domcontentloaded, networkidle, commit
--navigation-timeout <secs>        Page load timeout in seconds
--wait-for-dynamic-content <secs>  Seconds to wait for network idle after navigation (0 = disabled)
--wait-for-selector <css>          CSS selector to wait for before extracting (fails on timeout)
--soft-wait-for-selector <css>     CSS selector to wait for before extracting (continues on timeout)
--block-media / --no-block-media   Block images, stylesheets, fonts, PDFs, and ZIPs (default: off)
--ignore-cors-and-csp              Disable CORS/CSP restrictions
--ignore-https-errors              Skip SSL certificate verification
--close-cookie-modals              Auto-dismiss cookie banners
--max-scroll-height <px>           Max scroll height in pixels
--user-agent <ua>                  Custom User-Agent string

Proxy & sessions

Option                       Description
--proxy <url>                Proxy URL (repeatable)
--proxy-rotation <strategy>  Rotation: recommended, per-request, until-failure
--session-pool-name <name>   Named session pool for cross-run session sharing
--max-session-rotations <n>  Max session rotations per request on block detection

Cookies & headers

Option            Description
--cookies <json>  JSON array of cookie objects
--headers <json>  JSON object of custom HTTP headers
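Because both options take JSON strings, it can be easier to build the argument list in code than to quote JSON by hand. This is a minimal sketch under stated assumptions: the helper name is hypothetical, and the cookie fields (name, value, domain, path) follow the common browser-cookie shape, since this page only says "JSON array of cookie objects".

```typescript
// Assumed cookie shape; only "JSON array of cookie objects" is documented.
interface Cookie {
  name: string;
  value: string;
  domain?: string;
  path?: string;
}

// Hypothetical helper: build the argument list for `contextractor extract`.
function buildExtractArgs(
  url: string,
  cookies: Cookie[],
  headers: Record<string, string>,
): string[] {
  return [
    "extract",
    url,
    "--cookies",
    JSON.stringify(cookies), // JSON array of cookie objects
    "--headers",
    JSON.stringify(headers), // JSON object of custom HTTP headers
  ];
}

const args = buildExtractArgs(
  "https://example.com",
  [{ name: "session", value: "abc123", domain: "example.com", path: "/" }],
  { "Accept-Language": "en" },
);
console.log(args);
```

Passing the array to a process-spawning API that takes arguments as a list, such as Node's child_process.spawn, avoids shell quoting of the JSON values.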

Output & extraction

Option                  Description
--save <format>         Output format, repeatable: markdown (default), txt, json, html, original, all
--mode <mode>           Extraction mode: precision (less noise), balanced (default), recall (more content)
--language <lang>       Filter by language (e.g. en)
--no-links              Exclude links from output
--no-comments           Exclude comments from output
--no-tables             Exclude tables from output
--images / --no-images  Include image alt text and captions (default: off)
--store-skipped-urls    Push skipped URL records to the dataset after crawl
--verbose, -v           Enable verbose logging

JSON config

Pass --config path/to/config.json. Keys use the same camelCase shape as the Apify input schema. Orchestration flags (--proxy, --clean) are CLI-only and must be set on the command line.

contextractor extract --config config.json --max-requests-per-crawl 10

{
  "startUrls": [{ "url": "https://example.com" }],
  "headless": false,
  "maxRequestsPerCrawl": 10,
  "mode": "recall",
  "includeImages": true,
  "save": ["txt"],
  "saveDestination": ["dataset"],
  "datasetName": "my-archive"
}

Config merge order: schema defaults → config file → explicit CLI args. Unknown keys are stripped.
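The merge order can be sketched as a layered merge. This is an illustration rather than the CLI's code: the function name is hypothetical, the defaults object is a small subset using default values documented on this page, "unknown" is taken to mean "not a key of the defaults", and the merge is assumed to be shallow (a later array replaces an earlier one).

```typescript
type Settings = Record<string, unknown>;

// Illustrative subset of schema defaults, using defaults documented on this page.
const schemaDefaults: Settings = {
  mode: "balanced",
  save: ["markdown"],
  headless: true,
  datasetName: "default",
};

// Later layers win: schema defaults, then config file, then explicit CLI args.
function mergeSettings(defaults: Settings, fileConfig: Settings, cliArgs: Settings): Settings {
  const merged: Settings = { ...defaults };
  for (const layer of [fileConfig, cliArgs]) {
    for (const [key, value] of Object.entries(layer)) {
      // Keys that are not in the schema are stripped (assumption: schema = keys of defaults).
      const known = Object.prototype.hasOwnProperty.call(defaults, key);
      if (known && value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

const effective = mergeSettings(
  schemaDefaults,
  { mode: "recall", save: ["txt"], notInSchema: 1 }, // config file
  { mode: "precision" }, // explicit CLI args
);
console.log(effective);
```

In this example mode ends up as precision (the CLI layer wins), save comes from the config file, headless and datasetName keep their defaults, and notInSchema is dropped.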

Storage directory resolution

The storage directory is resolved in this order (first match wins):

  • --storage-dir CLI flag
  • CONTEXTRACTOR_STORAGE_DIR env var
  • CRAWLEE_STORAGE_DIR env var (Crawlee native compatibility)
  • ./storage if .actor/ or ./storage/ exists in the current working directory
  • ${XDG_DATA_HOME:-~/.local/share}/contextractor/storage (XDG fallback)
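The resolution order above can be expressed as a first-match chain. The sketch below is an illustration under stated assumptions, not the CLI's code: the function and field names are hypothetical, the environment and filesystem check are passed in as parameters so the example runs anywhere, paths use POSIX separators, and empty environment variables are treated as unset.

```typescript
// Inputs are injected so the sketch does not touch the real environment.
interface ResolveInput {
  storageDirFlag?: string; // value of --storage-dir, if given
  env: Record<string, string | undefined>; // e.g. process.env
  cwd: string; // current working directory
  home: string; // home directory, standing in for "~"
  exists: (path: string) => boolean; // e.g. fs.existsSync
}

function resolveStorageDir(input: ResolveInput): string {
  const { storageDirFlag, env, cwd, home, exists } = input;
  const fromContextractor = env["CONTEXTRACTOR_STORAGE_DIR"];
  const fromCrawlee = env["CRAWLEE_STORAGE_DIR"];

  if (storageDirFlag) return storageDirFlag;
  if (fromContextractor) return fromContextractor;
  if (fromCrawlee) return fromCrawlee;
  if (exists(`${cwd}/.actor`) || exists(`${cwd}/storage`)) return `${cwd}/storage`;

  // ${XDG_DATA_HOME:-~/.local/share}: fall back when unset or empty.
  const dataHome = env["XDG_DATA_HOME"] || `${home}/.local/share`;
  return `${dataHome}/contextractor/storage`;
}

const resolved = resolveStorageDir({
  env: { CRAWLEE_STORAGE_DIR: "/data/crawlee" },
  cwd: "/work/project",
  home: "/home/alice",
  exists: () => false,
});
console.log(resolved); // /data/crawlee
```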

Where to go next

  • npm library — embed extraction in your own Node.js application.
  • Apify Actor — run extraction at scale on the Apify platform.
  • Playground — paste HTML and preview extraction results in your browser.

Updated: June 3, 2026