CLI

Install via npm:

npm install -g contextractor

Extract content from a URL:

contextractor https://example.com

Command reference

contextractor [OPTIONS] [URLS...]

Crawl settings

Option	Description
`--config`, `-c`	Path to JSON config file (optional)
`--output-dir`, `-o`	Output directory (default: ./output)
`--format`, `-f`	Output format: txt, markdown, json, jsonl, xml, xmltei (default: markdown)
`--max-pages`	Max pages to crawl (0 = unlimited)
`--crawl-depth`	Max link depth from start URLs (0 = start URLs only)
`--headless` / `--no-headless`	Browser headless mode (default: headless)
`--max-concurrency`	Max parallel requests (default: 50)
`--max-retries`	Max request retries (default: 3)
`--max-results`	Max results per crawl (0 = unlimited)
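
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of contextractor itself: `build_command` is a hypothetical helper, and it assumes the options accept space-separated values (`--format markdown`), as is conventional. Only flags listed in the table above are used.

```python
import shlex


def build_command(urls, output_dir="./output", fmt="markdown",
                  crawl_depth=0, max_pages=0):
    """Return an argv list for a contextractor crawl (hypothetical helper)."""
    cmd = [
        "contextractor",
        "--output-dir", output_dir,
        "--format", fmt,
        "--crawl-depth", str(crawl_depth),
        "--max-pages", str(max_pages),
    ]
    # URLs are positional and follow the options, matching
    # `contextractor [OPTIONS] [URLS...]`.
    cmd.extend(urls)
    return cmd


cmd = build_command(["https://example.com"], crawl_depth=1, max_pages=20)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once contextractor is installed; passing a list rather than a single string avoids shell quoting issues.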

Proxy

Option	Description
`--proxy-urls`	Comma-separated proxy URLs (http://user:pass@host:port)
`--proxy-rotation`	Rotation: recommended, per_request, until_failure

Browser

Option	Description
`--launcher`	Browser engine: chromium, firefox (default: chromium)
`--wait-until`	Page load event: load, networkidle, domcontentloaded (default: load)
`--page-load-timeout`	Timeout in seconds (default: 60)
`--ignore-cors`	Disable CORS/CSP restrictions
`--close-cookie-modals`	Auto-dismiss cookie banners
`--max-scroll-height`	Max scroll height in pixels (default: 5000; 0 = disable)
`--ignore-ssl-errors`	Skip SSL certificate verification
`--user-agent`	Custom User-Agent string

Crawl filtering

Option	Description
`--globs`	Comma-separated glob patterns to include
`--excludes`	Comma-separated glob patterns to exclude
`--link-selector`	CSS selector for links to follow
`--keep-url-fragments`	Preserve URL fragments
`--respect-robots-txt`	Honor robots.txt
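
To build intuition for how include and exclude patterns might interact, here is a rough sketch. It is an approximation under stated assumptions, not contextractor's actual matching code: it assumes that an exclude match always wins, that an empty include list allows everything, and it uses Python's `fnmatch`, whose `*` also matches `/` and may therefore differ from the tool's glob semantics. `should_follow` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def should_follow(url, globs, excludes):
    """Approximate include/exclude filtering (assumed semantics)."""
    # Assumption: any matching exclude pattern rejects the URL outright.
    if any(fnmatchcase(url, pattern) for pattern in excludes):
        return False
    # Assumption: with no include patterns, every non-excluded URL passes.
    if not globs:
        return True
    return any(fnmatchcase(url, pattern) for pattern in globs)


globs = ["https://example.com/blog/**"]
excludes = ["https://example.com/blog/archive/**"]

for url in [
    "https://example.com/blog/post-1",
    "https://example.com/blog/archive/2020",
    "https://example.com/about",
]:
    print(url, should_follow(url, globs, excludes))
```

Under these assumptions only the first URL would be followed: the second is excluded, and the third matches no include pattern.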

Cookies & headers

Option	Description
`--cookies`	JSON array of cookie objects
`--headers`	JSON object of custom HTTP headers
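
Because both options take JSON, hand-quoting them in a shell is error-prone. One way around this, sketched below, is to serialize the values with `json.dumps` and pass them as separate argv elements. `json_option_args` is a hypothetical helper, and the cookie and header values are placeholders taken from the config example in this document.

```python
import json

cookies = [{"name": "session", "value": "abc", "domain": ".example.com"}]
headers = {"Authorization": "Bearer token"}


def json_option_args(cookies, headers):
    """Encode cookies and headers as JSON strings for the CLI (hypothetical helper)."""
    return [
        "--cookies", json.dumps(cookies),   # JSON array of cookie objects
        "--headers", json.dumps(headers),   # JSON object of header name/value pairs
    ]


# Each JSON string stays a single argv element, so no shell escaping is needed.
args = ["contextractor", *json_option_args(cookies, headers), "https://example.com"]
print(args)
```

Passing `args` as a list to `subprocess.run` keeps each JSON string intact regardless of the quotes and spaces it contains.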

Output toggles

Option	Description
`--save-raw-html`	Save raw HTML to output
`--save-text`	Save extracted text
`--save-json`	Save extracted JSON
`--save-xml`	Save extracted XML
`--save-xml-tei`	Save extracted XML-TEI

Content extraction

Option	Description
`--precision`	High precision mode (less noise)
`--recall`	High recall mode (more content)
`--fast`	Fast extraction mode (less thorough)
`--no-links`	Exclude links from output
`--no-comments`	Exclude comments from output
`--include-tables` / `--no-tables`	Include tables (default: include)
`--include-images`	Include image descriptions
`--include-formatting` / `--no-formatting`	Preserve formatting (default: preserve)
`--deduplicate`	Deduplicate extracted content
`--target-language`	Filter by language (e.g. "en")
`--with-metadata` / `--no-metadata`	Extract metadata (default: with)
`--prune-xpath`	XPath patterns to remove from content

Diagnostics

Option	Description
`--verbose`, `-v`	Enable verbose logging

Config file

Instead of passing every option on the command line, you can put them in a JSON config file and pass its path with `--config`:

contextractor --config config.json

An example config.json:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "outputFormat": "markdown",
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "launcher": "chromium",
  "waitUntil": "load",
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxConcurrency": 50,
  "maxRetries": 3,
  "cookies": [{"name": "session", "value": "abc", "domain": ".example.com"}],
  "headers": {"Authorization": "Bearer token"},
  "saveRawHtml": false,
  "saveText": false,
  "extraction": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}
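
If the settings are computed by a script, the config file can be generated rather than written by hand. The sketch below is illustrative only: `write_config` is a hypothetical helper, and the field names are a subset of those documented on this page. Note that Python's `True` is serialized as JSON `true`.

```python
import json
from pathlib import Path

config = {
    "urls": ["https://example.com"],
    "outputFormat": "markdown",
    "outputDir": "./output",
    "crawlDepth": 1,
    "globs": ["https://example.com/blog/**"],
    # Extraction fields are nested under the "extraction" key.
    "extraction": {"favorPrecision": True, "includeTables": True},
}


def write_config(path, config):
    """Serialize the settings dict to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path


config_path = write_config("config.json", config)
print(f"Wrote {config_path}")
```

The generated file can then be used with `contextractor --config config.json`.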

Crawl settings

Field	Type	Default	Description
`urls`	array	[]	URLs to extract
`maxPages`	int	0	Max pages (0 = unlimited)
`outputFormat`	string	"markdown"	Output format
`outputDir`	string	"./output"	Output directory
`crawlDepth`	int	0	Max link depth from start URLs (0 = start URLs only)
`headless`	bool	true	Browser headless mode
`maxConcurrency`	int	50	Max parallel requests
`maxRetries`	int	3	Max request retries
`maxResults`	int	0	Max results (0 = unlimited)

Proxy

Field	Type	Default	Description
`proxy.urls`	array	[]	Proxy URLs
`proxy.rotation`	string	"recommended"	recommended, per_request, until_failure
`proxy.tiered`	array	[]	Tiered proxy escalation (config-file only)

Browser

Field	Type	Default	Description
`launcher`	string	"chromium"	chromium or firefox
`waitUntil`	string	"load"	load, networkidle, domcontentloaded
`pageLoadTimeout`	int	60	Timeout in seconds
`ignoreCors`	bool	false	Disable CORS/CSP
`closeCookieModals`	bool	false	Auto-dismiss cookie banners
`maxScrollHeight`	int	5000	Max scroll pixels (0 = disable)
`ignoreSslErrors`	bool	false	Skip SSL verification
`userAgent`	string	""	Custom User-Agent string

Crawl filtering

Field	Type	Default	Description
`globs`	array	[]	Glob patterns to include
`excludes`	array	[]	Glob patterns to exclude
`linkSelector`	string	""	CSS selector for links to follow
`keepUrlFragments`	bool	false	Preserve URL fragments
`respectRobotsTxt`	bool	false	Honor robots.txt

Cookies & headers

Field	Type	Default	Description
`cookies`	array	[]	Initial cookies
`headers`	object	{}	Custom HTTP headers

Output toggles

Field	Type	Default	Description
`saveRawHtml`	bool	false	Save raw HTML
`saveText`	bool	false	Save plain text
`saveJson`	bool	false	Save JSON
`saveXml`	bool	false	Save XML
`saveXmlTei`	bool	false	Save XML-TEI

Extraction options

These fields go under the `extraction` key in a config file. Each has an equivalent CLI flag.

Field	Type	Default	Description
`favorPrecision`	bool	false	High precision, less noise
`favorRecall`	bool	false	High recall, more content
`fast`	bool	false	Fast mode (less thorough)
`includeComments`	bool	true	Include comments
`includeTables`	bool	true	Include tables
`includeImages`	bool	false	Include image descriptions
`includeFormatting`	bool	true	Preserve formatting
`includeLinks`	bool	true	Include links
`deduplicate`	bool	false	Deduplicate content
`withMetadata`	bool	true	Extract metadata
`targetLanguage`	string	null	Filter by language
`pruneXpath`	array	null	XPath patterns to remove

CLI flags override config file settings. Settings are merged in this order, with later sources taking precedence: defaults → config file → CLI args.
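
The merge order can be pictured as three layers of settings applied in sequence. The sketch below illustrates the idea and is not contextractor's implementation: `merge_settings` is a hypothetical function, it assumes that a CLI option which was not passed is represented as `None`, and it only handles flat top-level keys (how nested keys such as `proxy` or `extraction` are merged is not specified here).

```python
def merge_settings(defaults, config_file, cli_args):
    """Merge three setting layers; later layers win (illustrative sketch)."""
    merged = dict(defaults)
    merged.update(config_file)
    # Assumption: an option not passed on the command line is None
    # and must not overwrite a value from an earlier layer.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


defaults = {"outputFormat": "markdown", "maxConcurrency": 50,
            "maxRetries": 3, "crawlDepth": 0}
config_file = {"crawlDepth": 1, "maxRetries": 5}
cli_args = {"maxRetries": 1, "outputFormat": None}

settings = merge_settings(defaults, config_file, cli_args)
print(settings)
```

Here `maxRetries` ends up as 1 (from the CLI), `crawlDepth` as 1 (from the config file), and `outputFormat` and `maxConcurrency` keep their defaults.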

Updated: March 26, 2026