CLI

Install via npm:

npm install -g contextractor

For pip installation and the Python library API, see the PyPI package.

Extract content from a URL:

contextractor https://example.com

Command reference

contextractor [OPTIONS] [URLS...]

Crawl settings

| Option | Description |
| --- | --- |
| --config, -c | Path to JSON config file (optional) |
| --output-dir, -o | Output directory |
| --max-pages | Max pages to crawl (0 = unlimited) |
| --crawl-depth | Max link depth from start URLs (0 = start only) |
| --headless / --no-headless | Browser headless mode (default: headless) |
| --max-concurrency | Max parallel requests (default: 50) |
| --max-retries | Max request retries (default: 3) |
| --max-results | Max results per crawl (0 = unlimited) |
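
When the CLI is driven from a script, the crawl flags above can be assembled as an argument list. The following is a minimal Python sketch, not part of contextractor itself: the helper name build_crawl_args is hypothetical, and it uses only flags listed in the table.

```python
def build_crawl_args(urls, output_dir="./output", max_pages=0, crawl_depth=0):
    """Assemble a contextractor command line from the crawl flags above.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    args = [
        "contextractor",
        "--output-dir", output_dir,
        "--max-pages", str(max_pages),      # 0 = unlimited
        "--crawl-depth", str(crawl_depth),  # 0 = start URLs only
    ]
    # Start URLs are positional and come after the options.
    return args + list(urls)


cmd = build_crawl_args(["https://example.com"], max_pages=20, crawl_depth=1)
print(" ".join(cmd))
```

Once the CLI is installed, the list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell-quoting problems with URLs.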

Proxy

| Option | Description |
| --- | --- |
| --proxy-urls | Comma-separated proxy URLs (http://user:pass@host:port) |
| --proxy-rotation | Rotation: recommended, per_request, until_failure |

Browser

| Option | Description |
| --- | --- |
| --launcher | Browser engine: chromium, firefox (default: chromium) |
| --wait-until | Page load event: load, networkidle, domcontentloaded (default: load) |
| --page-load-timeout | Timeout in seconds (default: 60) |
| --ignore-cors | Disable CORS/CSP restrictions |
| --close-cookie-modals | Auto-dismiss cookie banners |
| --max-scroll-height | Max scroll height in pixels (default: 5000) |
| --ignore-ssl-errors | Skip SSL certificate verification |
| --user-agent | Custom User-Agent string |

Crawl filtering

| Option | Description |
| --- | --- |
| --globs | Comma-separated glob patterns to include |
| --excludes | Comma-separated glob patterns to exclude |
| --link-selector | CSS selector for links to follow |
| --keep-url-fragments | Preserve URL fragments |
| --respect-robots-txt | Honor robots.txt |
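
To build intuition for how include and exclude globs interact, here is a rough Python approximation. It rests on assumptions the reference above does not confirm: that an empty include list allows every URL, that excludes take priority over includes, and that the patterns behave like Python's fnmatch wildcards. The tool's actual matcher may differ, and is_allowed is a hypothetical name.

```python
from fnmatch import fnmatchcase


def is_allowed(url, globs, excludes):
    """Return True if url matches an include glob and no exclude glob.

    Approximation only: fnmatch's "*" also matches "/", which may not
    be how contextractor interprets "*" versus "**".
    """
    # Assumption: no include patterns means every URL is a candidate.
    included = not globs or any(fnmatchcase(url, g) for g in globs)
    # Assumption: an exclude match always wins over an include match.
    excluded = any(fnmatchcase(url, g) for g in excludes)
    return included and not excluded


globs = ["https://example.com/blog/**"]
excludes = ["https://example.com/blog/archive/**"]

for url in (
    "https://example.com/blog/post-1",
    "https://example.com/blog/archive/2020",
    "https://example.com/about",
):
    print(url, is_allowed(url, globs, excludes))
```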

Cookies & headers

| Option | Description |
| --- | --- |
| --cookies | JSON array of cookie objects |
| --headers | JSON object of custom HTTP headers |
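
Because both flags take JSON, hand-quoting them in a shell is error-prone. One way around that, sketched here in Python, is to serialize the values with json.dumps and pass them as separate list items. The helper name cookie_header_args is hypothetical; the cookie fields (name, value, domain) are taken from the config file example in this document.

```python
import json


def cookie_header_args(cookies, headers):
    """Serialize cookies and headers into the JSON strings the flags expect."""
    return [
        "--cookies", json.dumps(cookies),   # JSON array of cookie objects
        "--headers", json.dumps(headers),   # JSON object of header names to values
    ]


args = cookie_header_args(
    cookies=[{"name": "session", "value": "abc", "domain": ".example.com"}],
    headers={"Authorization": "Bearer token"},
)
print(args)
```

The resulting list can be appended to a command such as ["contextractor", "https://example.com"] before it is handed to subprocess.run.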

Output formats

| Option | Description |
| --- | --- |
| --save | Output formats, comma-separated: markdown, html, text, json, jsonl, xml, xml-tei, all (default: markdown) |

Content extraction

| Option | Description |
| --- | --- |
| --precision | High precision mode (less noise) |
| --recall | High recall mode (more content) |
| --fast | Fast extraction mode (less thorough) |
| --no-links | Exclude links from output |
| --no-comments | Exclude comments from output |
| --include-tables / --no-tables | Include tables (default: include) |
| --include-images | Include image descriptions |
| --include-formatting / --no-formatting | Preserve formatting (default: preserve) |
| --deduplicate | Deduplicate extracted content |
| --target-language | Filter by language (e.g. "en") |
| --with-metadata / --no-metadata | Extract metadata (default: with) |
| --prune-xpath | XPath patterns to remove from content |

Diagnostics

| Option | Description |
| --- | --- |
| --verbose, -v | Enable verbose logging |

Config file

Instead of passing every option on the command line, you can put them in a JSON config file and point the CLI at it:

contextractor --config config.json

Example config.json:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "outputDir": "./output",
  "save": ["markdown", "json"],
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "launcher": "chromium",
  "waitUntil": "load",
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxConcurrency": 50,
  "maxRequestRetries": 3,
  "cookies": [{"name": "session", "value": "abc", "domain": ".example.com"}],
  "headers": {"Authorization": "Bearer token"},
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}
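
A config file like the one above can also be generated from a script. The sketch below is an illustration, not a contextractor API: it writes a smaller config using only field names from this page, then reloads it to confirm the file is valid JSON. It writes to a temporary directory, which is an arbitrary choice.

```python
import json
import tempfile
from pathlib import Path

# Field names come from the config reference on this page.
config = {
    "urls": ["https://example.com"],
    "outputDir": "./output",
    "save": ["markdown", "json"],
    "crawlDepth": 1,
    "trafilaturaConfig": {"favorPrecision": True, "includeTables": True},
}

# Write to a scratch directory so nothing in the working tree is overwritten.
path = Path(tempfile.mkdtemp()) / "config.json"
path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Reload to confirm the file parses before handing it to the CLI.
loaded = json.loads(path.read_text(encoding="utf-8"))
print(path, sorted(loaded))
```

The printed path is what would be passed to contextractor --config.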

Crawl settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array | [] | URLs to extract |
| maxPages | int | 0 | Max pages (0 = unlimited) |
| outputDir | string | "./output" | Output directory |
| crawlDepth | int | 0 | Link follow depth |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel requests |
| maxRequestRetries | int | 3 | Max request retries |
| maxResultsPerCrawl | int | 0 | Max results (0 = unlimited) |

Proxy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| proxy.urls | array | [] | Proxy URLs |
| proxy.rotation | string | "recommended" | recommended, per_request, until_failure |
| proxy.tiered | array | [] | Tiered proxy escalation (config file only) |

Browser

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| launcher | string | "chromium" | chromium or firefox |
| waitUntil | string | "load" | load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP |
| closeCookieModals | bool | true | Auto-dismiss cookie banners |
| maxScrollHeight | int | 5000 | Max scroll pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL verification |
| userAgent | string | "" | Custom User-Agent string |

Crawl filtering

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| globs | array | [] | Glob patterns to include |
| excludes | array | [] | Glob patterns to exclude |
| linkSelector | string | "" | CSS selector for links |
| keepUrlFragments | bool | false | Preserve URL fragments |
| respectRobotsTxt | bool | false | Honor robots.txt |

Cookies & headers

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| cookies | array | [] | Initial cookies |
| headers | object | {} | Custom HTTP headers |

Output formats

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| save | array | ["markdown"] | Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all |

Extraction options

Used in config files under the trafilaturaConfig key. Most have an equivalent CLI flag listed under Content extraction above; fields marked config file only do not.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| fast | bool | false | Fast mode (less thorough) |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata |
| targetLanguage | string | null | Filter by language |
| onlyWithMetadata | bool | false | Only keep content with metadata |
| teiValidation | bool | false | Validate TEI output |
| pruneXpath | array | null | XPath patterns to remove |
| urlBlacklist | array | null | URLs to exclude from extraction (config file only) |
| authorBlacklist | array | null | Authors to exclude from extraction (config file only) |
| dateExtractionParams | object | null | Custom date extraction parameters (config file only) |

CLI flags override config file settings, which in turn override the built-in defaults. Merge order: defaults → config file → CLI args.
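
The merge order can be pictured as successive dictionary updates in which later sources win. This Python sketch illustrates the stated precedence only and is not the tool's actual implementation: merge_settings is a hypothetical name, the merge is shallow, and this page does not say whether nested keys such as proxy or trafilaturaConfig are merged key by key. The default values are taken from the tables above.

```python
DEFAULTS = {"crawlDepth": 0, "maxConcurrency": 50, "save": ["markdown"]}


def merge_settings(defaults, file_config, cli_args):
    """Later sources win: defaults, then config file, then CLI arguments."""
    merged = dict(defaults)
    merged.update(file_config)
    # Assumption: an option the user did not pass is represented as None
    # and must not clobber a value from the config file or the defaults.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


settings = merge_settings(
    DEFAULTS,
    file_config={"crawlDepth": 1, "save": ["markdown", "json"]},
    cli_args={"crawlDepth": 2, "maxConcurrency": None},
)
print(settings)
```

Here crawlDepth ends up as 2 (CLI wins), save comes from the config file, and maxConcurrency falls back to the default of 50.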

Updated: April 14, 2026