CLI

Install via npm:

npm install -g contextractor

Extract content from a URL:

contextractor https://example.com

Command reference

contextractor [OPTIONS] [URLS...]

Crawl settings

| Option | Description |
| --- | --- |
| --config, -c | Path to JSON config file (optional) |
| --output-dir, -o | Output directory |
| --format, -f | Output format: txt, markdown, json, jsonl, xml, xmltei |
| --max-pages | Max pages to crawl (0 = unlimited) |
| --crawl-depth | Max link depth from start URLs (0 = start only) |
| --headless / --no-headless | Browser headless mode (default: headless) |
| --max-concurrency | Max parallel requests (default: 50) |
| --max-retries | Max request retries (default: 3) |
| --max-results | Max results per crawl (0 = unlimited) |
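For example, a depth-limited crawl of a documentation site might look like this (the URL and output directory are illustrative; the flags are from the table above):

```shell
contextractor --output-dir ./docs-dump --format markdown \
  --crawl-depth 2 --max-pages 100 --max-concurrency 10 \
  https://docs.example.com
```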

Proxy

| Option | Description |
| --- | --- |
| --proxy-urls | Comma-separated proxy URLs (http://user:pass@host:port) |
| --proxy-rotation | Rotation: recommended, per_request, until_failure |
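Rotating across several proxies per request might look like this (credentials and hosts are placeholders, following the URL format in the table above):

```shell
contextractor --proxy-urls "http://user:pass@host1:8080,http://user:pass@host2:8080" \
  --proxy-rotation per_request https://example.com
```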

Browser

| Option | Description |
| --- | --- |
| --launcher | Browser engine: chromium, firefox (default: chromium) |
| --wait-until | Page load event: load, networkidle, domcontentloaded (default: load) |
| --page-load-timeout | Timeout in seconds (default: 60) |
| --ignore-cors | Disable CORS/CSP restrictions |
| --close-cookie-modals | Auto-dismiss cookie banners |
| --max-scroll-height | Max scroll height in pixels (default: 5000) |
| --ignore-ssl-errors | Skip SSL certificate verification |
| --user-agent | Custom User-Agent string |

Crawl filtering

| Option | Description |
| --- | --- |
| --globs | Comma-separated glob patterns to include |
| --excludes | Comma-separated glob patterns to exclude |
| --link-selector | CSS selector for links to follow |
| --keep-url-fragments | Preserve URL fragments |
| --respect-robots-txt | Honor robots.txt |
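Include and exclude patterns can be combined; this mirrors the glob patterns used in the config-file example later in this document (URLs illustrative):

```shell
contextractor --globs "https://example.com/blog/**" \
  --excludes "https://example.com/blog/archive/**" \
  https://example.com
```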

Cookies & headers

| Option | Description |
| --- | --- |
| --cookies | JSON array of cookie objects |
| --headers | JSON object of custom HTTP headers |
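Because both flags take JSON, quote the values carefully in the shell. A sketch reusing the cookie and header values from the config-file example below; the final invocation is shown commented out:

```shell
# Single quotes keep the inner double quotes intact for the JSON parser.
COOKIES='[{"name": "session", "value": "abc", "domain": ".example.com"}]'
HEADERS='{"Authorization": "Bearer token"}'

# Sanity-check the JSON before passing it to the CLI:
echo "$COOKIES" | python3 -m json.tool > /dev/null && echo "cookies: valid JSON"
echo "$HEADERS" | python3 -m json.tool > /dev/null && echo "headers: valid JSON"

# contextractor --cookies "$COOKIES" --headers "$HEADERS" https://example.com
```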

Output toggles

| Option | Description |
| --- | --- |
| --save-raw-html | Save raw HTML to output |
| --save-text | Save extracted text |
| --save-json | Save extracted JSON |
| --save-xml | Save extracted XML |
| --save-xml-tei | Save extracted XML-TEI |

Content extraction

| Option | Description |
| --- | --- |
| --precision | High precision mode (less noise) |
| --recall | High recall mode (more content) |
| --fast | Fast extraction mode (less thorough) |
| --no-links | Exclude links from output |
| --no-comments | Exclude comments from output |
| --include-tables / --no-tables | Include tables (default: include) |
| --include-images | Include image descriptions |
| --include-formatting / --no-formatting | Preserve formatting (default: preserve) |
| --deduplicate | Deduplicate extracted content |
| --target-language | Filter by language (e.g. "en") |
| --with-metadata / --no-metadata | Extract metadata (default: with) |
| --prune-xpath | XPath patterns to remove from content |
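A precision-oriented extraction that drops comments, deduplicates, and keeps only English content might look like this (URL illustrative):

```shell
contextractor --precision --no-comments --deduplicate \
  --target-language en https://example.com
```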

Diagnostics

| Option | Description |
| --- | --- |
| --verbose, -v | Enable verbose logging |

Config file

Instead of passing all options on the command line, you can use a JSON config file:

contextractor --config config.json

Example config.json:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "outputFormat": "markdown",
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "launcher": "chromium",
  "waitUntil": "load",
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxConcurrency": 50,
  "maxRetries": 3,
  "cookies": [{"name": "session", "value": "abc", "domain": ".example.com"}],
  "headers": {"Authorization": "Bearer token"},
  "saveRawHtml": false,
  "saveText": false,
  "extraction": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}

Crawl settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array | [] | URLs to extract |
| maxPages | int | 0 | Max pages (0 = unlimited) |
| outputFormat | string | "markdown" | Output format |
| outputDir | string | "./output" | Output directory |
| crawlDepth | int | 0 | Link follow depth |
| headless | bool | true | Browser headless mode |
| maxConcurrency | int | 50 | Max parallel requests |
| maxRetries | int | 3 | Max request retries |
| maxResults | int | 0 | Max results (0 = unlimited) |

Proxy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| proxy.urls | array | [] | Proxy URLs |
| proxy.rotation | string | "recommended" | recommended, per_request, until_failure |
| proxy.tiered | array | [] | Tiered proxy escalation (config-file only) |

Browser

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| launcher | string | "chromium" | chromium or firefox |
| waitUntil | string | "load" | load, networkidle, domcontentloaded |
| pageLoadTimeout | int | 60 | Timeout in seconds |
| ignoreCors | bool | false | Disable CORS/CSP |
| closeCookieModals | bool | false | Auto-dismiss cookie banners |
| maxScrollHeight | int | 5000 | Max scroll pixels (0 = disable) |
| ignoreSslErrors | bool | false | Skip SSL verification |
| userAgent | string | "" | Custom User-Agent string |

Crawl filtering

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| globs | array | [] | Glob patterns to include |
| excludes | array | [] | Glob patterns to exclude |
| linkSelector | string | "" | CSS selector for links |
| keepUrlFragments | bool | false | Preserve URL fragments |
| respectRobotsTxt | bool | false | Honor robots.txt |

Cookies & headers

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| cookies | array | [] | Initial cookies |
| headers | object | {} | Custom HTTP headers |

Output toggles

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| saveRawHtml | bool | false | Save raw HTML |
| saveText | bool | false | Save plain text |
| saveJson | bool | false | Save JSON |
| saveXml | bool | false | Save XML |
| saveXmlTei | bool | false | Save XML-TEI |

Extraction options

Used in config files under the extraction key. All have equivalent CLI flags.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| favorPrecision | bool | false | High precision, less noise |
| favorRecall | bool | false | High recall, more content |
| fast | bool | false | Fast mode (less thorough) |
| includeComments | bool | true | Include comments |
| includeTables | bool | true | Include tables |
| includeImages | bool | false | Include images |
| includeFormatting | bool | true | Preserve formatting |
| includeLinks | bool | true | Include links |
| deduplicate | bool | false | Deduplicate content |
| withMetadata | bool | true | Extract metadata |
| targetLanguage | string | null | Filter by language |
| pruneXpath | array | null | XPath patterns to remove |

CLI flags override config-file settings. Merge order: defaults → config file → CLI args.
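The merge order can be sketched in shell terms (variable names are illustrative, not part of the tool): for each option, a CLI value wins if set, then the config-file value, then the built-in default. Using crawlDepth, whose default is 0 and whose value in the config-file example above is 1:

```shell
# Merge order sketch: defaults → config file → CLI args (later sources win).
depth_default="0"   # built-in default for crawlDepth
depth_config="1"    # crawlDepth from the config-file example above
depth_cli=""        # empty: no --crawl-depth flag on the command line

# Take the first value that is set, scanning from highest precedence down:
depth="${depth_cli:-${depth_config:-$depth_default}}"
echo "$depth"   # prints 1: the config file overrides the default
```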

Updated: March 26, 2026