Apify Actor

Contextractor is available as an Apify actor for extracting content from multiple URLs, crawling entire websites, and running on a schedule. Results are stored in Apify datasets and key-value stores and are accessible via the Apify API.

Quick start

Open the actor page on Apify, add your start URLs, and click Start.

From the command line:

apify call glueo/contextractor --input='{"startUrls": [{"url": "https://example.com"}]}'

Or via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/glueo~contextractor/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}]}'

Input reference

Crawl settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | URLs to extract content from |
| maxPagesPerCrawl | integer | 0 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlingDepth | integer | 0 | Maximum link depth from start URLs (0 = unlimited) |
| maxConcurrency | integer | 50 | Maximum parallel browser pages |
| maxRequestRetries | integer | 3 | Retries for failed requests |
| maxResultsPerCrawl | integer | 0 | Maximum results saved to dataset (0 = unlimited) |
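For example, the settings above can be combined to bound a crawl. The values below are illustrative, not recommendations:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 200,
  "maxCrawlingDepth": 2,
  "maxConcurrency": 10,
  "maxRequestRetries": 5
}
```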

Crawl filtering

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| globs | array | [] | Glob patterns matching URLs to include |
| excludes | array | [] | Glob patterns matching URLs to exclude |
| pseudoUrls | array | [] | Pseudo-URL patterns (alternative to globs) |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | boolean | false | Treat URLs with different fragments as different pages |
| respectRobotsTxtFile | boolean | false | Honor robots.txt rules |
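A sketch of combining these filters to restrict a crawl to one documentation tree while honoring robots.txt (the URLs and selector are illustrative):

```json
{
  "startUrls": [{"url": "https://example.com/docs"}],
  "globs": ["https://example.com/docs/**"],
  "excludes": ["https://example.com/docs/changelog/**"],
  "linkSelector": "a[href]",
  "respectRobotsTxtFile": true
}
```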

Content extraction

Extraction settings are passed as a JSON object in trafilaturaConfig. Leave empty for balanced defaults.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| favorPrecision | boolean | false | High precision, less noise |
| favorRecall | boolean | false | High recall, more content |
| fast | boolean | false | Fast mode (less thorough) |
| includeComments | boolean | true | Include page comments |
| includeTables | boolean | true | Include tables |
| includeImages | boolean | false | Include image descriptions |
| includeFormatting | boolean | true | Preserve formatting |
| includeLinks | boolean | true | Include links |
| deduplicate | boolean | false | Deduplicate extracted content |
| withMetadata | boolean | true | Extract page metadata |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| pruneXpath | array | null | XPath patterns to remove from content |

Example:

{
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeTables": true,
    "deduplicate": true,
    "targetLanguage": "en"
  }
}

Output settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| saveExtractedMarkdownToKeyValueStore | boolean | true | Save extracted Markdown |
| saveRawHtmlToKeyValueStore | boolean | false | Save raw HTML |
| saveExtractedTextToKeyValueStore | boolean | false | Save extracted plain text |
| saveExtractedJsonToKeyValueStore | boolean | false | Save extracted JSON with metadata |
| saveExtractedXmlToKeyValueStore | boolean | false | Save extracted XML |
| saveExtractedXmlTeiToKeyValueStore | boolean | false | Save extracted XML-TEI (scholarly format) |
| datasetName | string | "" | Named dataset for results (empty = default run dataset) |
| keyValueStoreName | string | "" | Named key-value store for content files (empty = default) |
| requestQueueName | string | "" | Named request queue for pending URLs (empty = default) |
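A sketch of an output configuration that saves both Markdown and JSON into named storages so results persist across runs (the storage names here are placeholders):

```json
{
  "saveExtractedMarkdownToKeyValueStore": true,
  "saveExtractedJsonToKeyValueStore": true,
  "saveRawHtmlToKeyValueStore": false,
  "datasetName": "blog-extracts",
  "keyValueStoreName": "blog-content"
}
```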

Browser settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| launcher | string | CHROMIUM | Browser engine: CHROMIUM or FIREFOX |
| headless | boolean | true | Run browser in headless mode |
| waitUntil | string | LOAD | Navigation event: LOAD, NETWORKIDLE, DOMCONTENTLOADED |
| pageLoadTimeoutSecs | integer | 60 | Page load timeout in seconds |
| ignoreCorsAndCsp | boolean | false | Disable CORS/CSP restrictions |
| closeCookieModals | boolean | true | Auto-dismiss cookie consent banners |
| maxScrollHeightPixels | integer | 5000 | Max scroll height in pixels (0 = disable) |
| userAgent | string | "" | Custom User-Agent string |
| ignoreSslErrors | boolean | false | Skip SSL certificate verification |
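For heavily scripted pages, waiting for network idle and allowing more time to load can help. A sketch with illustrative values:

```json
{
  "launcher": "FIREFOX",
  "waitUntil": "NETWORKIDLE",
  "pageLoadTimeoutSecs": 120,
  "maxScrollHeightPixels": 10000,
  "closeCookieModals": true
}
```

Note that NETWORKIDLE is slower but more reliable for single-page applications that render content after the initial load event.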

Proxy

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| proxyConfiguration | object | | Apify proxy settings (use the proxy editor in Console) |
| proxyRotation | string | RECOMMENDED | Rotation: RECOMMENDED, PER_REQUEST, UNTIL_FAILURE |
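A sketch of a proxy setup that rotates on every request; the apifyProxyGroups key and the RESIDENTIAL group name follow the usual Apify proxy input shape and are assumptions here — check the proxy editor in Console for the fields your plan supports:

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "proxyRotation": "PER_REQUEST"
}
```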

Cookies & headers

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| initialCookies | array | [] | Cookies pre-set on all pages (JSON array of cookie objects) |
| customHttpHeaders | object | {} | Custom HTTP headers added to all requests |
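A sketch of pre-setting a session cookie and a custom header; the cookie fields follow the standard browser cookie object shape (name, value, domain, path), and all values here are placeholders:

```json
{
  "initialCookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com", "path": "/"}
  ],
  "customHttpHeaders": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}
```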

Diagnostics

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| debugLog | boolean | false | Include debug messages in log output |
| browserLog | boolean | false | Include browser console messages in log |

Output

Results are stored in an Apify dataset. Each item contains page metadata and references to extracted content files in the key-value store:

{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2026-04-11T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "Author Name",
    "publishedAt": "2026-01-15",
    "description": "Article description",
    "siteName": "Example",
    "lang": "en"
  },
  "extractedMarkdown": {
    "key": "example-com-article_markdown",
    "url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/example-com-article_markdown",
    "hash": "a1b2c3d4",
    "length": 4523
  }
}

Extracted content files (Markdown, text, JSON, XML) are stored in the key-value store. Access them via the URL in each dataset item or download them from the Apify Console.

Example

Crawl all blog posts from a site with glob filtering:

{
  "startUrls": [{"url": "https://example.com/blog"}],
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxPagesPerCrawl": 100,
  "maxCrawlingDepth": 2,
  "trafilaturaConfig": {
    "favorPrecision": true,
    "deduplicate": true
  },
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

CLI and Docker alternatives

The same extraction engine is available as an npm package and a Docker image for local use. All extraction options behave identically across the CLI, Docker, and Apify interfaces.

Updated: April 11, 2026