Apify Actor

Contextractor is available as an Apify actor for extracting content from multiple URLs, crawling entire websites, and running on a schedule. Results are stored in Apify datasets and key-value stores and are accessible via the Apify API.

Quick start

Open the actor page on Apify, add your start URLs, and click Start.

From the command line:

apify call glueo/contextractor --input='{"startUrls": [{"url": "https://example.com"}]}'

Or via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/glueo~contextractor/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}]}'

Input reference

Crawl settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | URLs to extract content from |
| maxPagesPerCrawl | integer | 0 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlingDepth | integer | 0 | Maximum link depth from start URLs (0 = unlimited) |
| maxConcurrency | integer | 50 | Maximum parallel browser pages |
| maxRequestRetries | integer | 3 | Retries for failed requests |
| maxResultsPerCrawl | integer | 0 | Maximum results saved to dataset (0 = unlimited) |
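For example, the settings above can be combined to bound a crawl. The values below are illustrative, not recommendations:

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 200,
  "maxCrawlingDepth": 2,
  "maxConcurrency": 10,
  "maxRequestRetries": 5
}
```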

Crawl filtering

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| globs | array | [] | Glob patterns matching URLs to include |
| excludes | array | [] | Glob patterns matching URLs to exclude |
| pseudoUrls | array | [] | Pseudo-URL patterns (alternative to globs) |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | boolean | false | Treat URLs with different fragments as different pages |
| respectRobotsTxtFile | boolean | false | Honor robots.txt rules |
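A sketch of combining these filters to restrict a crawl to one documentation tree while honoring robots.txt (the URLs and selector are illustrative):

```json
{
  "startUrls": [{"url": "https://example.com/docs"}],
  "globs": ["https://example.com/docs/**"],
  "excludes": ["https://example.com/docs/changelog/**"],
  "linkSelector": "a[href]",
  "respectRobotsTxtFile": true
}
```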

Content extraction

Extraction settings are passed as a JSON object in trafilaturaConfig. Leave empty for balanced defaults.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| favorPrecision | boolean | false | High precision, less noise |
| favorRecall | boolean | false | High recall, more content |
| fast | boolean | false | Fast mode (less thorough) |
| includeComments | boolean | true | Include page comments |
| includeTables | boolean | true | Include tables |
| includeImages | boolean | false | Include image descriptions |
| includeFormatting | boolean | true | Preserve formatting |
| includeLinks | boolean | true | Include links |
| deduplicate | boolean | false | Deduplicate extracted content |
| withMetadata | boolean | true | Extract page metadata |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| pruneXpath | array | null | XPath patterns to remove from content |

Example:

{
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeTables": true,
    "deduplicate": true,
    "targetLanguage": "en"
  }
}

Output settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| saveExtractedMarkdownToKeyValueStore | boolean | true | Save extracted Markdown |
| saveRawHtmlToKeyValueStore | boolean | false | Save raw HTML |
| saveExtractedTextToKeyValueStore | boolean | false | Save extracted plain text |
| saveExtractedJsonToKeyValueStore | boolean | false | Save extracted JSON with metadata |
| saveExtractedXmlToKeyValueStore | boolean | false | Save extracted XML |
| saveExtractedXmlTeiToKeyValueStore | boolean | false | Save extracted XML-TEI (scholarly format) |
| datasetName | string | "" | Named dataset for results (empty = default run dataset) |
| keyValueStoreName | string | "" | Named key-value store for content files (empty = default) |
| requestQueueName | string | "" | Named request queue for pending URLs (empty = default) |
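A sketch of an output configuration that saves both Markdown and JSON into named storages so results persist across runs (the storage names here are placeholders):

```json
{
  "saveExtractedMarkdownToKeyValueStore": true,
  "saveExtractedJsonToKeyValueStore": true,
  "saveRawHtmlToKeyValueStore": false,
  "datasetName": "blog-extracts",
  "keyValueStoreName": "blog-content"
}
```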

Browser settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| launcher | string | CHROMIUM | Browser engine: CHROMIUM or FIREFOX |
| headless | boolean | true | Run browser in headless mode |
| waitUntil | string | LOAD | Navigation event: LOAD, NETWORKIDLE, DOMCONTENTLOADED |
| pageLoadTimeoutSecs | integer | 60 | Page load timeout in seconds |
| ignoreCorsAndCsp | boolean | false | Disable CORS/CSP restrictions |
| closeCookieModals | boolean | true | Auto-dismiss cookie consent banners |
| maxScrollHeightPixels | integer | 5000 | Max scroll height in pixels (0 = disable) |
| userAgent | string | "" | Custom User-Agent string |
| ignoreSslErrors | boolean | false | Skip SSL certificate verification |
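For heavily scripted pages, waiting for network idle and allowing more time to load can help. A sketch with illustrative values:

```json
{
  "launcher": "FIREFOX",
  "waitUntil": "NETWORKIDLE",
  "pageLoadTimeoutSecs": 120,
  "maxScrollHeightPixels": 10000,
  "closeCookieModals": true
}
```

Note that NETWORKIDLE is slower but more reliable for single-page applications that render content after the initial load event.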

Proxy

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| proxyConfiguration | object | | Apify proxy settings (use the proxy editor in Console) |
| proxyRotation | string | RECOMMENDED | Rotation: RECOMMENDED, PER_REQUEST, UNTIL_FAILURE |
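A sketch of a proxy setup that rotates on every request; the apifyProxyGroups key and the RESIDENTIAL group name follow the usual Apify proxy input shape and are assumptions here — check the proxy editor in Console for the fields your plan supports:

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "proxyRotation": "PER_REQUEST"
}
```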

Cookies & headers

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| initialCookies | array | [] | Cookies pre-set on all pages (JSON array of cookie objects) |
| customHttpHeaders | object | {} | Custom HTTP headers added to all requests |
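A sketch of pre-setting a session cookie and a custom header; the cookie fields follow the standard browser cookie object shape (name, value, domain, path), and all values here are placeholders:

```json
{
  "initialCookies": [
    {"name": "session", "value": "abc123", "domain": ".example.com", "path": "/"}
  ],
  "customHttpHeaders": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}
```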

Diagnostics

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| debugLog | boolean | false | Include debug messages in log output |
| browserLog | boolean | false | Include browser console messages in log |

Output

Results are stored in an Apify dataset. Each item contains page metadata and references to extracted content files in the key-value store:

{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2026-04-11T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "Author Name",
    "publishedAt": "2026-01-15",
    "description": "Article description",
    "siteName": "Example",
    "lang": "en"
  },
  "extractedMarkdown": {
    "key": "example-com-article_markdown",
    "url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/example-com-article_markdown",
    "hash": "a1b2c3d4",
    "length": 4523
  }
}

Extracted content files (Markdown, text, JSON, XML) are stored in the key-value store. Access them via the URL in each dataset item or download them from the Apify Console.

Example

Crawl all blog posts from a site with glob filtering:

{
  "startUrls": [{"url": "https://example.com/blog"}],
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxPagesPerCrawl": 100,
  "maxCrawlingDepth": 2,
  "trafilaturaConfig": {
    "favorPrecision": true,
    "deduplicate": true
  },
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

CLI and Docker alternatives

The same extraction engine is available as an npm package and a Docker image for local use. All extraction options behave identically across the CLI, Docker, and Apify interfaces.

Updated: April 11, 2026