Apify Actor
Contextractor is available as an Apify actor for extracting content from multiple URLs, crawling entire websites, and running on a schedule. Results are stored in Apify datasets and key-value stores and are accessible via the Apify API.
Quick start
Open the actor page on Apify, add your start URLs, and click Start.
From the command line:
```bash
apify call glueo/contextractor --input='{"startUrls": [{"url": "https://example.com"}]}'
```
Or via the Apify API:
```bash
curl -X POST "https://api.apify.com/v2/acts/glueo~contextractor/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}]}'
```
Input reference
Crawl settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to extract content from |
| maxPagesPerCrawl | integer | 0 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlingDepth | integer | 0 | Maximum link depth from start URLs (0 = unlimited) |
| maxConcurrency | integer | 50 | Maximum parallel browser pages |
| maxRequestRetries | integer | 3 | Retries for failed requests |
| maxResultsPerCrawl | integer | 0 | Maximum results saved to dataset (0 = unlimited) |
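For example, an input that bounds a crawl to 50 pages, three levels deep, with modest concurrency (values are illustrative):

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerCrawl": 50,
  "maxCrawlingDepth": 3,
  "maxConcurrency": 10,
  "maxRequestRetries": 2
}
```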
Crawl filtering
| Parameter | Type | Default | Description |
|---|---|---|---|
| globs | array | [] | Glob patterns matching URLs to include |
| excludes | array | [] | Glob patterns matching URLs to exclude |
| pseudoUrls | array | [] | Pseudo-URL patterns (alternative to globs) |
| linkSelector | string | "" | CSS selector for links to follow |
| keepUrlFragments | boolean | false | Treat URLs with different fragments as different pages |
| respectRobotsTxtFile | boolean | false | Honor robots.txt rules |
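As a sketch of the pseudo-URL alternative, the input below follows only links matched by a CSS selector and enqueues only URLs matching a pseudo-URL pattern (the `purl` object shape follows Apify's pseudo-URL convention; the selector and pattern are illustrative):

```json
{
  "startUrls": [{"url": "https://example.com"}],
  "linkSelector": "a.article-link",
  "pseudoUrls": [{"purl": "https://example.com/articles/[.+]"}],
  "respectRobotsTxtFile": true
}
```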
Content extraction
Extraction settings are passed as a JSON object in trafilaturaConfig. Leave empty for balanced defaults.
| Key | Type | Default | Description |
|---|---|---|---|
| favorPrecision | boolean | false | High precision, less noise |
| favorRecall | boolean | false | High recall, more content |
| fast | boolean | false | Fast mode (less thorough) |
| includeComments | boolean | true | Include page comments |
| includeTables | boolean | true | Include tables |
| includeImages | boolean | false | Include image descriptions |
| includeFormatting | boolean | true | Preserve formatting |
| includeLinks | boolean | true | Include links |
| deduplicate | boolean | false | Deduplicate extracted content |
| withMetadata | boolean | true | Extract page metadata |
| targetLanguage | string | null | Filter by language (e.g. "en") |
| pruneXpath | array | null | XPath patterns to remove from content |
Example:
```json
{
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeTables": true,
    "deduplicate": true,
    "targetLanguage": "en"
  }
}
```
Output settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| saveExtractedMarkdownToKeyValueStore | boolean | true | Save extracted Markdown |
| saveRawHtmlToKeyValueStore | boolean | false | Save raw HTML |
| saveExtractedTextToKeyValueStore | boolean | false | Save extracted plain text |
| saveExtractedJsonToKeyValueStore | boolean | false | Save extracted JSON with metadata |
| saveExtractedXmlToKeyValueStore | boolean | false | Save extracted XML |
| saveExtractedXmlTeiToKeyValueStore | boolean | false | Save extracted XML-TEI (scholarly format) |
| datasetName | string | "" | Named dataset for results (empty = default run dataset) |
| keyValueStoreName | string | "" | Named key-value store for content files (empty = default) |
| requestQueueName | string | "" | Named request queue for pending URLs (empty = default) |
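For instance, to keep Markdown and JSON output in named storages that persist across runs (the storage names are illustrative):

```json
{
  "saveExtractedMarkdownToKeyValueStore": true,
  "saveExtractedJsonToKeyValueStore": true,
  "datasetName": "blog-crawl-results",
  "keyValueStoreName": "blog-crawl-content"
}
```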
Browser settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| launcher | string | CHROMIUM | Browser engine: CHROMIUM or FIREFOX |
| headless | boolean | true | Run browser in headless mode |
| waitUntil | string | LOAD | Navigation event: LOAD, NETWORKIDLE, DOMCONTENTLOADED |
| pageLoadTimeoutSecs | integer | 60 | Page load timeout in seconds |
| ignoreCorsAndCsp | boolean | false | Disable CORS/CSP restrictions |
| closeCookieModals | boolean | true | Auto-dismiss cookie consent banners |
| maxScrollHeightPixels | integer | 5000 | Max scroll height in pixels (0 = disable) |
| userAgent | string | "" | Custom User-Agent string |
| ignoreSslErrors | boolean | false | Skip SSL certificate verification |
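A sketch of browser settings for slow, JavaScript-heavy sites: wait for the network to go idle, allow more time per page, and scroll further to trigger lazy loading (values are illustrative, not recommendations):

```json
{
  "launcher": "FIREFOX",
  "waitUntil": "NETWORKIDLE",
  "pageLoadTimeoutSecs": 120,
  "closeCookieModals": true,
  "maxScrollHeightPixels": 10000
}
```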
Proxy
| Parameter | Type | Default | Description |
|---|---|---|---|
| proxyConfiguration | object | | Apify proxy settings (use the proxy editor in Console) |
| proxyRotation | string | RECOMMENDED | Rotation: RECOMMENDED, PER_REQUEST, UNTIL_FAILURE |
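A proxy input might look like the following; the object shape follows Apify's standard proxy input schema, and the proxy group name is illustrative (available groups depend on your Apify plan):

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "proxyRotation": "PER_REQUEST"
}
```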
Cookies & headers
| Parameter | Type | Default | Description |
|---|---|---|---|
| initialCookies | array | [] | Cookies pre-set on all pages (JSON array of cookie objects) |
| customHttpHeaders | object | {} | Custom HTTP headers added to all requests |
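For example, to crawl behind a login by injecting a session cookie and to request English content (the cookie name and value are placeholders):

```json
{
  "initialCookies": [
    {"name": "session", "value": "YOUR_SESSION_VALUE", "domain": ".example.com", "path": "/"}
  ],
  "customHttpHeaders": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}
```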
Diagnostics
| Parameter | Type | Default | Description |
|---|---|---|---|
| debugLog | boolean | false | Include debug messages in log output |
| browserLog | boolean | false | Include browser console messages in log |
Output
Results are stored in an Apify dataset. Each item contains page metadata and references to extracted content files in the key-value store:
```json
{
  "loadedUrl": "https://example.com/article",
  "httpStatus": 200,
  "loadedAt": "2026-04-11T12:00:00.000Z",
  "metadata": {
    "title": "Article Title",
    "author": "Author Name",
    "publishedAt": "2026-01-15",
    "description": "Article description",
    "siteName": "Example",
    "lang": "en"
  },
  "extractedMarkdown": {
    "key": "example-com-article_markdown",
    "url": "https://api.apify.com/v2/key-value-stores/STORE_ID/records/example-com-article_markdown",
    "hash": "a1b2c3d4",
    "length": 4523
  }
}
```
Extracted content files (Markdown, text, JSON, XML) are stored in the key-value store. Access them via the URL in each dataset item or download them from the Apify Console.
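After a run finishes, results can be fetched programmatically through the standard Apify HTTP API (`GET /v2/datasets/{datasetId}/items` and the record URLs embedded in each item). A minimal Python sketch using only the standard library; the helper names are illustrative, not part of the actor:

```python
import urllib.request
from typing import Optional

API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id: str, token: str) -> str:
    # Build the endpoint URL for listing a dataset's items as JSON.
    return f"{API_BASE}/datasets/{dataset_id}/items?token={token}&format=json"

def markdown_record_url(item: dict) -> Optional[str]:
    # Each dataset item references its extracted files in the key-value
    # store; return the Markdown record URL if the item has one.
    ref = item.get("extractedMarkdown")
    return ref["url"] if ref else None

def download_record(url: str) -> bytes:
    # Fetch one content file from the key-value store (network call).
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Iterating over the dataset items and passing each `extractedMarkdown` URL to `download_record` retrieves the content files without going through the Console.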
Example
Crawl all blog posts from a site with glob filtering:
```json
{
  "startUrls": [{"url": "https://example.com/blog"}],
  "globs": ["https://example.com/blog/**"],
  "excludes": ["https://example.com/blog/archive/**"],
  "maxPagesPerCrawl": 100,
  "maxCrawlingDepth": 2,
  "trafilaturaConfig": {
    "favorPrecision": true,
    "deduplicate": true
  },
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
CLI and Docker alternatives
The same extraction engine is available as an npm package and a Docker image for local use. All extraction options behave identically across interfaces.
Updated: April 11, 2026