Contextractor npm library

The contextractor package is both a command-line tool and a TypeScript library you can import into your own Node.js application.

Install

npm install contextractor
npx playwright install chromium

Requires Node 22+. Playwright Chromium is needed for browser-based crawling; for the playwright-firefox crawler type, run npx playwright install firefox instead (a separate browser binary).
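
The Node 22+ requirement can be checked at startup, before any crawl is attempted. The following is a minimal sketch; `isSupportedNode` and `MIN_NODE_MAJOR` are hypothetical names used for illustration and are not part of contextractor.

```typescript
// Hypothetical helper (not part of contextractor): checks a Node.js version
// string such as "22.3.0" against the documented minimum major version.
const MIN_NODE_MAJOR = 22;

function isSupportedNode(version: string): boolean {
  // Accept both "22.3.0" and "v22.3.0".
  const majorText = version.replace(/^v/, "").split(".")[0] ?? "";
  const major = Number.parseInt(majorText, 10);
  return Number.isInteger(major) && major >= MIN_NODE_MAJOR;
}

console.log(isSupportedNode("22.3.0")); // true
console.log(isSupportedNode("v20.11.1")); // false
```

In an application, the argument would typically be `process.versions.node`.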

Exports

| Export | Description |
| --- | --- |
| `extractOne(url, options?)` | Extract a single URL (no link-following, nothing persisted) and get the content back directly |
| `createExtractor(options?)` | Run a crawl programmatically and get results back in memory (a thin, Crawlee-shaped facade) |
| `runExportAction(opts)` | Export stored content to an output directory; returns an `ExportResult` |
| `runPurgeAction(opts?)` | Delete the storage buckets under a storage directory; returns a `PurgeResult` |
| `buildProgram()` | Build the Commander program the CLI uses; run any subcommand with `parseAsync` |
| `configureStorage(dir)` | Point Crawlee storage at a directory before running |
| `resolveStorageDir()` | Resolve the storage directory using the same order as the CLI |
| `Dataset`, `KeyValueStore`, `Configuration` | Crawlee storage classes, re-exported for reading results |

Single page: extractOne

For a single page, skip the crawl machinery: extractOne fetches exactly one URL and resolves to a map from format to content string, keyed by the requested formats (default ['markdown']). It throws when the request fails:

import { extractOne } from "contextractor";

const { markdown } = await extractOne("https://example.com");

const contents = await extractOne("https://example.com", {
  formats: ["markdown", "json", "original"], // 'original' = raw page HTML
  crawlerType: "cheerio",
});
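
Because extractOne throws when the request fails, callers that process many URLs one at a time may want to turn a failure into a value. The sketch below shows one way to do that. `extractOrNull`, `FormatMap`, `ExtractFn`, and `fakeExtract` are hypothetical names; the extract function is passed in as a parameter so the example runs without the library or network access.

```typescript
// Shape of the resolved value as described above: a map from format to content.
type FormatMap = Record<string, string>;

// Any function with extractOne's documented call shape (options omitted here).
type ExtractFn = (url: string) => Promise<FormatMap>;

// Hypothetical wrapper: converts a thrown failure into null plus a warning.
async function extractOrNull(
  extract: ExtractFn,
  url: string,
): Promise<FormatMap | null> {
  try {
    return await extract(url);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    console.warn(`Extraction failed for ${url}: ${reason}`);
    return null;
  }
}

// Stand-in for extractOne, used only to make this sketch self-contained.
const fakeExtract: ExtractFn = async (url) => {
  if (url.includes("broken")) throw new Error("request failed");
  return { markdown: `# Content of ${url}` };
};
```

In real use, the first argument would be a function that calls extractOne, for example `(url) => extractOne(url)`, assuming its resolved value is assignable to `FormatMap`.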

Options are the single-page subset of the createExtractor options — same camelCase names, and proxyConfiguration still works:

Valid formats: the SaveFormat union ('txt' | 'markdown' | 'json' | 'html' | 'original'). The SAVE_FORMATS array is exported too.
Excluded options: save, storageDir, includeHtml (use formats: ['original'] instead), and the crawl and concurrency knobs (maxCrawlDepth, maxRequestsPerCrawl, maxResultsPerCrawl, globs, exclude, selector, useSitemaps, keepUrlFragment, initialConcurrency, maxConcurrency, deduplication, storeSkippedUrls, and sessionPoolName), since the run is pinned to a single URL.
crawlerType: valid values are playwright-adaptive (default), playwright-firefox, playwright-chromium, and cheerio. The CLI's short aliases (adaptive, firefox, chromium) are CLI-only and the library rejects them, so use the playwright-* names in extractOne and createExtractor.
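
If crawler types reach your code from CLI-style input, the short aliases need translating before they are passed to the library. The sketch below is one possible approach. The helper `toLibraryCrawlerType` is hypothetical, and the alias-to-name mapping is an assumption inferred from the names listed above rather than something the library exports.

```typescript
// Library-accepted crawlerType values, as listed above.
const CRAWLER_TYPES = [
  "playwright-adaptive",
  "playwright-firefox",
  "playwright-chromium",
  "cheerio",
] as const;
type CrawlerType = (typeof CRAWLER_TYPES)[number];

// Assumed mapping from the CLI-only short aliases to the library names.
const CLI_ALIASES = new Map<string, CrawlerType>([
  ["adaptive", "playwright-adaptive"],
  ["firefox", "playwright-firefox"],
  ["chromium", "playwright-chromium"],
]);

// Hypothetical helper: accepts a CLI alias or a full name and returns the
// library name, or throws on anything else.
function toLibraryCrawlerType(value: string): CrawlerType {
  const full = CRAWLER_TYPES.find((t) => t === value);
  if (full !== undefined) return full;
  const mapped = CLI_ALIASES.get(value);
  if (mapped !== undefined) return mapped;
  throw new Error(`Unknown crawlerType: ${value}`);
}

console.log(toLibraryCrawlerType("firefox")); // playwright-firefox
```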

Run a crawl in memory: createExtractor

Run a multi-page crawl programmatically and get results back without touching disk:

import { createExtractor } from "contextractor";

const extractor = createExtractor({
  save: ["txt-kvs"],
  deduplication: "minimal",
  maxResultsPerCrawl: 10, // bounds the in-memory result set (0 = unlimited)
});

const { dataset, statistics } = await extractor.run(["https://example.com"]);

await dataset.forEach((record, i) => {
  console.log(i, record.url, "depth:", record.crawlDepth);
});
const all = dataset.export(); // LibraryRecord[]

The returned dataset is a ResultDataset, the run-result type whose export() yields LibraryRecord[]. It is not the re-exported Crawlee Dataset storage class used below for Dataset.open(...). It holds successful extractions only; failed and skipped requests are reflected in statistics (a subset of Crawlee's FinalStatistics), and run() never throws on partial failure.
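
Once the records are in memory, ordinary array code can summarize them. The sketch below groups URLs by crawl depth. It relies only on the `url` and `crawlDepth` fields used in the example above; `RecordSummary` and `groupByDepth` are hypothetical names, and treating `crawlDepth` as a number is an assumption, since the full LibraryRecord shape is not shown here.

```typescript
// Only the two fields used in the example above; the full LibraryRecord
// shape is not shown in this document.
interface RecordSummary {
  url: string;
  crawlDepth: number; // assumed to be numeric
}

// Groups record URLs by the depth at which they were crawled.
function groupByDepth(
  records: readonly RecordSummary[],
): Map<number, string[]> {
  const byDepth = new Map<number, string[]>();
  for (const { url, crawlDepth } of records) {
    const urls = byDepth.get(crawlDepth) ?? [];
    urls.push(url);
    byDepth.set(crawlDepth, urls);
  }
  return byDepth;
}

// Sample data standing in for the exported records.
const sample: RecordSummary[] = [
  { url: "https://example.com/", crawlDepth: 0 },
  { url: "https://example.com/a", crawlDepth: 1 },
  { url: "https://example.com/b", crawlDepth: 1 },
];
console.log(groupByDepth(sample));
```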

Option names — the same camelCase names as the JSON config (e.g. crawlerType, maxResultsPerCrawl, save).
Library-only knobs — includeHtml (default false), storageDir (when set, also writes full records to disk), and logLevel (default warning).
Proxy — pass proxyConfiguration: { proxyUrls: [...] } (http/https/socks4/socks5 only).
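
Since only http, https, socks4, and socks5 proxies are accepted, it can help to validate a proxy list before building the options object. The following is a small sketch using the standard URL class; `invalidProxyUrls` is a hypothetical helper, and the library's own validation may behave differently.

```typescript
// Protocols named in the proxy note above, in URL.protocol form.
const ALLOWED_PROXY_PROTOCOLS = new Set([
  "http:",
  "https:",
  "socks4:",
  "socks5:",
]);

// Hypothetical helper: returns the entries that are unparseable or use a
// scheme outside the allowed set.
function invalidProxyUrls(proxyUrls: readonly string[]): string[] {
  return proxyUrls.filter((raw) => {
    try {
      return !ALLOWED_PROXY_PROTOCOLS.has(new URL(raw).protocol);
    } catch {
      return true; // not parseable as a URL at all
    }
  });
}

const candidates = [
  "http://user:pass@proxy.example.com:8000",
  "socks5://127.0.0.1:1080",
  "ftp://proxy.example.com",
  "not a url",
];
console.log(invalidProxyUrls(candidates));
```

An empty result means every entry uses one of the supported schemes.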

Drive the CLI from code: buildProgram

Drive the CLI program from code, then read the results back from the dataset:

import {
  buildProgram,
  configureStorage,
  Dataset,
  resolveStorageDir,
} from "contextractor";

const storageDir = resolveStorageDir();
configureStorage(storageDir);

const program = buildProgram();
await program.parseAsync([
  "node",
  "contextractor",
  "extract",
  "https://example.com/",
  "--save",
  "markdown-dataset",
]);

// Open the dataset the CLI run wrote to and read back up to 100 records.
const ds = await Dataset.open("default");
const page = await ds.getData({ limit: 100 });
console.log(`Extracted ${page.count} item(s)`);

A *-dataset save token inlines the extracted content in each record, so Dataset.open(...) can read it back directly. A *-kvs token instead stores content in the key-value store, where each record references its content by key.
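
The distinction between the two token families can be made explicit in code, for example to decide where to look for content after a run. The sketch below is illustrative only: `saveTarget` is a hypothetical helper, and it assumes every save token ends in either `-dataset` or `-kvs`, which this page does not state exhaustively.

```typescript
type SaveTarget = "dataset" | "kvs";

// Hypothetical helper: classifies a save token by its suffix.
// Assumes tokens always end in "-dataset" or "-kvs".
function saveTarget(token: string): SaveTarget {
  if (token.endsWith("-dataset")) return "dataset"; // content inlined in records
  if (token.endsWith("-kvs")) return "kvs"; // content stored by key
  throw new Error(`Unrecognized save token: ${token}`);
}

console.log(saveTarget("markdown-dataset")); // dataset
console.log(saveTarget("txt-kvs")); // kvs
```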

Export stored content

runExportAction reads the dataset record index and, for every success record, writes one file per saved format to the output directory, using the inline content when present or fetching the key-value-store blob by key otherwise. File names are derived from the record title, falling back to its URL and then to page. A manifest.json listing every record is written alongside the files.

import { runExportAction } from "contextractor";

const result = await runExportAction({
  outputDir: "./contextractor-output",
  storageDir: "./storage",
});

console.log(`Wrote ${result.filesWritten} file(s) to ${result.outputDir}`);

ExportOpts accepts outputDir and storageDir. ExportResult reports outputDir, filesWritten (a count), recordsTotal, and manifestPath.
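
The result fields lend themselves to a one-line report, plus a check for an export that wrote nothing. The sketch below assumes field types that this page only implies (strings for the paths, numbers for the counts); `ExportResultLike` and `describeExport` are hypothetical names, not library exports.

```typescript
// Assumed shape, based on the fields listed above; the actual types in
// contextractor may differ.
interface ExportResultLike {
  outputDir: string;
  filesWritten: number;
  recordsTotal: number;
  manifestPath: string;
}

// Hypothetical helper: builds a human-readable summary of an export.
function describeExport(r: ExportResultLike): string {
  if (r.filesWritten === 0) {
    return `No files written from ${r.recordsTotal} record(s); see ${r.manifestPath}`;
  }
  return (
    `Wrote ${r.filesWritten} file(s) from ${r.recordsTotal} record(s) ` +
    `to ${r.outputDir} (manifest: ${r.manifestPath})`
  );
}

console.log(
  describeExport({
    outputDir: "./contextractor-output",
    filesWritten: 6,
    recordsTotal: 3,
    manifestPath: "./contextractor-output/manifest.json",
  }),
);
```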

Purge storage

runPurgeAction deletes the datasets/, key_value_stores/, and request_queues/ directories under the resolved storage directory — the programmatic counterpart of the CLI's purge subcommand. It never calls process.exit; it returns the resolved path so you can report it:

import { runPurgeAction } from "contextractor";

const { storageDir } = await runPurgeAction({ storageDir: "./storage" });
console.log(`Purged ${storageDir}`);

PurgeOpts accepts an optional storageDir (resolved with the same precedence as the CLI when omitted). The purge is irreversible — the buckets are deleted with no confirmation.
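
Because the purge runs without confirmation, an application may want its own guard in front of it. The sketch below shows one way to require an explicit flag. `purgeIfConfirmed` and `PurgeFn` are hypothetical names; the purge function is injected so the example runs without touching disk, and its signature is an assumption modeled on the call shown above.

```typescript
// Assumed call shape, modeled on the runPurgeAction example above.
type PurgeFn = (opts?: { storageDir?: string }) => Promise<{ storageDir: string }>;

// Hypothetical guard: only calls the purge function when confirmed is true.
// Resolves to the purged path, or null when the purge was skipped.
async function purgeIfConfirmed(
  purge: PurgeFn,
  storageDir: string,
  confirmed: boolean,
): Promise<string | null> {
  if (!confirmed) {
    console.warn(`Refusing to purge ${storageDir} without confirmation`);
    return null;
  }
  const result = await purge({ storageDir });
  return result.storageDir;
}

// Stand-in for runPurgeAction that records calls instead of deleting anything.
const purgedDirs: string[] = [];
const fakePurge: PurgeFn = async (opts) => {
  const dir = opts?.storageDir ?? "./storage";
  purgedDirs.push(dir);
  return { storageDir: dir };
};
```

In real use, `runPurgeAction` would take the place of `fakePurge`, assuming its signature is compatible with `PurgeFn`.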

Read Crawlee storage directly

To read back results a previous crawl wrote to disk, use the re-exported Crawlee storage classes. Point Crawlee's storage at the directory with configureStorage, then open the dataset or key-value store and iterate its records — no extraction is run here:

import {
  configureStorage,
  Dataset,
  KeyValueStore,
} from "contextractor";

// Point Crawlee's storage at a directory before opening any store.
configureStorage("./storage");

const ds = await Dataset.open("my-dataset");
await ds.forEach((item) => console.log(item));

const kvs = await KeyValueStore.open("default");
const value = await kvs.getValue("my-key");

Where to go next

Help hub — the index of every Contextractor help page.
Playground — enter a URL and preview extraction results in your browser.
Apify Actor — run extraction at scale on the Apify platform.
npm CLI — the command-line tool, flag reference, and JSON config.
PyPI package — extract from Python with the contextractor PyPI wrapper.

Updated: July 5, 2026