Contextractor playground

Generate ready-to-run commands or code to export saved crawl results to files.

NPM package CLI

Export stored extraction results to files using the contextractor CLI. It reads the results that a prior `extract` crawl saved to storage, so run an extract first (or point `--storage` at an existing storage directory); otherwise there is nothing to export. Requires Node 22+.

#!/usr/bin/env bash
set -euo pipefail

mkdir -p contextractor-run
cd contextractor-run

npm init -y
npm install contextractor

npx contextractor export
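Because the CLI requires Node 22+, it can help to check the installed Node version before running the script above. The following is a minimal sketch, not part of Contextractor itself: the helper name `node_is_supported` is hypothetical, and it assumes `node --version` prints the usual `vMAJOR.MINOR.PATCH` form.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds when a Node version string such as
# "v22.3.0" (or "22.3.0") has a major version of 22 or higher.
node_is_supported() {
  local version="${1#v}"        # strip a leading "v" if present
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge 22 ]
}

# Report on the locally installed Node, if any.
if ! command -v node >/dev/null 2>&1; then
  echo "Node.js not found; install Node 22 or newer." >&2
elif ! node_is_supported "$(node --version)"; then
  echo "Node $(node --version) is too old; Node 22+ is required." >&2
else
  echo "Node $(node --version) meets the Node 22+ requirement."
fi
```

The check only reports; it does not stop the shell, so you can paste it ahead of the export commands and decide for yourself whether to continue.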

Available as:

Apify Actor (Apify gives you $5 of free usage monthly)
NPM package (CLI and library)
Source code on GitHub


What is Contextractor?

Contextractor extracts clean, readable content from any web page, stripping away navigation, ads, and boilerplate to leave just the text you need. Its Markdown output typically runs 80–90% fewer tokens than the raw HTML, so it is cheap to feed to an LLM. It is free and open source, so you can download it and run it yourself. Use it to build LLM training datasets, RAG pipelines, or research corpora.

It uses a Rust port of Trafilatura, which scores the highest F1 (0.966) on the ScrapingHub benchmark, with Crawlee handling the crawling. The Rust port keeps Trafilatura's heuristic extraction core and adds ML page-type routing so each page gets a matching extraction strategy. It runs at compiled Rust speed and needs no Python and no GPU.

Run it as a hosted Apify actor, or self-host the npm CLI or npm library — the same engine and options in every channel.

Getting started

The easiest way to try Contextractor is a single npx command that extracts a page straight to your terminal. It needs no browser install (you can install one later) and no API key, because you host it yourself:

npx contextractor extract-one https://example.com/ --crawler-type cheerio

--crawler-type cheerio fetches over plain HTTP, so no headless browser is downloaded. Need a whole-site crawl or specific formats? Use the playground to build a more advanced command visually, then copy it.