About Contextractor, the web content extraction tool

What is Contextractor?

Contextractor is an open-source web content extraction tool aimed at getting the readable article out of a page, not the whole DOM. The extraction runs on the Rust port of Trafilatura, and the crawling is handled by Crawlee, a TypeScript crawler that can drive Playwright (adaptive by default, or Firefox/Chromium) or fetch pages over plain HTTP with Cheerio when no browser is needed. The tool strips away navigation, ads, and boilerplate, leaving just the text you need. The website at contextractor.com serves as a playground where you can enter a URL, configure extraction settings, and preview results.

Why the Rust port of Trafilatura?

On the Scrapinghub article set (181 articles), the Rust port scores an F1 of 0.966 (precision 0.942, recall 0.991) — ahead of go-trafilatura at 0.960 and the original Python Trafilatura at 0.958¹. Trafilatura's hybrid extraction approach is the engine's foundation, and the Rust port keeps that lead while adding auxiliary ML page-type classification and confidence scoring on top of the heuristic extraction core (the extraction step itself stays ML-free). The Rust port is exposed to TypeScript through a napi-rs binding, so every surface — the npm CLI and library, the Apify Actor, and the playground — gets the same extraction quality with no Python runtime involved.
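As a quick check on how the headline figure relates to its parts: F1 is the harmonic mean of precision and recall. The short TypeScript sketch below recomputes it from the two numbers quoted above. The helper name `f1Score` is ours for illustration and is not part of any Contextractor or Trafilatura API.

```typescript
// F1 is the harmonic mean of precision and recall.
// `f1Score` is an illustrative helper, not part of any library API.
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0; // avoid division by zero
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall quoted above for the Rust port.
const rustPortF1 = f1Score(0.942, 0.991);
console.log(rustPortF1.toFixed(3)); // prints "0.966"
```

Because the harmonic mean is pulled toward the lower of its two inputs, the 0.966 result sits closer to the precision figure than a plain average would.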

Token-efficient output for LLMs

Because Contextractor returns just the readable content — navigation, ads, scripts, and boilerplate stripped — its Markdown output typically runs 80–90% fewer tokens than the raw HTML². For an LLM or RAG pipeline that is the practical payoff: the same page costs a fraction of the input tokens to process.
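To make the "fewer tokens" figure concrete, the sketch below shows one way such a saving could be computed from token counts: per page (then taking the median) and corpus-wide (summing tokens across all pages before comparing). The two can differ, since a single large page with little boilerplate pulls the corpus-wide figure down more than it moves the median. The token counts are made-up placeholders, not measurements, and the helper names are hypothetical. A real run would obtain the counts from a tokenizer such as o200k_base.

```typescript
// Token counts for one page: raw HTML vs. extracted Markdown.
interface PageTokens {
  htmlTokens: number;
  markdownTokens: number;
}

// Fraction of tokens saved, e.g. 0.9 means 90% fewer tokens.
function saving(htmlTokens: number, markdownTokens: number): number {
  return htmlTokens === 0 ? 0 : 1 - markdownTokens / htmlTokens;
}

// Median of the per-page savings.
function medianSaving(pages: PageTokens[]): number {
  if (pages.length === 0) return 0;
  const sorted = pages
    .map((p) => saving(p.htmlTokens, p.markdownTokens))
    .sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Corpus-wide saving: sum all tokens first, then compare the totals.
function corpusSaving(pages: PageTokens[]): number {
  const html = pages.reduce((sum, p) => sum + p.htmlTokens, 0);
  const markdown = pages.reduce((sum, p) => sum + p.markdownTokens, 0);
  return saving(html, markdown);
}

// Placeholder counts for illustration only (not measured data).
const samplePages: PageTokens[] = [
  { htmlTokens: 10000, markdownTokens: 1000 },  // 90% saved
  { htmlTokens: 20000, markdownTokens: 1000 },  // 95% saved
  { htmlTokens: 50000, markdownTokens: 20000 }, // 60% saved
];

console.log((medianSaving(samplePages) * 100).toFixed(1)); // prints "90.0"
console.log((corpusSaving(samplePages) * 100).toFixed(1)); // prints "72.5"
```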

Features

Playground — Enter a URL and preview extraction results at contextractor.com
High-accuracy extraction — The Rust port of Trafilatura and its hybrid extraction approach, for the best balance of precision and recall
Multiple output formats — Plain text, Markdown, JSON, and HTML; the playground's Original raw HTML checkbox additionally returns the raw page source
Metadata extraction — Automatically extracts title, author, publication date, and more
Free and open-source — No registration required, source code on GitHub

Use cases

Build LLM training datasets — Extract clean text from web pages for machine learning and AI applications
Feed content into RAG pipelines — Get clean, structured content for retrieval-augmented generation
Research and academic text extraction — Collect article content for analysis without boilerplate noise
Content monitoring — Re-extract clean page content on a schedule to feed your own change-detection or diffing pipeline
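As a rough illustration of the content-monitoring use case, the sketch below fingerprints extracted text so that a scheduled job can tell whether a page's content changed between runs. Everything in it is hypothetical: the helper names are ours, and the extracted text is assumed to come from whichever Contextractor surface you use. The FNV-1a hash is only a convenient dependency-free choice; a cryptographic hash such as SHA-256 would be a more robust pick in production.

```typescript
// Collapse whitespace so cosmetic reflows do not count as changes.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// 32-bit FNV-1a hash over UTF-16 code units, returned as 8 hex digits.
// Non-cryptographic; chosen here only because it needs no dependencies.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Compare a fresh extraction against the fingerprint stored from the last run.
function hasChanged(
  previousFingerprint: string | undefined,
  extractedText: string
): boolean {
  return previousFingerprint !== fingerprint(normalize(extractedText));
}

const firstRun = fingerprint(normalize("Prices start at $10.\n\nUpdated weekly."));
console.log(hasChanged(firstRun, "Prices start at  $10. Updated weekly.")); // false: whitespace only
console.log(hasChanged(firstRun, "Prices start at $12. Updated weekly."));  // true: content changed
```

Storing only the fingerprint keeps the check cheap; a pipeline that needs to show what changed would keep the previous text as well and run a diff when the fingerprints differ.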

Apify Actor

For automated and large-scale extraction, use the Contextractor Apify Actor. It supports JavaScript rendering with Playwright, link crawling across sites, and configurable extraction modes.

The company behind Contextractor

Contextractor is operated by Glueo, s.r.o., a Prague-based software company that runs its own online services, such as Contextractor, and provides custom software development.

Citations

¹ rs-trafilatura: README — benchmarks. Retrieved May 31, 2026
² Measured by tokenizing the raw fetched HTML against Contextractor's Markdown output with the o200k_base (GPT-4o) tokenizer across a 28-page corpus (news, documentation, encyclopedia, articles, blog, academic, and legal pages): median 90%, corpus-wide 82% fewer tokens. The exact saving varies with how much boilerplate a page carries. Retrieved June 18, 2026

Updated: July 5, 2026