About Contextractor, the web content extraction tool

What is Contextractor?

Contextractor is an open-source web content extraction tool: a web scraper that pulls the readable article out of a page rather than the whole DOM. Extraction runs on the Rust port of Trafilatura. Crawling is handled by Crawlee, a TypeScript crawling library that can either drive a browser through Playwright (adaptive by default, or Firefox or Chromium) or fetch pages over plain HTTP and parse them with Cheerio when no browser is needed. The tool strips away navigation, ads, and boilerplate, leaving just the article text. The website at contextractor.com serves as a playground where you can enter a URL, configure extraction settings, and preview results.

Why the Rust port of Trafilatura?

On the ScrapingHub article set (181 articles), the Rust port scores an F1 of 0.966 (precision 0.942, recall 0.991), ahead of go-trafilatura at 0.960 and the original Python Trafilatura at 0.9581 [1]. That tracks with what you'd expect: Trafilatura's hybrid extraction approach consistently lands at or near the top of independent content-extraction benchmarks, and the Rust port keeps that lead while adding ML page-type classification and confidence scoring on top. The Rust port is exposed to TypeScript through a napi-rs binding, so the Apify Actor and the playground get the same extraction quality with no Python runtime involved.
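
As a quick sanity check on the figures above: F1 is the harmonic mean of precision and recall. The sketch below is illustrative only and is not part of Contextractor or the Rust port; the function name f1Score is made up for this example. It recomputes F1 from the rounded precision and recall quoted in the text, so the trailing digits could differ slightly from a score computed on the raw benchmark counts.

```typescript
// Illustrative helper (hypothetical name, not from any library):
// F1 is the harmonic mean of precision and recall.
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) {
    return 0; // avoid dividing by zero when both inputs are zero
  }
  return (2 * precision * recall) / (precision + recall);
}

// Precision and recall as quoted for the Rust port on the ScrapingHub set.
const rustPortF1 = f1Score(0.942, 0.991);

// Round to three decimals to compare with the reported score.
console.log(rustPortF1.toFixed(3)); // prints 0.966
```

Plugging in the quoted precision and recall reproduces the reported 0.966, so the three figures are at least consistent with each other.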

Features

  • Playground — Enter a URL and preview extraction results at contextractor.com
  • High-accuracy extraction — Uses the Rust port of Trafilatura, whose hybrid extraction approach balances precision and recall
  • Multiple output formats — Plain text, Markdown, JSON, cleaned HTML, and the original raw page source
  • Metadata extraction — Automatically extracts title, author, publication date, and more
  • Free and open-source — No registration required, source code on GitHub

Use cases

  • Build LLM training datasets — Extract clean text from web pages for machine learning and AI applications
  • Feed content into RAG pipelines — Get clean, structured content for retrieval-augmented generation
  • Research and academic text extraction — Collect article content for analysis without boilerplate noise
  • Content monitoring — Track and extract content changes from websites

Apify Actor

For automated and large-scale extraction, use the Contextractor Apify Actor. It supports JavaScript rendering with Playwright, link crawling across sites, and configurable extraction modes.

Company behind Contextractor

Contextractor is operated by Glueo, s.r.o., a Prague-based software development company that runs its own online services, such as Contextractor, and provides custom software development.

Citations

  1. rs-trafilatura: README, benchmarks. Retrieved May 31, 2026.

Updated: May 31, 2026