About
What is Contextractor?
Contextractor is a web content extraction tool available as a CLI, Docker container, and Apify Actor. It uses Trafilatura, the highest-rated open-source content extraction library (F1 score 0.958), to strip away navigation, ads, and boilerplate, leaving just the text you need. The website at contextractor.com serves as a playground where you can configure extraction settings, preview results, and generate ready-to-use commands.
Why Trafilatura?
Trafilatura achieves the highest F1 score (0.958) among open-source content extraction tools in independent benchmarks, outperforming newspaper4k (0.949), Mozilla Readability (0.947), and goose3 (0.896). With over 4,900 GitHub stars and production deployments at HuggingFace, IBM, and Microsoft Research, Trafilatura has become the de facto standard for text extraction.
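For readers unfamiliar with the metric, F1 is the harmonic mean of precision (how much of the extracted text is real content) and recall (how much of the real content was extracted). The sketch below shows the standard formula; the precision and recall inputs are hypothetical values chosen for illustration and are not taken from the benchmarks cited above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs for illustration only, not benchmark data.
balanced = f1_score(0.9, 0.9)   # precision and recall are equal
lopsided = f1_score(1.0, 0.5)   # perfect precision, but half the content is missed

print(f"balanced: {balanced:.3f}")
print(f"lopsided: {lopsided:.3f}")
```

Because the harmonic mean penalizes imbalance, a tool scores well only when it both keeps the real content and drops the boilerplate.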
Features
- CLI: Install via pip and extract content from the command line
- Docker: Run as a container for consistent environments and easy deployment
- Apify Actor: Automated and large-scale extraction with JavaScript rendering and link crawling
- Playground: Configure settings, preview extraction, and generate CLI/Docker/Apify commands at contextractor.com
- High-accuracy extraction: Uses Trafilatura's hybrid extraction approach to balance precision and recall
- Multiple output formats: Markdown, plain text, JSON, JSONL, XML, and XML-TEI
- Metadata extraction: Automatically extracts title, author, publication date, and more
- Free: No registration required
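JSONL output stores one JSON object per line, which makes it convenient to stream into downstream tooling. The following is a minimal sketch of reading such output in Python. The field names (`title`, `author`, `date`, `text`) and the sample records are assumptions made for illustration; the actual fields Contextractor emits may differ.

```python
import io
import json

# Hypothetical sample standing in for a JSONL output file.
# Field names are assumed for illustration, not taken from Contextractor's schema.
sample = io.StringIO(
    '{"title": "First post", "author": "A. Writer", "date": "2026-01-05", "text": "Hello world."}\n'
    '{"title": "Second post", "author": null, "date": "2026-02-10", "text": "More clean text here."}\n'
)


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(sample))
titles = [record["title"] for record in records]
total_words = sum(len(record["text"].split()) for record in records)

print(titles)
print(total_words)
```

With a real file, `open(path, encoding="utf-8")` would take the place of the in-memory `sample` stream.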
Use cases
- Build LLM training datasets: Extract clean text from web pages for machine learning and AI applications
- Feed content into RAG pipelines: Get clean, structured content for retrieval-augmented generation
- Research and academic text extraction: Collect article content for analysis without boilerplate noise
- Content monitoring: Track and extract content changes from websites
Apify Actor
For automated and large-scale extraction, use the Contextractor Apify Actor. It supports JavaScript rendering with Playwright, link crawling across sites, and configurable extraction modes.
The company behind Contextractor
Contextractor is operated by Glueo, s.r.o., a Prague-based software development company that runs its own online services, including Contextractor, and provides custom software development.
Updated: April 9, 2026