About

What is Contextractor?

Contextractor is a web tool that extracts clean, readable content from any webpage. It uses Trafilatura, the open-source content extraction library with the highest benchmark F1 score (0.958), to strip away navigation, ads, and boilerplate, leaving just the text you need.

Why Trafilatura?

Trafilatura achieves the highest F1 score (0.958) among open-source content extraction tools in independent benchmarks, outperforming newspaper4k (0.949), Mozilla Readability (0.947), and goose3 (0.896). With over 4,900 GitHub stars and production deployments at HuggingFace, IBM, and Microsoft Research, Trafilatura has become the de facto standard for text extraction.
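For readers unfamiliar with the metric, F1 is the harmonic mean of precision (how much of the extracted text is real content) and recall (how much of the real content was extracted). The short Python sketch below shows the formula only. The function name `f1_score` and the sample precision and recall values are illustrative assumptions, not figures from any benchmark.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall pairs, chosen only to illustrate the formula.
# They are NOT published figures for Trafilatura or any other tool.
balanced = f1_score(0.90, 0.90)
lopsided = f1_score(0.99, 0.50)

print(f"balanced: {balanced:.3f}")  # balanced: 0.900
print(f"lopsided: {lopsided:.3f}")  # lopsided: 0.664
```

Because the harmonic mean is pulled toward the lower of the two values, an extractor cannot reach a high F1 by being very precise while missing much of the content, or the other way around.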

Features

High accuracy extraction — Uses Trafilatura's hybrid extraction approach for the best balance of precision and recall
Multiple output formats — Markdown, plain text, or structured data
Metadata extraction — Automatically extracts title, author, publication date, and more
Free — No registration required
Web app — No download needed

Use cases

Build LLM training datasets — Extract clean text from web pages for machine learning and AI applications
Feed content into RAG pipelines — Get clean, structured content for retrieval-augmented generation
Research and academic text extraction — Collect article content for analysis without boilerplate noise
Content monitoring — Track and extract content changes from websites

Apify Actor

For automated and large-scale extraction, use the Contextractor Apify Actor. It supports JavaScript rendering with Playwright, link crawling across sites, and configurable extraction modes.

Company behind Contextractor

Contextractor is operated by Glueo, s.r.o., a Prague-based software development company that runs its own online services, such as Contextractor, and provides custom software development.

Updated: January 1, 2025