Web content extraction tool
What is Contextractor?
Contextractor is an online tool for extracting the main content from a single web page. It can also be run as an Apify actor.
It uses Trafilatura, the highest-rated open-source content extraction library (F1 score 0.958), to strip away navigation, ads, and boilerplate — leaving just the text you need. Ideal for building LLM training datasets, RAG pipelines, and research applications.
Why content extraction?
You fetch an HTML page and get 150KB of navigation, ads, cookie banners, and — somewhere in the middle — the actual article. Content extraction is the process of pulling out just the readable text and discarding the rest.
It's been a research problem since 2003, and heuristic approaches still beat neural models on heterogeneous pages. Browser reading modes, RAG pipelines, and LLM training data all depend on it.
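To make the idea concrete, here is a deliberately tiny heuristic written with only the Python standard library. It is a toy sketch, not how Trafilatura or Contextractor work: the `SKIP_TAGS` set, the `ToyExtractor` class, and the `toy_extract` function are names invented for this illustration, and real extractors use far richer signals than tag names.

```python
from html.parser import HTMLParser

# Tags whose contents this toy example treats as boilerplate (an assumption,
# chosen for illustration; real pages need far more nuanced rules).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ToyExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of boilerplate elements we are inside
        self.chunks = []     # readable text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def toy_extract(html: str) -> str:
    """Return the text left after dropping the SKIP_TAGS elements."""
    parser = ToyExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Title</h1><p>The actual article text.</p></article>
  <footer>Cookie settings</footer>
  <script>trackVisitor();</script>
</body></html>
"""
print(toy_extract(page))
```

On this sample page the function keeps the heading and the paragraph and drops the navigation links, the footer, and the script. Pages in the wild rarely mark their boilerplate this cleanly, which is why production extractors combine many heuristics.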
What is Trafilatura?
Trafilatura is a Python library that extracts the main content from web pages — the article text, headings, and metadata — while stripping away navigation, ads, sidebars, and footers.
Originally developed for academic corpus building, it has since become one of the most widely used content extraction tools in the Python ecosystem.
Contextractor runs on Trafilatura, adding a web interface and an API on top of its extraction engine. Read the full article on Trafilatura for details on how its heuristic pipeline works, how it compares to alternatives, and what output formats it supports.
Did you know? Apify offers a free tier that gives you $5 of usage each month.
Apify also has a generous Creator plan, limited to running your own actors, that costs just $1/month (billed as $6 every six months). It includes a one-time $500 platform credit for your first 6 months, with up to 32 GB RAM and 32 concurrent actor runs.