Contextractor — content extraction tool

Contextractor playground

Preview extraction results, adjust extraction settings, and generate ready-to-run commands.

Playground settings, in page order: URL to extract; Extraction & output; Extraction mode; Content; Output; Page & browser; Browser behavior; Crawl scope; Session & proxy; Storage. Several settings under Browser behavior, Crawl scope, Session & proxy, and Storage are marked "Commands or code only".

What is Contextractor?

Contextractor extracts clean, readable content from any web page — stripping away navigation, ads, and boilerplate to leave just the text you need.
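To make the idea of stripping navigation and boilerplate concrete, here is a minimal sketch in Python using only the standard library. It is not Contextractor's or Trafilatura's actual algorithm: it substitutes a simple link-density heuristic (blocks that are mostly link text are treated as navigation) and falls back to all visible text when the heuristic rejects everything. The names `BlockCollector` and `extract_main_text`, and the thresholds, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; real extractors use richer rules.
BLOCK_TAGS = {"p", "li", "div", "nav", "footer", "header", "aside", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect (text, link_chars) pairs, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block and normalise its whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, max_link_density=0.3, min_chars=25):
    """Keep long, low-link-density blocks; fall back to all visible text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [
        text
        for text, link_chars in collector.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    if kept:
        return "\n".join(kept)
    # Fallback: the heuristic rejected every block, so return everything.
    return "\n".join(text for text, _ in collector.blocks)


page = """
<html><body>
<nav><a href="/">Home</a> <a href="/docs">Docs</a> <a href="/pricing">Pricing</a></nav>
<p>Extraction keeps the article body and drops navigation, ads, and footers.</p>
<footer><a href="/terms">Terms of service</a></footer>
</body></html>
"""
print(extract_main_text(page))
```

On this sample page the navigation and footer blocks are dropped because they are short and consist almost entirely of link text, so only the paragraph is printed.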

Contextractor is built on the Rust port of Trafilatura for extraction, with Crawlee, a TypeScript crawler that drives Playwright, handling the crawling. It is well suited to building LLM training datasets, RAG pipelines, and research applications.

Run Contextractor at scale on Apify, install the npm CLI or library, add it to Python with the PyPI package, or use the Playground to enter a URL and preview extraction results in your browser. The source code is on GitHub.

What is Trafilatura?

Trafilatura is an open-source library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and is consistently one of the top-rated extractors in independent benchmarks. Contextractor runs the Rust port of Trafilatura through a napi-rs binding and pairs it with Crawlee and Playwright for crawling. The Rust port scores the highest F1 (0.966) on the ScrapingHub article set, ahead of the Go port (0.960) and the original Python implementation (0.958). It applies the same heuristics and requires no Python runtime.
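For readers unfamiliar with the F1 figures quoted above, the following sketch shows how an F1 score combines precision and recall. It assumes a simple token-overlap comparison between extracted text and a reference text; the exact evaluation procedure behind the quoted benchmark numbers is not described here, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace-separated tokens."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Tokens present in both texts, counted with multiplicity.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(ext.values())  # share of extracted tokens that are correct
    recall = overlap / sum(ref.values())     # share of reference tokens that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "Rust is fast and memory safe"
extracted = "Home About Rust is fast and memory safe"  # two leftover navigation tokens
precision, recall, f1 = token_f1(extracted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.750 recall=1.000 f1=0.857
```

In this toy case the extractor recovers every reference token (recall 1.0) but keeps two navigation words, which lowers precision to 0.75 and F1 to about 0.857. An extractor scores close to 1.0 only when it both keeps the article text and drops the surrounding boilerplate.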