Web scraper - content extraction tool

Contextractor playground

Preview extraction results, adjust extraction settings, and generate ready-to-run commands. Built on Crawlee, Playwright, and the Rust port of Trafilatura. Run at scale on Apify.

What is Contextractor?

Contextractor extracts clean, readable content from any web page — stripping away navigation, ads, and boilerplate to leave just the text you need.

It is built on the Rust port of Trafilatura for extraction, with Crawlee (a TypeScript crawler that drives Playwright) handling the crawling. It is well suited to building LLM training datasets, RAG pipelines, and research applications.

Run Contextractor at scale on Apify, or use the Playground to paste HTML and preview extraction results in your browser. The source code is available on GitHub.

What is Trafilatura?

Trafilatura is an open-source library, originally written in Python, that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor runs the Rust port of Trafilatura through a napi-rs binding, paired with Crawlee and Playwright for crawling: the same heuristics, with no Python runtime required.
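
To make "a heuristic pipeline with fallback algorithms" concrete, here is a minimal, dependency-free sketch of the general pattern: try a precise heuristic first, and fall back to a cruder one when the result looks too thin. This is not Trafilatura's actual algorithm or Contextractor's API. The function names (`stripTags`, `fromArticleTag`, `fromWholePage`, `extractWithFallback`) and the `minLength` threshold are hypothetical, chosen only for illustration, and the regex-based HTML handling is far simpler than what a real extractor does.

```typescript
// Illustrative sketch only: NOT Trafilatura's algorithm. It shows the
// "precise heuristic first, cruder fallback second" pattern in miniature.

type Extractor = (html: string) => string;

// Drop elements that rarely carry main content, then strip remaining tags
// and collapse whitespace. (Regexes are a simplification; real extractors
// work on a parsed DOM.)
function stripTags(html: string): string {
  return html
    .replace(/<(script|style|nav|header|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Primary heuristic (assumed for this sketch): trust an explicit
// <article> or <main> element when the page has one.
const fromArticleTag: Extractor = (html) => {
  const match = html.match(/<(article|main)\b[^>]*>([\s\S]*?)<\/\1>/i);
  const inner = match?.[2] ?? "";
  return stripTags(inner);
};

// Fallback heuristic: use the whole page minus known boilerplate elements.
const fromWholePage: Extractor = (html) => stripTags(html);

// Run the extractors in order and accept the first result that is long
// enough; if none qualifies, return the longest result seen.
function extractWithFallback(html: string, minLength = 20): string {
  const extractors: Extractor[] = [fromArticleTag, fromWholePage];
  let best = "";
  for (const extract of extractors) {
    const text = extract(html);
    if (text.length >= minLength) return text;
    if (text.length > best.length) best = text;
  }
  return best;
}

// Usage: the <nav> and <footer> text is dropped, the article text is kept.
const samplePage =
  "<nav>Home | About</nav>" +
  "<article><h1>Title</h1><p>This is the main body text of the page.</p></article>" +
  "<footer>Copyright</footer>";
console.log(extractWithFallback(samplePage));
```

The ordering is the design choice worth noting: the stricter extractor runs first because it produces cleaner output when it succeeds, and the fallback exists so that pages without semantic markup still yield something usable.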