Trafilatura: High-Accuracy Web Content Extraction
Fetch any web page and look at the HTML source. You'll find the article text buried under layers of navigation menus, cookie banners, ad containers, sidebar widgets, and footer links. A typical news page might have 2,000 words of actual content inside 150KB of markup that has nothing to do with it. Just stripping all HTML tags doesn't work either — you end up with navigation labels mashed into article paragraphs.
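A minimal sketch of that failure, in TypeScript. The HTML fragment and the stripTags helper are made up for illustration; nothing here is part of Trafilatura or Contextractor.

```typescript
// Hypothetical helper: drop every tag, then collapse whitespace.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Invented page fragment: navigation, one article paragraph, footer.
const page = `
<nav><a href="/">Home</a><a href="/world">World</a></nav>
<article><p>The council approved the budget on Tuesday.</p></article>
<footer><a href="/privacy">Privacy</a></footer>`;

const naive = stripTags(page);
console.log(naive);
// Home World The council approved the budget on Tuesday. Privacy
```

The navigation and footer labels land in the same string as the article sentence, with nothing left to tell them apart.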
Trafilatura is the extraction approach that solves this. Give it raw HTML, get back the main text a human would actually read, plus metadata like title, author, date, and categories [1]. The name comes from Italian: trafilatura means "wire drawing," the process of pulling raw material through a die to refine it. Fitting.
Contextractor's extraction engine is the Rust port of Trafilatura (rs-trafilatura), exposed to TypeScript through a napi-rs binding. Same heuristics as the original, native speed, and — this is the part that matters in a Node deployment — no Python runtime anywhere in the stack.
Origin
Adrien Barbaresi, a research scientist at the Berlin-Brandenburg Academy of Sciences (BBAW), built the original Trafilatura out of necessity [2]. His work on the DWDS and ZDL digital lexicography projects required building large text corpora from web sources, and corpus linguists need clean structured text (not HTML soup) at scale.
He presented it formally at ACL 2021 [1], and adoption spread quickly beyond academia. HuggingFace, IBM, Microsoft Research, Stanford, and the Allen Institute all run it in production pipelines [3]. The library passed 5,400 GitHub stars by early 2025 [4]. It has remained one of the top-rated extractors in independent benchmarks for years, which is exactly why a port made sense: the heuristics are proven, the bottleneck was the runtime.
That's where the Rust port comes in. rs-trafilatura reimplements the extraction logic as a native crate (MIT OR Apache-2.0), then ships it to JavaScript callers through napi-rs. Contextractor consumes it as a plain npm dependency — @contextractor/extraction-native for the binding, contextractor for the CLI and library API. There's no interpreter to install, no virtualenv, no pip step.
How extraction works
No machine learning in the core extraction step. That's actually an advantage: a 2023 SIGIR paper benchmarking 14 extraction tools found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad" [5].
The pipeline runs in stages. Raw HTML gets parsed into a DOM tree, then tree pruning strips elements that are almost never content, such as <nav>, <footer>, <aside>, known ad container classes, and social widgets [3]. After that, remaining nodes go through content scoring: text density (text vs. markup ratio), link density (navigation blocks are mostly links), and element type classification.
The clever part is the fallback chain. If the initial extraction looks too short or noisy, the engine retries with a Readability-style pass (the same logic Mozilla's Readability uses). Still not good enough? It falls back to a jusText-style boilerplate pass, the approach originally developed at Masaryk University [1]. The candidate outputs get compared, best result wins.
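The pruning and scoring signals can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not the engine's code: the Block shape, the pruned-tag list, and both thresholds are invented here.

```typescript
// Hypothetical, simplified view of one DOM block after parsing.
interface Block {
  tag: string;          // element name, e.g. "p" or "nav"
  text: string;         // all visible text inside the block
  linkText: string;     // the part of `text` that sits inside <a> elements
  markupLength: number; // length of the block's serialized HTML
}

// Tags removed outright during pruning (assumed subset).
const PRUNED_TAGS = ["nav", "footer", "aside"];

// Share of the block's HTML that is visible text; higher looks more like content.
function textDensity(b: Block): number {
  return b.markupLength === 0 ? 0 : b.text.length / b.markupLength;
}

// Share of the visible text that is link text; higher looks more like navigation.
function linkDensity(b: Block): number {
  return b.text.length === 0 ? 1 : b.linkText.length / b.text.length;
}

// Illustrative thresholds only; they are not taken from the library.
function looksLikeContent(b: Block): boolean {
  if (PRUNED_TAGS.indexOf(b.tag) !== -1) return false;
  return textDensity(b) > 0.5 && linkDensity(b) < 0.3;
}

const paragraphText = "The council approved the budget on Tuesday after a long debate.";
const paragraph: Block = {
  tag: "p",
  text: paragraphText,
  linkText: "",
  markupLength: paragraphText.length + "<p></p>".length,
};
const menu: Block = {
  tag: "ul",
  text: "Home World Sport",
  linkText: "Home World Sport",
  markupLength: 120,
};

console.log(looksLikeContent(paragraph), looksLikeContent(menu)); // true false
```

The menu fails on both counts: most of its markup is tags and attributes, and every character of its text is link text.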
A fast mode skips the fallbacks entirely — roughly doubles throughput when you trust the first pass.
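A rough sketch of the fallback chain and the fast-mode switch, in TypeScript. The extractors are stubs, the function and option names are hypothetical, and output length stands in for the real "too short or noisy" check and candidate comparison.

```typescript
type Extractor = (html: string) => string;

interface ChainOptions {
  fast: boolean;     // true: trust the first pass and skip the fallbacks
  minLength: number; // stand-in for the "too short or noisy" check
}

// Stand-in comparison: the longest candidate wins.
function pickBest(candidates: string[]): string {
  return candidates.reduce((best, c) => (c.length > best.length ? c : best), "");
}

function extractWithFallbacks(
  html: string,
  primary: Extractor,
  fallbacks: Extractor[],
  opts: ChainOptions
): string {
  const first = primary(html);
  if (opts.fast || first.length >= opts.minLength) return first;

  const candidates = [first];
  for (const fallback of fallbacks) {
    const candidate = fallback(html);
    candidates.push(candidate);
    if (candidate.length >= opts.minLength) break; // good enough, stop here
  }
  return pickBest(candidates);
}

// Stub extractors so the sketch runs on its own; `calls` records the order.
const calls: string[] = [];
const primaryStub: Extractor = () => { calls.push("primary"); return "short"; };
const readabilityStub: Extractor = () => {
  calls.push("readability");
  return "A longer candidate produced by the second pass.";
};
const justextStub: Extractor = () => {
  calls.push("justext");
  return "An even longer candidate that this run never needs to compute at all.";
};

const chain = [readabilityStub, justextStub];
const full = extractWithFallbacks("<html></html>", primaryStub, chain, { fast: false, minLength: 20 });
const fastResult = extractWithFallbacks("<html></html>", primaryStub, chain, { fast: true, minLength: 20 });
```

In the first call the second pass already clears the bar, so the third extractor never runs. In fast mode only the primary extractor runs, which is where the throughput gain comes from.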
What makes rs-trafilatura distinct
The Rust port isn't a line-for-line translation. It keeps the original heuristic core but layers a few things on top that the Python library doesn't have.
The first is machine-learning page-type classification. Before extraction even starts, an XGBoost model (200 trees, 181 features) sorts each page into one of seven types (article, documentation, service, forum, collection, listing, or product) [6]. A forum thread and a product page need completely different extraction strategies, and guessing the type up front is what lets the engine pick the right one instead of treating everything like a news article.
Off the back of that classification come per-type extraction profiles. The port ships dedicated handling for 12 forum platforms and 4 documentation frameworks, plus a JSON-LD fallback for product pages when the structured data is present [6]. This is the kind of thing that's tedious to maintain but pays off: a vBulletin forum and a Discourse forum look nothing alike in the DOM, and hard-coded profiles beat generic heuristics on both.
The third addition is confidence scoring. A separate 27-feature XGBoost quality predictor rates each extraction from 0.0 to 1.0 [6]. You get a number back alongside the text, so a pipeline can route low-confidence results to review or a second pass instead of silently shipping garbage. If you've ever run an extractor across ten thousand pages and had no idea which results to trust, you'll appreciate this one.
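Type-based routing can be pictured as a lookup from page type to profile. The TypeScript below is a sketch with hypothetical names; it assumes the page type is already known, and it tries JSON-LD first for product pages, which may not match the order the real engine uses.

```typescript
type PageType =
  | "article" | "documentation" | "service" | "forum"
  | "collection" | "listing" | "product";

type Profile = (html: string) => string | null; // null means "nothing found"

// Hypothetical product profile: read name and description from a JSON-LD block.
function productFromJsonLd(html: string): string | null {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(match[1] ?? "") as Record<string, unknown> | null;
      if (data === null) continue;
      const name = data["name"];
      const description = data["description"];
      if (data["@type"] === "Product" && typeof name === "string") {
        return (name + "\n" + (typeof description === "string" ? description : "")).trim();
      }
    } catch {
      // Malformed JSON-LD: skip this block and try the next one.
    }
  }
  return null;
}

// Only one profile is registered here; the others would slot in the same way.
const profiles: Partial<Record<PageType, Profile>> = { product: productFromJsonLd };

function extractByType(
  html: string,
  pageType: PageType,
  generic: (html: string) => string
): string {
  const profile = profiles[pageType];
  const specific = profile ? profile(html) : null;
  return specific !== null ? specific : generic(html);
}

const productPage =
  '<script type="application/ld+json">' +
  '{"@type":"Product","name":"Trail Shoe","description":"Lightweight running shoe."}' +
  "</script><div>Add to cart</div>";
const genericStub = (html: string): string => "generic: " + html.length + " chars";

const fromProfile = extractByType(productPage, "product", genericStub);
const fromGeneric = extractByType("<p>Hello</p>", "article", genericStub);
```

Pages without a registered profile, or product pages without usable structured data, fall through to the generic extractor.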
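One way a pipeline might act on that score, sketched in TypeScript. The ScoredExtraction shape, the route names, and both thresholds are assumptions for illustration; pick cut-offs from your own data.

```typescript
// Hypothetical shape of one result: text plus the 0.0 to 1.0 confidence score.
interface ScoredExtraction {
  url: string;
  text: string;
  confidence: number;
}

type Route = "accept" | "retry" | "review";

// Illustrative thresholds: accept at 0.8 and above, retry at 0.5 and above.
function routeByConfidence(
  result: ScoredExtraction,
  acceptAt: number = 0.8,
  retryAt: number = 0.5
): Route {
  if (result.confidence >= acceptAt) return "accept";
  if (result.confidence >= retryAt) return "retry";
  return "review";
}

const batch: ScoredExtraction[] = [
  { url: "https://example.com/a", text: "...", confidence: 0.93 },
  { url: "https://example.com/b", text: "...", confidence: 0.61 },
  { url: "https://example.com/c", text: "...", confidence: 0.2 },
];

const routed = batch.map((r) => r.url + " -> " + routeByConfidence(r));
console.log(routed.join("\n"));
// https://example.com/a -> accept
// https://example.com/b -> retry
// https://example.com/c -> review
```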
Output formats
Most extraction libraries give you plain text and maybe HTML. Contextractor gives you five practical formats: plain text for raw content, Markdown (preserves headings, lists, and emphasis), cleaned HTML that keeps document structure without the cruft, JSON with the metadata fields attached, and the original raw HTML of the page when you need the unprocessed source [3].
For most downstream work that's the full set you actually want. Markdown feeds straight into LLM pipelines and static-site generators; JSON carries the title/author/date metadata for indexing; plain text is the lowest common denominator; cleaned HTML keeps structure when you need to re-render; and the original raw HTML preserves the untouched page source for archival or re-extraction.
Configuration
The extraction API exposes a practical set of toggles [3]:
- include comments: pull in user comments (on by default)
- include tables: text from <table> elements (on by default)
- include links: preserve href targets
- include images: keep image URLs and alt text
- favor precision: aggressive filtering, less noise
- favor recall: wider net, more content captured
- target language: filter by ISO 639-1 code
- custom prune rules: XPath to remove specific elements
The precision/recall toggle is genuinely useful. Building a training dataset and want only the clean core? Favor precision. Archiving web content where missing a paragraph is worse than catching a stray sidebar line? Favor recall.
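As a sketch, the toggles could be modeled as a typed options object. Every field name below is hypothetical and does not reflect Contextractor's actual API; the defaults for links and images are assumed, and folding precision and recall into one focus field is a design choice of this sketch so the two cannot both be switched on.

```typescript
// Hypothetical options type mirroring the toggles listed above.
interface ExtractionOptions {
  includeComments: boolean;
  includeTables: boolean;
  includeLinks: boolean;
  includeImages: boolean;
  focus: "balanced" | "precision" | "recall";
  targetLanguage: string | null; // ISO 639-1 code such as "en", or null for no filter
  pruneXPath: string[];          // XPath expressions for elements to remove
}

const defaultOptions: ExtractionOptions = {
  includeComments: true, // on by default, per the list above
  includeTables: true,   // on by default, per the list above
  includeLinks: false,   // assumed default
  includeImages: false,  // assumed default
  focus: "balanced",
  targetLanguage: null,
  pruneXPath: [],
};

// Merge overrides onto the defaults and reject malformed language codes.
function withOptions(overrides: Partial<ExtractionOptions>): ExtractionOptions {
  const merged: ExtractionOptions = { ...defaultOptions, ...overrides };
  if (merged.targetLanguage !== null && !/^[a-z]{2}$/.test(merged.targetLanguage)) {
    throw new Error("targetLanguage must be a two-letter ISO 639-1 code");
  }
  return merged;
}

// Training data: clean core only. Archive: wide net, one custom prune rule.
const trainingSet = withOptions({ focus: "precision", includeComments: false });
const archive = withOptions({
  focus: "recall",
  pruneXPath: ["//div[@class='newsletter-signup']"],
});
```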
Benchmarks
On the ScrapingHub article extraction set (181 articles), the Rust port edges out both the Go and Python implementations [6]:
| Implementation | Approach | F1 | Precision | Recall |
|---|---|---|---|---|
| rs-trafilatura | Heuristic + ML routing | 0.966 | 0.942 | 0.991 |
| go-trafilatura | Heuristic port (Go) | 0.960 | — | — |
| Python trafilatura | Heuristic + fallbacks | 0.958 | — | — |
The headline number there is recall: 0.991 means it's missing almost nothing on clean article pages. Precision sits at 0.942, which is the expected trade — catch everything, occasionally grab a line you didn't want.
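The F1 column is the harmonic mean of precision and recall, so the rs-trafilatura row can be checked directly. A small TypeScript sketch of that arithmetic:

```typescript
// F1 is the harmonic mean of precision and recall.
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

const rsTrafilaturaF1 = f1(0.942, 0.991);
console.log(rsTrafilaturaF1.toFixed(3)); // 0.966
```

Because the harmonic mean is pulled toward the lower of the two values, the 0.966 sits closer to the midpoint than to the 0.991 recall.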
Articles are the easy case, though. The more honest test is the WCXB multi-type benchmark, which spans 2,008 annotated pages across all seven page types. The port scores F1 0.859 on the 1,497-page development set and 0.893 on the 511-page held-out test set [6]. The per-type spread is wide and tells the real story:
| Page type | F1 |
|---|---|
| Article | 0.932 |
| Documentation | 0.931 |
| Service | 0.843 |
| Forum | 0.792 |
| Collection | 0.713 |
| Listing | 0.704 |
| Product | 0.670 |
Articles and documentation are nearly solved. Products sit at 0.670 — which makes sense, since a product page is mostly structured fields, reviews, and recommendation rails rather than a single body of prose. No extractor magically reads a product page like an article, and the port doesn't pretend otherwise.
Speed holds up too: roughly 71 article files per second, about 46 per second averaged across all page types [6]. Running natively instead of through a Python interpreter is a big part of why.
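For capacity planning, those rates translate into batch times with simple division. The TypeScript sketch below assumes a single worker sustaining the quoted benchmark rate, which real workloads may not match; the 10,000-page batch size is an arbitrary example.

```typescript
// Seconds needed to process `pages` at a steady `pagesPerSecond`.
function secondsForBatch(pages: number, pagesPerSecond: number): number {
  if (pagesPerSecond <= 0) throw new Error("rate must be positive");
  return pages / pagesPerSecond;
}

const articleRun = secondsForBatch(10000, 71); // article-only rate
const mixedRun = secondsForBatch(10000, 46);   // rate averaged across page types

console.log(Math.round(articleRun), Math.round(mixedRun)); // 141 217
```

At the mixed rate that is a little over three and a half minutes for 10,000 pre-fetched pages.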
Practical bits
Contextractor installs from npm as contextractor, and the native binding comes with it. No interpreter, no virtualenv, no system packages: run npm install and the native binary comes down with it. Extracting content from pre-fetched HTML takes single-digit milliseconds, since the heavy lifting happens in compiled Rust rather than an interpreted layer.
One thing to keep in mind: the extraction step works on static HTML. JavaScript-rendered content (React SPAs and the like) has to be rendered first. Contextractor handles that with Crawlee driving Playwright — the browser renders the page, the rendered HTML goes to the extractor. The extraction engine itself doesn't execute JS, and it doesn't need to; that's a separate concern with its own tool.
The port is dual-licensed MIT OR Apache-2.0, so there's no copyleft friction for commercial use — pick whichever license fits your project.
Citations
1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131
2. Adrien Barbaresi: Personal website. Retrieved May 31, 2026
3. Trafilatura: Documentation. Retrieved May 31, 2026
4. Trafilatura: PyPI package page. Retrieved May 31, 2026
5. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023
6. Murrough Foley: rs-trafilatura, Rust port of Trafilatura. Retrieved May 31, 2026
Updated: June 12, 2026