# Trafilatura: web content extraction with Python
Fetch any web page and look at the HTML source. You'll find the article text buried under layers of navigation menus, cookie banners, ad containers, sidebar widgets, and footer links. A typical news page might have 2,000 words of actual content inside 150KB of markup that has nothing to do with it. Just stripping all HTML tags doesn't work either — you end up with navigation labels mashed into article paragraphs.
Trafilatura is a Python library that solves this. Give it raw HTML, get back the main text a human would actually read, plus metadata like title, author, date, and categories [1]. The name comes from Italian: *trafilatura* means "wire drawing," the process of pulling raw material through a die to refine it. Fitting.
Contextractor uses Trafilatura as its extraction engine under the hood.
## Origin
Adrien Barbaresi, a research scientist at the Berlin-Brandenburg Academy of Sciences (BBAW), built Trafilatura out of necessity [2]. His work on the DWDS and ZDL digital lexicography projects required building large text corpora from web sources, and corpus linguists need clean, structured text at scale, not HTML soup.
He presented it formally at ACL 2021 [1], and adoption spread quickly beyond academia. HuggingFace, IBM, Microsoft Research, Stanford, and the Allen Institute all run it in production pipelines [3]. The library passed 5,400 GitHub stars by early 2025, and version 2.0.0 shipped in December 2024 [4].
## How extraction works
No machine learning is involved in the core extraction. That's actually an advantage: a 2023 SIGIR paper benchmarking 14 extraction tools found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad" [5].
The pipeline runs in stages. Raw HTML gets parsed into an lxml tree, then tree pruning strips elements that are almost never content: `<nav>`, `<footer>`, `<aside>`, known ad container classes, social widgets [3]. After that, remaining nodes go through content scoring: text density (text vs. markup ratio), link density (navigation blocks are mostly links), and element type classification.
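The two density signals are easy to illustrate. Below is a toy computation of link density on a synthetic, well-formed snippet; this is an illustration of the idea, not Trafilatura's actual implementation, and it uses the stdlib XML parser rather than lxml:

```python
# Toy illustration of a scoring signal used by heuristic extractors:
# link density (share of an element's text that sits inside <a> tags).
# NOT Trafilatura's real code -- just the concept.
import xml.etree.ElementTree as ET

def link_density(elem):
    """Fraction of an element's visible text contained in <a> descendants."""
    total = len("".join(elem.itertext()))
    linked = sum(len("".join(a.itertext())) for a in elem.iter("a"))
    return linked / total if total else 0.0

html = """<body>
  <nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></nav>
  <div class="article">
    <p>A long run of prose with a single inline <a href="/ref">reference</a>
    scores a low link density, so a heuristic extractor keeps it.</p>
  </div>
</body>"""

root = ET.fromstring(html)
nav = root.find("nav")        # navigation block: nearly all text is links
article = root.find("div")    # article block: mostly prose
print(round(link_density(nav), 2))
print(round(link_density(article), 2))
```

A real pipeline combines this with text density and element-type rules, but the intuition carries over: navigation scores high, body prose scores low.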
The clever part is the fallback chain. If the initial extraction looks too short or noisy, Trafilatura tries readability-lxml (a Python port of Mozilla's Readability). Still not good enough? It falls back to jusText, a boilerplate removal library from Masaryk University [1]. The outputs get compared, and the best result wins.
Set `fast=True` to skip the fallbacks; this roughly doubles throughput.
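The fallback logic can be sketched as follows. The extractor functions here are stand-ins for the primary algorithm, readability-lxml, and jusText, and the "best result" comparison is reduced to output length; Trafilatura's real selection heuristics are more involved:

```python
# Sketch of a fallback chain, with stub extractors standing in for the
# primary heuristics, readability-lxml, and jusText. Illustrative only.
def primary(html):
    # Pretend the primary extractor fails (returns nothing) on short input.
    return html if len(html) > 200 else ""

def readability_port(html):
    return "fallback text from readability"

def justext_fallback(html):
    return "fallback text from jusText"

def extract_with_fallbacks(html, min_length=100):
    text = primary(html)
    if len(text) >= min_length:
        return text  # initial extraction looks fine, no fallbacks needed
    # Result too short or noisy: run the fallbacks and keep the best output
    # (here, crudely, the longest one).
    candidates = [text, readability_port(html), justext_fallback(html)]
    return max(candidates, key=len)
```

Skipping the fallback calls entirely is what a `fast` mode buys you: one extractor run per document instead of up to three.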
## Output formats
Most extraction libraries give you plain text and maybe HTML. Trafilatura supports seven formats: plain text, Markdown (preserves headings and lists), cleaned HTML, XML with a custom schema, XML-TEI conforming to the Text Encoding Initiative standard, JSON with metadata fields, and CSV for batch work [3].
The TEI output is what makes it popular in academic circles. TEI is the de facto standard for encoding texts in humanities research, and getting TEI-XML directly from a web scraper skips an entire post-processing step. You can validate the output against the TEI schema with `tei_validation=True` [6].
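For orientation, a minimal TEI document has roughly this shape: a `teiHeader` carrying metadata and a `text/body` carrying the content. This skeleton is illustrative of the standard, not a verbatim sample of Trafilatura's serialization:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Article title</title>
      </titleStmt>
      <publicationStmt>
        <p>Source URL and retrieval date</p>
      </publicationStmt>
      <sourceDesc>
        <p>Downloaded web page</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>Article title</head>
      <p>First extracted paragraph…</p>
    </body>
  </text>
</TEI>
```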
## Configuration
The `extract()` function exposes a practical set of parameters [3]:
- `include_comments` — extract user comments (on by default)
- `include_tables` — text from `<table>` elements (on by default)
- `include_links` — preserve `href` targets
- `include_images` — keep image URLs and alt text
- `favor_precision` — aggressive filtering, less noise
- `favor_recall` — wider net, more content captured
- `target_language` — filter by ISO 639-1 code
- `prune_xpath` — custom XPath to remove specific elements
The precision/recall toggle is genuinely useful. Building a training dataset? `favor_precision=True`. Archiving web content? `favor_recall=True`.
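A toy model shows what such a toggle changes. Here the trade-off is reduced to a single link-density threshold that gets tightened or loosened; the threshold values and the single-signal design are invented for illustration and do not reflect Trafilatura's actual heuristics:

```python
# Toy model of a precision/recall toggle: favor_precision tightens the
# link-density cut-off, favor_recall loosens it. The numbers here are
# made up for illustration -- not Trafilatura's real thresholds.
def keep_block(link_density, favor_precision=False, favor_recall=False):
    threshold = 0.5                 # default cut-off
    if favor_precision:
        threshold = 0.3             # stricter: drop anything link-heavy
    if favor_recall:
        threshold = 0.7             # looser: keep borderline blocks
    return link_density < threshold

borderline = 0.6  # e.g. a "related articles" box with some real prose
print(keep_block(borderline))                     # default mode: dropped
print(keep_block(borderline, favor_recall=True))  # recall mode: kept
```

The same block can thus be noise in one mode and content in the other, which is exactly the trade-off the two flags let you pick.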
## Benchmarks
Here's how Trafilatura compares in the ScrapingHub article extraction benchmark and the Bevendorff et al. SIGIR 2023 study [5][7]:
| Tool | Approach | F1 (ScrapingHub) | Notes |
|---|---|---|---|
| Trafilatura 2.0 | Heuristic + fallbacks | 0.958 | Best mean across datasets |
| newspaper4k | Heuristic, news-focused | 0.949 | Successor to newspaper3k |
| readability (JS) | Heuristic (Mozilla) | 0.947 | Highest median in SIGIR study |
| readability-lxml | Heuristic (Python port) | 0.922 | Solid but lower mean |
| goose3 | Heuristic | 0.896 | Article extraction focus |
| jusText | Heuristic | 0.804 | Good at boilerplate removal |
The SIGIR benchmark, which combined eight evaluation datasets, found Trafilatura had the best overall mean (0.883 F1) while Readability had the highest median (0.970) [5]. No single tool dominated every dataset, which is honest and expected.
Worth noting: newspaper3k went through long stretches of being unmaintained (newspaper4k picked up the torch), and boilerpipe is a Java library with aging Python bindings. Trafilatura ships regular releases, which matters when web markup patterns keep evolving.
## Practical bits
Runs on Python 3.8+. `pip install trafilatura` and you're done: no browser dependency, everything runs through lxml and HTTP requests [4]. Extracting content from pre-downloaded HTML takes single-digit milliseconds.
There's a CLI for quick jobs:
```bash
trafilatura -u "https://example.com/article"
```
For batch work, it handles parallel URL processing and can discover URLs through sitemaps and RSS feeds — features most extraction-only libraries don't bother with [3].
One thing to keep in mind: Trafilatura works on static HTML. JavaScript-rendered content (React SPAs, for instance) needs a headless browser like Playwright first — then pass the rendered HTML to Trafilatura. The library doesn't execute JS.
The license switched from GPLv3+ to Apache 2.0 at version 1.8.0, which removed a big adoption barrier for commercial use [4].
## Citations
1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122–131.
2. Adrien Barbaresi: Personal website. Retrieved March 16, 2026.
3. Trafilatura: Documentation. Retrieved March 16, 2026.
4. Trafilatura: PyPI package page. Retrieved March 16, 2026.
5. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023.
6. TEI Consortium: TEI: Text Encoding Initiative. Retrieved March 16, 2026.
7. Trafilatura: Evaluation and benchmarks. Retrieved March 16, 2026.
Updated: March 16, 2026