Trafilatura: web content extraction with Python
The boilerplate problem
Anyone who's tried to scrape article text from a web page knows the pain. You fetch the HTML, and what you get back is a tangled mess of navigation menus, cookie banners, sidebar widgets, footer links, ad containers, and -- somewhere buried in the middle -- the actual content you wanted. The text-to-markup ratio on a typical news article is shockingly low; the article itself might be 2,000 words, but the HTML source runs to 150KB of DOM nodes that have nothing to do with it.
This is the boilerplate problem, and it's been a persistent headache in web scraping, corpus linguistics, and NLP data pipelines for decades. You can't just strip all HTML tags and call it a day -- that gives you a wall of concatenated text where navigation labels mix with article paragraphs and JavaScript snippets.
Trafilatura is a Python library built specifically to solve this. It takes raw HTML as input and returns the main textual content -- the part a human would actually read -- along with metadata like title, author, date, and categories [1]. The name itself is Italian; trafilatura means "wire drawing," referring to the process of pulling raw material through a die to produce something refined -- a fitting metaphor for what the library does to HTML.
Who built it and why
Trafilatura was created by Adrien Barbaresi, a research scientist at the Berlin-Brandenburg Academy of Sciences (BBAW), where he works on the DWDS and ZDL digital lexicography projects [2]. The library grew out of a practical need: building large text corpora from web sources for linguistic research. Corpus linguists need clean, structured text -- not HTML soup -- and they need it at scale.
Barbaresi published the tool as an open-source project and presented it formally at ACL 2021 (the Annual Meeting of the Association for Computational Linguistics) in a systems demonstration paper [1]. Since then it's been adopted well beyond academia. HuggingFace, IBM, Microsoft Research, Stanford, and the Allen Institute all use it in production pipelines [3]. The library crossed 5,400 stars on GitHub as of early 2025 and version 2.0.0 shipped in December 2024 [4].
Contextractor uses Trafilatura under the hood as its primary extraction engine.
How the extraction works
Trafilatura doesn't use machine learning for its core extraction. It's heuristic-based, which turns out to be an advantage -- a 2023 SIGIR paper by Bevendorff et al. that benchmarked 14 extraction tools found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad" [5].
The pipeline works in stages. First, the raw HTML is parsed into an lxml tree structure. Then comes tree pruning: XPath expressions strip out elements that are almost never part of the main content -- `<nav>`, `<footer>`, `<aside>`, known ad container class names, social media widgets, and similar boilerplate patterns [3].
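The idea behind the pruning stage can be sketched with plain lxml. This is an illustration of the technique, not Trafilatura's actual rule set, and the `ad-banner` class name is invented:

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring("""
<html><body>
  <nav><a href="/">Home</a><a href="/tags">Tags</a></nav>
  <div class="ad-banner">Buy now!</div>
  <article><p>The actual story text lives here.</p></article>
  <footer>Imprint</footer>
</body></html>
""")

# XPath targeting elements that are almost never main content:
# structural boilerplate tags plus a known ad-container class pattern.
BOILERPLATE = '//nav | //footer | //aside | //*[contains(@class, "ad-")]'
for node in doc.xpath(BOILERPLATE):
    node.getparent().remove(node)

# Only the article text survives the pruning pass.
print(doc.text_content().strip())
```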
After pruning, the remaining nodes go through content scoring. Each element gets evaluated based on text density (how much text relative to markup), link density (navigation-heavy blocks tend to be mostly links), and element type classification. Paragraphs inside an `<article>` tag score differently than text inside a `<div class="sidebar">`.
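Link density in particular is cheap to compute. A hypothetical helper in the same spirit (not Trafilatura's own scoring code):

```python
from lxml import html as lxml_html

def link_density(element) -> float:
    """Fraction of an element's text that sits inside <a> tags.

    Values near 1.0 suggest a navigation block; values near 0.0
    suggest body text. (Illustrative helper, not Trafilatura's own.)
    """
    total = len(element.text_content())
    if total == 0:
        return 0.0
    linked = sum(len(a.text_content()) for a in element.iter("a"))
    return linked / total

nav = lxml_html.fromstring(
    '<ul><li><a href="/">Home</a></li><li><a href="/faq">FAQ</a></li></ul>')
para = lxml_html.fromstring(
    '<p>A long paragraph with a single <a href="#">link</a> inside it.</p>')

print(link_density(nav))   # 1.0 -- every character is link text
print(link_density(para))  # well below 0.5
```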
Here's the clever part: if the initial extraction produces output that looks too short or too noisy, Trafilatura doesn't just give up. It falls back to readability-lxml (a Python port of Mozilla's Readability algorithm) for a second attempt. If that still doesn't look right, there's a third fallback to jusText, a boilerplate removal library originally developed at Masaryk University [1]. The outputs get compared, and the best result wins.
You can skip the fallbacks entirely with `fast=True` if speed matters more than coverage -- that roughly doubles throughput.
Output formats
One thing I appreciate about Trafilatura is the range of output formats. Most extraction libraries give you plain text and maybe HTML. Trafilatura supports seven:
- Plain text -- stripped of all markup, just paragraphs
- Markdown -- preserves headings, lists, emphasis
- HTML -- cleaned HTML with structure intact
- XML -- custom schema with metadata embedded
- XML-TEI -- conformant to the Text Encoding Initiative standard, which matters a lot in digital humanities and corpus linguistics [3]
- JSON -- structured output with metadata fields
- CSV -- tabular format for batch processing
The XML-TEI output is what makes Trafilatura particularly popular in academic settings. TEI is the de facto standard for encoding texts in humanities research, and having direct TEI output from a web scraper saves a significant post-processing step. You can even validate the output against the TEI schema by passing `tei_validation=True` [3].
Configuration knobs
The `extract()` function has a decent set of parameters for controlling what gets included [3]:
- `include_comments` -- whether to extract user comments (on by default)
- `include_tables` -- extract text from `<table>` elements (on by default)
- `include_links` -- preserve `href` targets in the output
- `include_images` -- keep track of image URLs and alt text
- `include_formatting` -- retain structural elements like headings and lists
- `favor_precision` -- focus on the most central content, aggressively filtering noise
- `favor_recall` -- cast a wider net, include more content at the risk of some boilerplate
- `target_language` -- filter results by ISO 639-1 language code
- `prune_xpath` -- custom XPath expressions to remove specific elements
The precision/recall trade-off is genuinely useful. If you're building a training dataset and care more about clean text, `favor_precision=True` is the right call. If you're archiving content and don't want to miss anything, `favor_recall=True` makes more sense.
How it stacks up against alternatives
There's no shortage of content extraction libraries. Here's how Trafilatura compares to the main ones, based on the Bevendorff et al. SIGIR 2023 benchmark and Trafilatura's own evaluation dataset [5][6]:
| Tool | Approach | F1 Score | Notes |
|---|---|---|---|
| Trafilatura | Heuristic + fallbacks | 0.909 | Best mean F1 across datasets |
| readability-lxml | Heuristic (Mozilla port) | 0.801 | High median, lower mean |
| newspaper3k | Heuristic, news-focused | 0.713 | Geared toward news articles |
| jusText | Heuristic | 0.742 | Good at boilerplate removal |
| goose3 | Heuristic | 0.793 | Article extraction focus |
| boilerpipe (boilerpy3) | ML (shallow features) | 0.777 | Java origins, Python wrapper |
The F1 scores above are from Trafilatura's 750-document evaluation set [6]. The SIGIR 2023 benchmark, which combined eight different evaluation datasets, found that Trafilatura had the best overall mean performance (0.883 mean F1) while Readability had the highest median (0.970) [5]. No single tool dominated every dataset -- which is honest and expected.
One thing the benchmarks don't capture well: newspaper3k (now newspaper4k) has been essentially unmaintained for stretches at a time, and boilerpipe is a Java library with aging Python bindings. Trafilatura is actively maintained with regular releases, which matters when web pages keep evolving their markup patterns.
What people use it for
The use cases cluster around a few areas:
Corpus building and linguistic research -- this is Trafilatura's origin story. Researchers scraping thousands of web pages to build text corpora for analysis need clean, structured output. The TEI-XML format and metadata extraction make it particularly well-suited here [1].
LLM training data preparation -- with the explosion of large language models, there's massive demand for clean web text. Trafilatura shows up in data pipelines at HuggingFace and similar organizations that need to process web crawl data at scale [3].
Data journalism -- journalists investigating trends across hundreds of news sources need the article text without the surrounding chrome. Trafilatura's batch processing capabilities and JSON output fit this workflow well.
Content archiving -- digital preservation projects that want to capture the meaningful content of web pages (rather than full HTML snapshots) can use Trafilatura to distill pages down to their essential text and metadata.
SEO and content analysis -- analyzing competitor content or auditing large sites becomes much simpler when you can extract just the main text from each page programmatically.
Practical considerations
Trafilatura runs on Python 3.8+ and installs with a straightforward `pip install trafilatura` [4]. It has no browser dependency -- everything runs through lxml and HTTP requests, which keeps it fast. On my benchmarks, extracting content from a pre-downloaded HTML page takes single-digit milliseconds.
The library also includes a command-line interface, which is handy for quick one-off extractions:
`trafilatura -u "https://example.com/article"`
For batch operations, it supports parallel processing of URL lists and can discover URLs through sitemaps and RSS feeds -- features that most extraction-only libraries lack [3].
One caveat: Trafilatura works on static HTML. If a page loads its content through JavaScript (like a React single-page app), you'll need to render it first with something like Playwright or Selenium, then pass the rendered HTML to Trafilatura. The library itself doesn't execute JavaScript.
The license changed from GPLv3+ to Apache 2.0 starting with version 1.8.0, which removed a significant adoption barrier for commercial projects [4].
Citations
1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131.
2. Adrien Barbaresi: Personal website. Retrieved March 4, 2026.
3. Trafilatura: Documentation. Retrieved March 4, 2026.
4. Trafilatura: PyPI package page. Retrieved March 4, 2026.
5. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023.
6. Trafilatura: Evaluation and benchmarks. Retrieved March 4, 2026.
Updated: March 4, 2026