Trafilatura: web content extraction with Python
The boilerplate problem
Anyone who's tried to scrape article text from a web page knows the pain. You fetch the HTML, and what you get back is a tangled mess of navigation menus, cookie banners, sidebar widgets, footer links, ad containers, and -- somewhere buried in the middle -- the actual content you wanted. The text-to-markup ratio on a typical news article is shockingly low; the article itself might be 2,000 words, but the HTML source runs to 150KB of DOM nodes that have nothing to do with it.
This is the boilerplate problem, and it's been a persistent headache in web scraping, corpus linguistics, and NLP data pipelines for decades. You can't just strip all HTML tags and call it a day -- that gives you a wall of concatenated text where navigation labels mix with article paragraphs and JavaScript snippets.
Trafilatura is a Python library built specifically to solve this. It takes raw HTML as input and returns the main textual content -- the part a human would actually read -- along with metadata like title, author, date, and categories [1]. The name itself is Italian; trafilatura means "wire drawing," referring to the process of pulling raw material through a die to produce something refined -- a fitting metaphor for what the library does to HTML.
Who built it and why
Trafilatura was created by Adrien Barbaresi, a research scientist at the Berlin-Brandenburg Academy of Sciences (BBAW), where he works on the DWDS and ZDL digital lexicography projects [2]. The library grew out of a practical need: building large text corpora from web sources for linguistic research. Corpus linguists need clean, structured text -- not HTML soup -- and they need it at scale.
Barbaresi published the tool as an open-source project and presented it formally at ACL 2021 (the Annual Meeting of the Association for Computational Linguistics) in a systems demonstration paper [1]. Since then it's been adopted well beyond academia. HuggingFace, IBM, Microsoft Research, Stanford, and the Allen Institute all use it in production pipelines [3]. The library crossed 5,400 stars on GitHub as of early 2025 and version 2.0.0 shipped in December 2024 [4].
Contextractor uses Trafilatura under the hood as its primary extraction engine.
How the extraction works
Trafilatura doesn't use machine learning for its core extraction. It's heuristic-based, which turns out to be an advantage -- a 2023 SIGIR paper by Bevendorff et al. that benchmarked 14 extraction tools found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad" [5].
The pipeline works in stages. First, the raw HTML is parsed into an lxml tree structure. Then comes tree pruning: XPath expressions strip out elements that are almost never part of the main content -- `<nav>`, `<footer>`, `<aside>`, known ad container class names, social media widgets, and similar boilerplate patterns [3].
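The idea behind the pruning stage can be sketched with plain lxml. This is an illustration of the technique, not Trafilatura's actual rule set, and the `ad-banner` class name is invented:

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring("""
<html><body>
  <nav><a href="/">Home</a><a href="/tags">Tags</a></nav>
  <div class="ad-banner">Buy now!</div>
  <article><p>The actual story text lives here.</p></article>
  <footer>Imprint</footer>
</body></html>
""")

# XPath targeting elements that are almost never main content:
# structural boilerplate tags plus a known ad-container class pattern.
BOILERPLATE = '//nav | //footer | //aside | //*[contains(@class, "ad-")]'
for node in doc.xpath(BOILERPLATE):
    node.getparent().remove(node)

# Only the article text survives the pruning pass.
print(doc.text_content().strip())
```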
After pruning, the remaining nodes go through content scoring. Each element gets evaluated based on text density (how much text relative to markup), link density (navigation-heavy blocks tend to be mostly links), and element type classification. Paragraphs inside an `<article>` tag score differently than text inside a `<div class="sidebar">`.
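Link density in particular is cheap to compute. A hypothetical helper in the same spirit (not Trafilatura's own scoring code):

```python
from lxml import html as lxml_html

def link_density(element) -> float:
    """Fraction of an element's text that sits inside <a> tags.

    Values near 1.0 suggest a navigation block; values near 0.0
    suggest body text. (Illustrative helper, not Trafilatura's own.)
    """
    total = len(element.text_content())
    if total == 0:
        return 0.0
    linked = sum(len(a.text_content()) for a in element.iter("a"))
    return linked / total

nav = lxml_html.fromstring(
    '<ul><li><a href="/">Home</a></li><li><a href="/faq">FAQ</a></li></ul>')
para = lxml_html.fromstring(
    '<p>A long paragraph with a single <a href="#">link</a> inside it.</p>')

print(link_density(nav))   # 1.0 -- every character is link text
print(link_density(para))  # well below 0.5
```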
Here's the clever part: if the initial extraction produces output that looks too short or too noisy, Trafilatura doesn't just give up. It falls back to readability-lxml (a Python port of Mozilla's Readability algorithm) for a second attempt. If that still doesn't look right, there's a third fallback to jusText, a boilerplate removal library originally developed at Masaryk University [1]. The outputs get compared, and the best result wins.
You can skip the fallbacks entirely with `fast=True` if speed matters more than coverage -- that roughly doubles throughput.
Output formats
One thing I appreciate about Trafilatura is the range of output formats. Most extraction libraries give you plain text and maybe HTML. Trafilatura supports seven:
- Plain text -- stripped of all markup, just paragraphs
- Markdown -- preserves headings, lists, emphasis
- HTML -- cleaned HTML with structure intact
- XML -- custom schema with metadata embedded
- XML-TEI -- conformant to the Text Encoding Initiative standard, which matters a lot in digital humanities and corpus linguistics [3]
- JSON -- structured output with metadata fields
- CSV -- tabular format for batch processing
The XML-TEI output is what makes Trafilatura particularly popular in academic settings. TEI is the de facto standard for encoding texts in humanities research, and having direct TEI output from a web scraper saves a significant post-processing step. You can even validate the output against the TEI schema by passing `tei_validation=True` [3].
Configuration knobs
The `extract()` function has a decent set of parameters for controlling what gets included [3]:
- `include_comments` -- whether to extract user comments (on by default)
- `include_tables` -- extract text from `<table>` elements (on by default)
- `include_links` -- preserve `href` targets in the output
- `include_images` -- keep track of image URLs and alt text
- `include_formatting` -- retain structural elements like headings and lists
- `favor_precision` -- focus on the most central content, aggressively filtering noise
- `favor_recall` -- cast a wider net, include more content at the risk of some boilerplate
- `target_language` -- filter results by ISO 639-1 language code
- `prune_xpath` -- custom XPath expressions to remove specific elements
The precision/recall trade-off is genuinely useful. If you're building a training dataset and care more about clean text, `favor_precision=True` is the right call. If you're archiving content and don't want to miss anything, `favor_recall=True` makes more sense.
How it stacks up against alternatives
There's no shortage of content extraction libraries. Here's how Trafilatura compares to the main ones, based on the Bevendorff et al. SIGIR 2023 benchmark and Trafilatura's own evaluation dataset [5][6]:
| Tool | Approach | F1 Score | Notes |
|---|---|---|---|
| Trafilatura | Heuristic + fallbacks | 0.909 | Best mean F1 across datasets |
| readability-lxml | Heuristic (Mozilla port) | 0.801 | High median, lower mean |
| newspaper3k | Heuristic, news-focused | 0.713 | Geared toward news articles |
| jusText | Heuristic | 0.742 | Good at boilerplate removal |
| goose3 | Heuristic | 0.793 | Article extraction focus |
| boilerpipe (boilerpy3) | ML (shallow features) | 0.777 | Java origins, Python wrapper |
The F1 scores above are from Trafilatura's 750-document evaluation set [6]. The SIGIR 2023 benchmark, which combined eight different evaluation datasets, found that Trafilatura had the best overall mean performance (0.883 mean F1) while Readability had the highest median (0.970) [5]. No single tool dominated every dataset -- which is honest and expected.
One thing the benchmarks don't capture well: newspaper3k (now newspaper4k) has been essentially unmaintained for stretches at a time, and boilerpipe is a Java library with aging Python bindings. Trafilatura is actively maintained with regular releases, which matters when web pages keep evolving their markup patterns.
What people use it for
The use cases cluster around a few areas:
Corpus building and linguistic research -- this is Trafilatura's origin story. Researchers scraping thousands of web pages to build text corpora for analysis need clean, structured output. The TEI-XML format and metadata extraction make it particularly well-suited here [1].
LLM training data preparation -- with the explosion of large language models, there's massive demand for clean web text. Trafilatura shows up in data pipelines at HuggingFace and similar organizations that need to process web crawl data at scale [3].
Data journalism -- journalists investigating trends across hundreds of news sources need the article text without the surrounding chrome. Trafilatura's batch processing capabilities and JSON output fit this workflow well.
Content archiving -- digital preservation projects that want to capture the meaningful content of web pages (rather than full HTML snapshots) can use Trafilatura to distill pages down to their essential text and metadata.
SEO and content analysis -- analyzing competitor content or auditing large sites becomes much simpler when you can extract just the main text from each page programmatically.
Practical considerations
Trafilatura runs on Python 3.8+ and installs with a straightforward `pip install trafilatura` [4]. It has no browser dependency -- everything runs through lxml and HTTP requests, which keeps it fast. On my benchmarks, extracting content from a pre-downloaded HTML page takes single-digit milliseconds.
The library also includes a command-line interface, which is handy for quick one-off extractions:
`trafilatura -u "https://example.com/article"`
For batch operations, it supports parallel processing of URL lists and can discover URLs through sitemaps and RSS feeds -- features that most extraction-only libraries lack [3].
One caveat: Trafilatura works on static HTML. If a page loads its content through JavaScript (like a React single-page app), you'll need to render it first with something like Playwright or Selenium, then pass the rendered HTML to Trafilatura. The library itself doesn't execute JavaScript.
The license changed from GPLv3+ to Apache 2.0 starting with version 1.8.0, which removed a significant adoption barrier for commercial projects [4].
Citations
1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131.
2. Adrien Barbaresi: Personal website. Retrieved March 4, 2026.
3. Trafilatura: Documentation. Retrieved March 4, 2026.
4. Trafilatura: PyPI package page. Retrieved March 4, 2026.
5. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023.
6. Trafilatura: Evaluation and benchmarks. Retrieved March 4, 2026.
Updated: March 4, 2026