Trafilatura: web content extraction with Python

Fetch any web page and look at the HTML source. You'll find the article text buried under layers of navigation menus, cookie banners, ad containers, sidebar widgets, and footer links. A typical news page might have 2,000 words of actual content inside 150KB of markup that has nothing to do with it. Just stripping all HTML tags doesn't work either — you end up with navigation labels mashed into article paragraphs.
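To make the tag-stripping problem concrete, here is a minimal sketch using only the standard library's html.parser. The sample page and the strip_tags helper are invented for illustration; nothing here comes from Trafilatura.

```python
from html.parser import HTMLParser

# Hypothetical sample page: a nav bar, one article paragraph, a footer.
SAMPLE_PAGE = """
<html><body>
  <nav><a href="/">Home</a><a href="/world">World</a><a href="/sport">Sport</a></nav>
  <article><p>The council approved the new budget on Tuesday.</p></article>
  <footer><a href="/privacy">Privacy</a><a href="/terms">Terms</a></footer>
</body></html>
"""


class TagStripper(HTMLParser):
    """Collects every text node, with no notion of what counts as content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep any non-blank text, wherever it appears in the page.
        if data.strip():
            self.chunks.append(data.strip())


def strip_tags(html: str) -> str:
    """Drop all markup and join the remaining text with spaces."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


naive_text = strip_tags(SAMPLE_PAGE)
print(naive_text)
# Home World Sport The council approved the new budget on Tuesday. Privacy Terms
```

The single article sentence ends up sandwiched between menu and footer labels, with nothing left to tell them apart.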

Trafilatura is a Python library that solves this. Give it raw HTML, get back the main text a human would actually read, plus metadata like title, author, date, and categories¹. The name comes from Italian — trafilatura means "wire drawing," the process of pulling raw material through a die to refine it. Fitting.

Contextractor uses Trafilatura as its extraction engine under the hood.

Origin

Adrien Barbaresi, a research scientist at the Berlin-Brandenburg Academy of Sciences (BBAW), built Trafilatura out of necessity². His work on the DWDS and ZDL digital lexicography projects required building large text corpora from web sources, and corpus linguists need clean structured text — not HTML soup — at scale.

He presented it formally at ACL 2021¹, and adoption spread quickly beyond academia. HuggingFace, IBM, Microsoft Research, Stanford, and the Allen Institute all run it in production pipelines³. The library passed 5,400 GitHub stars by early 2025, and version 2.0.0 shipped in December 2024⁴.

How extraction works

Figure: Trafilatura extraction pipeline showing tree pruning, content scoring, and fallback mechanisms

No machine learning involved in the core extraction. That's actually an advantage — a 2023 SIGIR paper benchmarking 14 extraction tools found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad"⁵.

The pipeline runs in stages. Raw HTML gets parsed into an lxml tree, then tree pruning strips elements that are almost never content — <nav>, <footer>, <aside>, known ad container classes, social widgets³. After that, remaining nodes go through content scoring: text density (text vs. markup ratio), link density (navigation blocks are mostly links), and element type classification.
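The two density signals can be illustrated with a small standard-library sketch. This is not Trafilatura's scoring code: the DensityCounter class, the densities helper, and the definition of text density as characters per tag are simplifying assumptions chosen to show the idea.

```python
from html.parser import HTMLParser
from typing import Tuple


class DensityCounter(HTMLParser):
    """Counts text characters overall, text inside <a> elements, and opening tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.link_chars = 0
        self.tag_count = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def densities(fragment: str) -> Tuple[float, float]:
    """Return (text_density, link_density) for an HTML fragment."""
    counter = DensityCounter()
    counter.feed(fragment)
    counter.close()
    # Text density: characters of text per opening tag (assumed definition).
    text_density = counter.text_chars / max(counter.tag_count, 1)
    # Link density: share of the text that sits inside links.
    link_density = counter.link_chars / max(counter.text_chars, 1)
    return text_density, link_density


# Hypothetical fragments: a menu and an article paragraph.
NAV_BLOCK = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
ARTICLE_BLOCK = (
    "<p>The council approved the budget after a long debate. "
    '<a href="/vote">Full vote</a></p>'
)

nav_text_density, nav_link_density = densities(NAV_BLOCK)
article_text_density, article_link_density = densities(ARTICLE_BLOCK)
print(nav_link_density)  # 1.0, every character of the menu is link text
print(article_link_density < 0.2, article_text_density > nav_text_density)
```

A real extractor combines signals like these with element types and position in the tree, but even this crude version separates the menu from the paragraph.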

The clever part is the fallback chain. If the initial extraction looks too short or noisy, Trafilatura tries readability-lxml (a Python port of Mozilla's Readability). Still not good enough? It falls back to jusText, a boilerplate removal library from Masaryk University¹. The outputs are compared, and the best result wins.
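The control flow of such a chain can be sketched generically. Everything here is an assumption for illustration: the extract_with_fallbacks function, the 250-character threshold, and the rule that the longest candidate wins are stand-ins, and Trafilatura's own comparison logic is more nuanced than this.

```python
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    html: str,
    primary: Extractor,
    fallbacks: List[Extractor],
    min_length: int = 250,
) -> Optional[str]:
    """Run the primary extractor; consult fallbacks only if its output is too short.

    "Best" here simply means the longest non-empty candidate, a stand-in
    for the more careful comparison a real extractor would make.
    """
    result = primary(html)
    if result and len(result) >= min_length:
        return result  # good enough, the fallbacks never run

    candidates = [result] if result else []
    for fallback in fallbacks:
        candidate = fallback(html)
        if candidate:
            candidates.append(candidate)
            if len(candidate) >= min_length:
                break  # stop at the first fallback that clears the bar
    return max(candidates, key=len) if candidates else None


# Hypothetical stand-ins for the three extractors named in the text.
def heuristic(html: str) -> Optional[str]:
    return "short"


def readability_like(html: str) -> Optional[str]:
    return "x" * 100


def justext_like(html: str) -> Optional[str]:
    return "y" * 300


best = extract_with_fallbacks("<html></html>", heuristic, [readability_like, justext_like])
print(len(best))  # 300
```

The point of the structure is that the cheaper path runs first and the slower libraries are only paid for when the first answer looks doubtful.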

Set fast=True to skip the fallbacks — roughly doubles throughput.

Output formats

Figure: Seven output formats from Trafilatura's extract function

Most extraction libraries give you plain text and maybe HTML. Trafilatura supports seven formats: plain text, Markdown (preserves headings and lists), cleaned HTML, XML with a custom schema, XML-TEI conforming to the Text Encoding Initiative standard, JSON with metadata fields, and CSV for batch work³.

The TEI output is what makes it popular in academic circles. TEI is the de facto standard for encoding texts in humanities research, and getting TEI-XML directly from a web scraper skips an entire post-processing step. You can validate the output against the TEI schema with tei_validation=True⁶.

Configuration

The extract() function exposes a practical set of parameters³:

include_comments — extract user comments (on by default)
include_tables — text from <table> elements (on by default)
include_links — preserve href targets
include_images — keep image URLs and alt text
favor_precision — aggressive filtering, less noise
favor_recall — wider net, more content captured
target_language — filter by ISO 639-1 code
prune_xpath — custom XPath to remove specific elements

Figure: How favor_precision and favor_recall affect extraction behavior

The precision/recall toggle is genuinely useful. Building a training dataset? favor_precision=True. Archiving web content? favor_recall=True.

Benchmarks

Figure: Bar chart comparing F1 scores of content extraction tools

Here's how Trafilatura compares in the ScrapingHub article extraction benchmark and the Bevendorff et al. SIGIR 2023 study⁵⁷:

Tool	Approach	F1 (ScrapingHub)	Notes
Trafilatura 2.0	Heuristic + fallbacks	0.958	Best mean across datasets
newspaper4k	Heuristic, news-focused	0.949	Successor to newspaper3k
readability (JS)	Heuristic (Mozilla)	0.947	Highest median in SIGIR study
readability-lxml	Heuristic (Python port)	0.922	Solid but lower mean
goose3	Heuristic	0.896	Article extraction focus
jusText	Heuristic	0.804	Good at boilerplate removal

The SIGIR benchmark, which combined eight evaluation datasets, found that Trafilatura had the best overall mean F1 (0.883), while Readability had the highest median (0.970)⁵. No single tool dominated every dataset, which is to be expected.
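For readers unfamiliar with the metric, here is a simplified token-level F1 and a toy illustration of how mean and median rankings can disagree. The token_f1 function is a generic sketch (the benchmarks cited use their own tokenization and matching rules), and the per-dataset scores are invented numbers, not results from either study.

```python
from collections import Counter
from statistics import mean, median


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a gold-standard reference."""
    extracted_tokens = Counter(extracted.lower().split())
    reference_tokens = Counter(reference.lower().split())
    # Multiset intersection: tokens shared by both texts.
    overlap = sum((extracted_tokens & reference_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(extracted_tokens.values())
    recall = overlap / sum(reference_tokens.values())
    return 2 * precision * recall / (precision + recall)


reference = "the council approved the new budget on tuesday"
noisy = "home world sport the council approved the new budget on tuesday"
print(round(token_f1(noisy, reference), 3))  # 0.842 (precision 8/11, recall 1.0)

# Invented per-dataset scores for two hypothetical tools.
steady_tool = [0.90, 0.90, 0.88, 0.86, 0.85]
spiky_tool = [0.97, 0.97, 0.96, 0.70, 0.60]
print(mean(steady_tool) > mean(spiky_tool))      # True: steadier tool wins on mean
print(median(spiky_tool) > median(steady_tool))  # True: spiky tool wins on median
```

A tool that is excellent on most datasets but stumbles badly on a few can top the median while trailing on the mean, which is the pattern the two headline numbers describe.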

Worth noting: newspaper3k went through long stretches of being unmaintained (newspaper4k picked up the torch), and boilerpipe is a Java library with aging Python bindings. Trafilatura ships regular releases, which matters when web markup patterns keep evolving.

Practical bits

Runs on Python 3.8+. pip install trafilatura and you're done — no browser dependency, everything runs through lxml and HTTP requests⁴. Extracting content from pre-downloaded HTML takes single-digit milliseconds.

There's a CLI for quick jobs:

trafilatura -u "https://example.com/article"

For batch work, it handles parallel URL processing and can discover URLs through sitemaps and RSS feeds — features most extraction-only libraries don't bother with³.

One thing to keep in mind: Trafilatura works on static HTML. JavaScript-rendered content (React SPAs, for instance) needs a headless browser like Playwright first — then pass the rendered HTML to Trafilatura. The library doesn't execute JS.

The license switched from GPLv3+ to Apache 2.0 at version 1.8.0, which removed a big adoption barrier for commercial use⁴.

Citations

1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131
2. Adrien Barbaresi: Personal website. Retrieved March 16, 2026
3. Trafilatura: Documentation. Retrieved March 16, 2026
4. Trafilatura: PyPI package page. Retrieved March 16, 2026
5. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023
6. TEI Consortium: TEI: Text Encoding Initiative. Retrieved March 16, 2026
7. Trafilatura: Evaluation and benchmarks. Retrieved March 16, 2026

Updated: March 16, 2026