Content extraction benchmark 2026
Pick any two developers arguing about content extraction tools and you'll hear the same pattern: each one swears by a different library, usually whichever they tried first. The problem is that most comparisons are either vendor marketing or quick blog posts that test against five news articles and call it a day.
There are real benchmarks, though. The Bevendorff et al. SIGIR 2023 study tested 14 extractors across eight datasets with proper ground truth [1]. The Sandia National Laboratories report from August 2024 ran an independent evaluation of Python and Java extraction libraries [2]. And ScrapingHub maintains an open article extraction benchmark with 460 annotated web pages [3]. I'm pulling numbers from all three, plus vendor-reported metrics where independent data doesn't exist yet.
Fair warning: some tools here solve different problems. Comparing Firecrawl (a hosted API that also crawls and renders JS) to jusText (a boilerplate removal function) on the same F1 axis is a bit like comparing a Swiss Army knife to a scalpel. The numbers are still useful — you just need the context around them.
The tools
Quick rundown of what's being compared, because the landscape has gotten genuinely confusing.
Trafilatura — Python heuristic extractor by Adrien Barbaresi. Tree pruning, content scoring, and a fallback chain through readability-lxml and jusText. Version 2.0.0 shipped December 2024 under Apache 2.0 [4]. This is what Contextractor runs under the hood.
Mozilla Readability — The JavaScript engine behind Firefox Reader View. Hand-crafted rules, optimized for article pages. Has the highest median F1 in the SIGIR study (0.970) — meaning it's extremely consistent on pages it handles well [1].
newspaper4k — Python library for news article extraction. Fork of the long-abandoned newspaper3k (unmaintained since around 2020), picked up by Andrei Paraschiv. Current version 0.9.5 [5]. Strong on news sites, weak on everything else.
goose3 — Python 3 port of Gravity Labs' Goose extractor, originally written in Java and later Scala. Focused on article body extraction with image detection [6]. Decent on news, struggles with non-article pages.
Jina ReaderLM-v2 — A 1.5-billion-parameter language model fine-tuned on Qwen2.5-1.5B-Instruct for HTML-to-Markdown conversion [7]. Released January 15, 2025. Handles 512K tokens of combined input/output. This is the neural approach — actually reads the HTML and generates clean output.
Firecrawl — Hosted API that crawls, renders JavaScript, and converts pages to Markdown or structured JSON [8]. Not open-source in the traditional sense (the scraping infrastructure is proprietary), but it does have an open-source self-hosted option. Positioned for AI/RAG pipelines.
Crawl4AI — Open-source Python crawler that hit 58,000 GitHub stars in under a year [9]. Includes built-in extraction, anti-bot detection, and LLM-friendly output. Still in beta (v0.8.x), but the community momentum is hard to ignore.
jusText — Boilerplate removal library from Jan Pomikalek at Masaryk University [10]. Uses stop-word frequency to distinguish natural language paragraphs from navigation and template text. Not an article extractor per se — it's a building block that other tools (including Trafilatura) use internally.
readability-lxml — Python port of Mozilla's Readability algorithm using lxml. Not maintained by Mozilla — it's a community reimplementation. Slightly lower F1 than the original JS version, which I suspect comes down to edge cases in the DOM parsing [3].
Contextractor — That's us. Wraps Trafilatura with additional pre-processing (JavaScript rendering via Playwright) and post-processing (format conversion, metadata enrichment). The extraction core is Trafilatura, so the pure extraction scores are inherited — but the pipeline adds latency.
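Since most of these are Python libraries, it's worth seeing what the minimal invocation looks like for the Trafilatura core that Contextractor wraps. A sketch, assuming trafilatura is installed (pip install trafilatura); the URL is a placeholder:

```python
import trafilatura

# Fetch raw HTML (static only; no JavaScript rendering at this layer)
html = trafilatura.fetch_url("https://example.com/some-article")

# Extract the main content; returns None when nothing qualifies.
# favor_precision trades a little recall for cleaner output.
text = trafilatura.extract(html, favor_precision=True)

# Metadata (title, author, date) comes from a separate call
metadata = trafilatura.extract_metadata(html)
```

The same `extract()` call works on any HTML string you've already fetched, which is how Contextractor feeds it Playwright-rendered pages.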
Methodology
Three benchmark sources, combined:
ScrapingHub Article Extraction Benchmark — 460 web pages with manually annotated ground-truth article text. Measures precision, recall, and F1 on text extraction [3]. This is the benchmark where most library maintainers report their numbers, so it's the closest thing to an industry standard.
Bevendorff et al. SIGIR 2023 — Academic study that aggregated eight evaluation datasets (including ScrapingHub, CleanEval, and Google Trends) and tested 14 extraction tools. Reported mean and median F1 across all datasets [1]. The paper's finding that heuristic extractors outperform neural models was probably the most cited result in this space in 2023.
Sandia National Laboratories SAND2024-10208 — Independent government evaluation of Python and Java MCE libraries, published August 2024 by Madeline Reeve. Found Trafilatura at mean F1 of 0.937 and Readability at 0.914, with no statistically significant difference between the two (p > 0.05 on a two-sample t-test) [2].
For tools without independent benchmark data (Firecrawl, Crawl4AI, ReaderLM-v2), I'm using vendor-reported metrics. Take those numbers with a larger grain of salt. Jina reports ROUGE-L rather than F1, which measures different things — ROUGE-L captures longest common subsequence overlap, so it's sensitive to ordering and formatting in ways that token-level F1 isn't.
What F1 actually measures here
Quick refresher since people mix these up constantly. Precision is the fraction of extracted text that was actually content (low precision = you're including navigation junk). Recall is the fraction of actual content that you extracted (low recall = you're cutting real paragraphs). F1 is the harmonic mean of both.
A tool with 0.99 recall and 0.70 precision extracts everything — including the sidebar. A tool with 0.99 precision and 0.70 recall gives you only clean content, but misses chunks of the article. Most heuristic extractors lean toward high recall; the hard part is precision.
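To make the definitions concrete, here's a minimal bag-of-words sketch of how token-level precision, recall, and F1 are computed against a gold-standard text. The real benchmark harnesses add normalization and smarter tokenization, so treat this as an illustration of the math, not the exact scoring code:

```python
from collections import Counter

def token_prf(extracted: str, gold: str) -> tuple[float, float, float]:
    """Bag-of-words precision, recall, and F1 against a gold-standard text."""
    ex, gd = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((ex & gd).values())               # tokens credited to both
    precision = overlap / max(sum(ex.values()), 1)  # clean fraction of output
    recall = overlap / max(sum(gd.values()), 1)     # captured fraction of gold
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

High recall with low precision means navigation junk leaked through; high precision with low recall means real paragraphs were dropped.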
The numbers
| Tool | F1 (ScrapingHub) | Precision | Recall | Mean F1 (SIGIR) |
|---|---|---|---|---|
| Trafilatura 2.0 | 0.958 | 0.938 | 0.978 | 0.883 |
| newspaper4k | 0.949 | 0.925 | 0.966 | 0.816 |
| Readability (JS) | 0.947 | 0.914 | 0.982 | 0.861 |
| readability-lxml | 0.922 | 0.913 | 0.931 | — |
| Contextractor | 0.909 | 0.914 | 0.904 | — |
| goose3 | 0.896 | 0.934 | 0.690 | 0.810 |
| Firecrawl | ~0.87 | — | — | — |
| ReaderLM-v2 | 0.84* | — | — | — |
| jusText | 0.804 | 0.865 | 0.650 | 0.759 |
| Crawl4AI | ~0.78 | — | — | — |
*ReaderLM-v2 score is ROUGE-L, not F1. Direct comparison should be interpreted cautiously.
The gap between Trafilatura and the pack is smaller than it looks. A 0.958 vs. 0.947 difference on 460 pages means Trafilatura extracted a few more paragraphs correctly than Readability did. On a single article, you'd likely not notice.
Where the real separation appears is across the SIGIR datasets. Trafilatura's mean F1 of 0.883 across eight diverse datasets drops below its ScrapingHub score — but it drops less than others. newspaper4k falls from 0.949 on ScrapingHub to 0.816 on the SIGIR aggregate. That's the generalization penalty for news-focused tools hitting forums, documentation pages, and government sites.
One thing that struck me going through the Bevendorff data: Readability's median F1 of 0.970 is actually higher than Trafilatura's. The median. That means on the pages where Readability works, it works extremely well — better than anything else. It just has a longer tail of failure cases that drag the mean down.
Performance by page type
This is where the picture gets interesting. Aggregate F1 hides an enormous amount of variation.
News articles
Every tool does well here. News pages have a predictable structure — headline, byline, article body, maybe an image gallery. Even the simple <article> tag is usually present. Readability hits 0.95, Trafilatura 0.94, newspaper4k 0.93. The differences at this level are noise.
goose3 was designed for exactly this use case (Gravity Labs built it for their content aggregation platform), and it shows with a 0.91. jusText's 0.81 is the outlier — it's not an article extractor, it's a boilerplate classifier, so it sometimes clips headlines or image captions that a news-focused tool would keep.
Documentation and wiki pages
This is where things start to diverge. Documentation pages have long code blocks, nested navigation, API reference tables, and sidebar TOCs — all things that confuse tools designed around the "article body" mental model.
ReaderLM-v2 shines here (0.90). I think this is where the neural approach has a genuine advantage: it can "understand" that a code block inside the main content area is part of the documentation, not boilerplate. Heuristic tools sometimes strip code blocks because they have low text density and high markup-to-content ratios.
newspaper4k drops to 0.76. It's looking for a news article and not finding one.
Forums and discussion threads
Forums are brutal for extractors. Multiple authors per page, quoted text nested inside replies, signature blocks, moderator notices, pagination links interleaved with content. What counts as "the content" on a forum thread? The original post? All replies? Just the ones with substance?
The SIGIR benchmark treats the full thread as content, which rewards high recall. Firecrawl's 0.82 here is interesting — its JavaScript rendering means it catches dynamically loaded replies that DOM-based tools miss entirely. Crawl4AI similarly benefits from its Playwright integration.
goose3 drops to 0.65 on forums. It's essentially looking for a single article block and finding multiple author blocks instead.
E-commerce product pages
The hardest category for almost everyone. Product pages are mostly structured data — price, specs, reviews, related products — with very little running prose. The "main content" might be a three-sentence product description buried between a photo carousel and a specifications table.
Firecrawl does surprisingly well at 0.83, likely because its extraction pipeline was built with RAG use cases in mind, where product data is a common target. ReaderLM-v2 holds up at 0.80 for similar reasons — LLM-based approaches handle the ambiguity better than rigid heuristics.
goose3 and jusText collapse (0.58 and 0.62 respectively). newspaper4k at 0.64 isn't much better. These tools weren't designed for this.
Token reduction
If you're feeding extracted text to an LLM, the raw F1 isn't the only metric that matters. How much did the extraction shrink the input?
A typical news article page might be 120KB of HTML — roughly 30,000 tokens by GPT-4's tokenizer. After extraction, you're down to maybe 2,000-4,000 tokens of clean text. That's a 7-15x reduction, which translates directly to API cost savings and faster inference.
| Tool | Median token reduction | Notes |
|---|---|---|
| jusText | 12-18x | Aggressive — sometimes too aggressive |
| Trafilatura (favor_precision) | 10-15x | Configurable precision/recall tradeoff |
| Readability | 8-12x | Keeps more structure (headings, lists) |
| ReaderLM-v2 | 6-10x | Preserves Markdown formatting |
| Firecrawl | 5-8x | Returns structured Markdown with metadata |
The token reduction ratio matters for LLM pipelines — and it's orthogonal to F1. jusText gives you the smallest output but also the lowest F1. Trafilatura with favor_precision=True hits a sweet spot for most RAG applications.
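As a back-of-the-envelope check on these ratios, here's a sketch using the rough four-bytes-per-token heuristic for English text. The 4.0 figure is an assumption, not any specific tokenizer's real behavior — actual counts vary by model and by how markup-heavy the input is:

```python
def est_tokens(text: str, bytes_per_token: float = 4.0) -> float:
    """Crude token estimate: ~4 UTF-8 bytes per token for English prose."""
    return len(text.encode("utf-8")) / bytes_per_token

def reduction_ratio(raw_html: str, extracted: str) -> float:
    """How much smaller the LLM input gets after extraction."""
    return est_tokens(raw_html) / max(est_tokens(extracted), 1.0)
```

A 120 KB page reduced to a 10 KB extraction comes out to 12x, squarely in the ranges above.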
What each tool is actually good at
After staring at these numbers for a while, here's my honest take on when to use what:
Trafilatura — Best all-rounder. If you need a single library that handles news, docs, forums, and weird government sites with reasonable quality across all of them, this is it. The fallback chain is genuinely clever design. More details in the Trafilatura deep-dive.
Readability (JS or lxml) — If you're extracting news articles and blog posts and nothing else, Readability's consistency is hard to beat. The 0.970 median F1 means it almost never fails on article pages. Use the JS version if you can; the Python port is good but slightly behind.
newspaper4k — Good if you're specifically scraping news sites and need metadata extraction (author, publish date, top image) along with the article body. Don't use it for non-news content.
ReaderLM-v2 — The only tool here that generates Markdown with formatting preserved. If you need headings, lists, and code blocks in the output (not just flat text), ReaderLM-v2 is the way to go. The 1.5B model runs on a single GPU. Downside: it's 100-1000x slower than heuristic extractors.
Firecrawl — Handles JavaScript-rendered pages out of the box, which is a significant advantage if you're dealing with SPAs. The extraction quality is decent but not best-in-class; you're paying for the crawling infrastructure and JS rendering. It's an API, not a library — so you're sending your URLs to their servers (or self-hosting).
Crawl4AI — If you need crawling + extraction + LLM integration as a single package and don't mind beta-quality software, Crawl4AI has momentum. The extraction itself isn't class-leading, but the developer experience and the community are strong.
goose3 — Honestly hard to recommend in 2026. It does article extraction adequately, but Trafilatura and newspaper4k do it better, and goose3 hasn't seen a meaningful update in a while.
jusText — Not a general-purpose extractor. Use it as a building block, or when you specifically need boilerplate classification (e.g., filtering web crawl data for corpus building). Trafilatura already includes it as a fallback, so if you're using Trafilatura, you're getting jusText's benefits automatically.
Ensemble approaches
The Bevendorff SIGIR paper tested ensemble methods — combining multiple extractors through weighted voting — and found the best ensemble hit a mean F1 of 0.912, compared to 0.883 for Trafilatura alone [1]. That's a meaningful improvement, but it comes at the cost of running three or four extractors per page.
Trafilatura's internal fallback chain (Trafilatura primary, readability-lxml fallback, jusText last resort) is a lightweight version of this idea. It's not a full voting ensemble, but it captures most of the benefit with a fraction of the compute.
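A hand-rolled version of that chain might look like the following sketch, assuming trafilatura, readability-lxml, justext, and lxml are installed. The `min_chars` threshold and the function name are my own illustration choices, not Trafilatura's actual internals:

```python
import justext
import lxml.html
import trafilatura
from readability import Document

def extract_with_fallbacks(html: str, min_chars: int = 250) -> str | None:
    # 1. Trafilatura's own extractor first
    text = trafilatura.extract(html)
    if text and len(text) >= min_chars:
        return text
    # 2. readability-lxml: summary() returns cleaned HTML, so flatten to text
    cleaned = Document(html).summary()
    text = lxml.html.fromstring(cleaned).text_content().strip()
    if len(text) >= min_chars:
        return text
    # 3. jusText last: keep only paragraphs classified as real content
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    return text or None
```

The length gate is the crude part — Trafilatura's real chain uses richer sanity checks before falling through.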
I'd be curious to see someone benchmark an ensemble that includes ReaderLM-v2 alongside heuristic tools. The neural approach catches different failure modes than the heuristic ones, so the combination might be stronger than either alone. Nobody's published that comparison yet, as far as I can tell.
What these benchmarks don't capture
A few things to keep in mind when making decisions based on these numbers:
Speed — Trafilatura extracts content from pre-downloaded HTML in single-digit milliseconds. ReaderLM-v2 needs GPU inference. Firecrawl and Crawl4AI add network latency because they also handle fetching. If you're processing millions of pages, the 500x speed difference between Trafilatura and ReaderLM-v2 matters more than a 0.05 F1 gap.
JavaScript rendering — Trafilatura, newspaper4k, goose3, jusText, and readability-lxml all work on static HTML. If the page requires JavaScript to render (React apps, infinite scroll, dynamically loaded content), you need a browser layer first. Firecrawl and Crawl4AI include this; for the others, you'd pair them with Playwright or Puppeteer.
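For the static-HTML tools, the usual pattern is to let a headless browser produce the post-render DOM and hand that string to the extractor. A sketch with Playwright's sync API and Trafilatura, assuming both packages are installed plus a browser via `playwright install chromium`:

```python
import trafilatura
from playwright.sync_api import sync_playwright

def extract_rendered(url: str) -> str | None:
    """Render a JS-heavy page, then run a static-HTML extractor on the result."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for dynamic content
        html = page.content()                      # post-render DOM snapshot
        browser.close()
    return trafilatura.extract(html)
```

This is essentially the Contextractor pre-processing step described earlier — and it's where most of the pipeline's added latency comes from.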
Output format — Most benchmarks measure text extraction accuracy. They don't capture whether the output preserves document structure. ReaderLM-v2 outputs Markdown with headings and lists; Trafilatura supports seven output formats including TEI-XML; Readability gives you cleaned HTML. For LLM preprocessing, format matters as much as accuracy.
Maintenance — newspaper3k was abandoned for years before newspaper4k forked it. goose3's last meaningful release was 2020. Trafilatura ships regular updates (latest: 2.0.0 in December 2024). When web markup patterns evolve — and they do, constantly — unmaintained extractors accumulate silent failures.
Citations
1. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023.
2. Sandia National Laboratories / Madeline D. Reeve: An Evaluation of Main Content Extraction Libraries. SAND2024-10208, August 2024.
3. ScrapingHub: Article Extraction Benchmark. Retrieved March 27, 2026.
4. Trafilatura: Documentation. Retrieved March 27, 2026.
5. newspaper4k: PyPI package page. Retrieved March 27, 2026.
6. goose3: GitHub repository. Retrieved March 27, 2026.
7. Jina AI: ReaderLM-v2: Frontier Small Language Model for HTML to Markdown and JSON. Retrieved March 27, 2026.
8. Firecrawl: Documentation. Retrieved March 27, 2026.
9. Crawl4AI: GitHub repository. Retrieved March 27, 2026.
10. Jan Pomikalek: Removing Boilerplate and Duplicate Content from Web Corpora. PhD dissertation, Masaryk University, 2011.
Updated: March 25, 2026