Heuristic vs. ML-powered extraction — Trafilatura vs. Jina ReaderLM

Two tools that solve roughly the same problem — getting useful text out of HTML — but with architectures that couldn't be more different. Trafilatura runs a cascade of hand-tuned heuristics with fallback algorithms. Jina ReaderLM-v2 throws a 1.54-billion-parameter transformer at it and lets the model figure out what's content and what's noise.

The interesting question isn't which one is "better." It's when each approach falls apart.

How Trafilatura works

I've covered this in depth in the Trafilatura article, but the short version: it's a multi-stage pipeline with no machine learning anywhere in the core extraction path.

[Figure: Heuristic pipeline vs. neural model architecture — Trafilatura's cascade compared with ReaderLM-v2's transformer]

Raw HTML gets parsed into an lxml tree. Tree pruning strips elements that almost never contain article content — <nav>, <footer>, <aside>, known ad container class names. Then content scoring kicks in: each remaining node gets rated on text density (how much actual text versus markup) and link density (navigation blocks are overwhelmingly links; article paragraphs aren't)[1].

The clever bit is what happens when the initial result looks wrong. If the output seems too short or too noisy, Trafilatura falls back to readability-lxml — a Python port of Mozilla's Readability algorithm. Still not great? It tries jusText, a boilerplate removal tool from Masaryk University[2]. The outputs get compared, and the best one wins.

Set fast=True to skip the fallbacks entirely. You lose some robustness but roughly double throughput.

How ReaderLM-v2 works

Jina AI released the first generation of Reader-LM in September 2024 — two small models (0.5B and 1.5B parameters) trained specifically on HTML-to-Markdown conversion[3]. That first version had a nasty degeneration problem: after generating long sequences, it would start repeating tokens or looping through short patterns until hitting the max output length.

ReaderLM-v2 shipped in January 2025 as the fix. Built on Qwen2.5-1.5B-Instruct as the base model, it has 28 transformer layers, 12 query heads, 2 KV heads, and handles up to 512K tokens of combined input and output[4]. The key training innovation was adding contrastive loss to discourage repetitive token representations — and it works. Performance stays consistent regardless of how many tokens have already been generated[5].

The training pipeline is interesting on its own. Jina built a dataset called html-markdown-1m — roughly one million HTML documents averaging 56,000 tokens each. But instead of relying on rule-based HTML-to-Markdown converters to create ground truth (which would bake in the limitations of those converters), they used a three-stage synthetic data approach: draft, refine, critique — all driven by Qwen2.5-32B-Instruct[5]. The model then went through long-context pretraining, supervised fine-tuning, direct preference optimization, and self-play reinforcement tuning.

That's a lot of moving parts for a "small" model.

ReaderLM-v2 supports 29 languages and can output both Markdown and structured JSON (with schema enforcement). The JSON capability is new — you provide a JSON schema and the model extracts matching fields from HTML, hitting a 98% pass rate for valid JSON conforming to the schema[5].

The benchmark problem

Comparing these two is trickier than it looks, because they don't quite solve the same task.

Trafilatura does content extraction — it identifies which parts of a page are the main article and discards everything else. The output is clean text (or Markdown, XML, etc.) of just the content a human came to read. Its benchmarks measure extraction quality: did it correctly separate content from boilerplate? The ScrapingHub article extraction benchmark and the SIGIR 2023 study by Bevendorff et al. both evaluate this[6].

ReaderLM-v2 does HTML-to-Markdown translation — it takes HTML (which might already be clean or might still contain boilerplate) and converts it into well-formatted Markdown. Its benchmarks measure conversion quality: how faithfully does the Markdown reproduce the original content's structure, headings, tables, and formatting?

These aren't the same thing. An extractor that perfectly identifies article content but outputs plain text with no formatting would score high on extraction benchmarks and terribly on Markdown fidelity benchmarks. A converter that perfectly translates all HTML to Markdown — including the navigation menu and footer — would score well on conversion benchmarks but badly on extraction ones.

That said, ReaderLM-v2 does have a "main content" extraction mode. And in practice, people use both tools for the same downstream purpose: getting clean, structured text out of web pages, typically for LLM pipelines or RAG systems.

Numbers

Here's what the benchmarks actually show, keeping the caveat above in mind:

| Metric | Trafilatura 2.0 | ReaderLM-v2 | ReaderLM-v2-pro |
| --- | --- | --- | --- |
| F1 (ScrapingHub) | 0.958 | — | — |
| F1 (SIGIR 2023, mean) | 0.883 | — | — |
| ROUGE-L (main content) | — | 0.84 | 0.86 |
| Jaro-Winkler | — | 0.82 | 0.83 |
| WER (lower is better) | — | 0.62 | 0.39 |
| Processing speed | ~2-10 ms/page | ~36 tokens/s output | ~36 tokens/s output |
| Hardware | CPU only | GPU (T4 minimum) | GPU (T4 minimum) |

The F1 and ROUGE-L numbers aren't directly comparable (different benchmarks, different evaluation criteria), but they give a sense of the ballpark. Trafilatura's 0.958 F1 on ScrapingHub is essentially "gets it right on almost every page in the dataset." ReaderLM-v2's 0.84 ROUGE-L on main content extraction means it captures most of the content with good structural fidelity, but there's room for improvement[5].
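
For readers unfamiliar with WER: it's word-level Levenshtein distance normalized by reference length, so lower is better. A minimal stdlib sketch:

```python
# Word error rate: edit distance over word sequences, divided by the
# number of words in the reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick fox"))  # 0.25
```

One dropped word out of a four-word reference gives 0.25, so ReaderLM-v2's 0.62 on real pages still reflects substantial divergence from the reference Markdown.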

For context, GPT-4o scores 0.69 ROUGE-L on the same Jina benchmark, and Qwen2.5-32B-Instruct gets 0.71[3]. A 1.5B model beating 32B models at a specific task is genuinely notable — though it says more about task-specific fine-tuning than about raw capability.

[Figure: Speed vs. accuracy positioning of different extraction tools]

Speed and resource requirements

This is where the gap gets dramatic.

Trafilatura processes a page in single-digit milliseconds on a regular CPU. No GPU. No model weights to load. pip install trafilatura, feed it HTML, get text back. Memory footprint is negligible — lxml parsing plus some string manipulation. You can extract thousands of pages per second on a single machine[7].

ReaderLM-v2 needs a GPU. On a free-tier Colab T4, it manages 67 tokens/s input and 36 tokens/s output[4]. A typical web page might be 5,000-50,000 tokens of HTML, so processing a single page can take anywhere from seconds to minutes depending on length. The T4 also lacks bfloat16 and Flash Attention 2 support, so production deployments on RTX 3090/4090 cards would be faster — but still orders of magnitude slower than heuristic extraction.

For batch processing of web corpora — Common Crawl's billions of pages, or even a few million pages for a training dataset — this speed difference isn't academic. It's the difference between "run overnight on a laptop" and "spin up a GPU cluster for a week."
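
The arithmetic is easy to sanity-check. Assuming the T4 figures quoted above, plus a hypothetical 2,000-token Markdown output for a 20,000-token page:

```python
# Back-of-envelope per-page latency for ReaderLM-v2 on a T4.
# Output length is an assumption: Markdown is usually far shorter
# than the source HTML.
def seconds_per_page(html_tokens: int, output_tokens: int,
                     input_tps: float = 67.0,
                     output_tps: float = 36.0) -> float:
    return html_tokens / input_tps + output_tokens / output_tps

t = seconds_per_page(20_000, 2_000)
print(round(t))  # 354 seconds, vs. milliseconds for Trafilatura
```

Nearly six minutes per page makes the "GPU cluster for a week" framing above look conservative for corpus-scale work.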

Where heuristics win

The SIGIR 2023 study is the most thorough independent comparison of content extraction approaches to date. Bevendorff et al. tested 14 extractors across eight datasets and concluded that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad"[6].

That's a strong statement. It was published before ReaderLM-v2 existed, so it doesn't include Jina's model specifically, but the finding held across every neural model they tested.

Why? Heuristics like text density and link density are structural properties of HTML. They don't depend on the language of the text, the topic of the page, or patterns the model happened to see during training. A news article in Thai and a government report in Finnish share the same structural characteristics: low link density in content blocks, high link density in navigation. Heuristics pick up on that signal regardless[6].
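
Link density is simple enough to compute in a few lines. A stdlib-only sketch of the signal (real extractors work on parsed trees and weigh several features together):

```python
# Link density: fraction of a block's visible text that sits inside
# <a> tags. Navigation blocks score near 1.0; article text scores low.
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_a = 0           # depth of nested <a> tags
        self.total = 0          # characters of text overall
        self.linked = 0         # characters of text inside links
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            self.in_a -= 1
    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.in_a:
            self.linked += n

def link_density(html: str) -> float:
    p = LinkDensity()
    p.feed(html)
    return p.linked / p.total if p.total else 0.0

nav = '<nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>'
para = ('<p>A long paragraph of article text with one '
        '<a href="#">link</a> inside it.</p>')
print(link_density(nav))   # 1.0: pure navigation
print(link_density(para))  # well under 0.5: real content
```

Nothing about this depends on the language or topic of the page, which is exactly why the signal travels so well.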

Neural models, by contrast, can overfit to the distribution of their training data. Show a model page structures it hasn't encountered before — unusual CMS templates, legacy HTML, pages that don't follow common conventions — and accuracy drops.

Trafilatura's fallback chain compounds this robustness. If its own heuristic gets confused, readability-lxml takes a different approach, and jusText takes yet another. Three algorithms with different failure modes rarely all fail on the same page.

Where ML wins

Heuristics are brittle in a different way. They follow rules, and rules break on edge cases that don't fit the patterns the rules were designed for.

Tables and complex structures — Trafilatura can extract text from tables (with include_tables=True), but the output is flat text. ReaderLM-v2 generates actual Markdown tables with aligned columns, merged cells handled reasonably, and proper pipe syntax. For HTML-to-Markdown conversion where structure matters, this is a real advantage[3].

Nested lists and code blocks — Similarly, deeply nested lists and code fences come out as proper Markdown from ReaderLM-v2. Heuristic extractors tend to flatten these.

LaTeX and mathematical notation — ReaderLM-v2 preserves LaTeX formulas (both inline and display). A heuristic extractor would just see raw text and dollar signs.

Format fidelity — If your downstream task cares about headings, emphasis, link targets, and document structure — not just the raw text — a model trained specifically to produce well-formatted Markdown has an inherent edge over a tool designed to separate content from boilerplate.

Novel page layouts — A well-trained model can potentially generalize to page structures it hasn't seen exactly before, because it's learned abstract patterns about content versus chrome. Heuristics only know the rules they've been given. (Though in practice, the SIGIR results suggest this advantage is smaller than you'd expect.)
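
To make the table-fidelity point concrete, here's a toy converter that handles only the simplest <tr>/<th>/<td> tables; the colspan, nested-markup, and malformed cases it can't handle are precisely where a trained model earns its keep:

```python
# Naive HTML-table-to-Markdown converter (illustrative only): first
# row becomes the header, every cell is plain text.
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = None
    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

def table_to_markdown(html: str) -> str:
    p = TableToMarkdown()
    p.feed(html)
    header, *body = p.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

doc = ("<table><tr><th>Tool</th><th>F1</th></tr>"
       "<tr><td>Trafilatura</td><td>0.958</td></tr></table>")
print(table_to_markdown(doc))
```

A flat-text extractor reduces the same input to "Tool F1 Trafilatura 0.958", losing the grid entirely.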

Failure modes

Both tools fail. They just fail differently.

Trafilatura fails on JavaScript-rendered content — it processes static HTML only, so React SPAs or pages that load content dynamically require a headless browser first. It can also struggle with pages where the "content" doesn't follow standard patterns: product pages, interactive tools, pages that are mostly images with captions. The fallback chain helps, but if all three algorithms get confused by the same page structure, you get garbage or nothing.

ReaderLM-v2 has its own issues. The first-generation model had severe repetition problems; v2 fixes most of this with contrastive loss, but users have still reported hallucinations on some pages[8]. The model can generate Markdown that looks plausible but doesn't accurately reflect the source HTML — an inherent risk of any generative approach. On Jina's own qualitative evaluation (manual scoring of 10 HTML pages), GPT-4o actually scored higher than ReaderLM-v2 on structural accuracy and format compliance[5]. And there's the licensing issue: ReaderLM-v2 is CC BY-NC 4.0, which means no commercial use without a separate agreement[4].

Practical decision matrix

| Factor | Trafilatura | ReaderLM-v2 |
| --- | --- | --- |
| Batch corpus building | Strong choice | Impractical at scale |
| RAG pipeline preprocessing | Good default | Overkill for most pages |
| Faithful Markdown output | Adequate | Better structural fidelity |
| Table preservation | Basic (flat text) | Good (Markdown tables) |
| Multilingual content | Works (language-agnostic) | Works (29 languages) |
| No GPU available | Only option | Not an option |
| Commercial license | Apache 2.0 since v1.8.0 | CC BY-NC 4.0 (restricted) |
| JSON schema extraction | Not supported | Built-in, 98% pass rate |

I'd reach for Trafilatura as the default for most content extraction pipelines. It's fast, accurate on the standard case, runs anywhere, and the Apache 2.0 license doesn't restrict commercial use[9]. The fallback chain handles most of the edge cases that trip up simpler heuristic tools.

ReaderLM-v2 makes sense when you specifically need structured Markdown output with formatting preserved — tables, nested lists, headings, code blocks — and you have GPU resources available. The JSON extraction capability is also unique. But for extracting article text from web pages at scale, the heuristic approach is still the pragmatic choice.

That the SIGIR researchers found neural models "surprisingly bad" at content extraction should give anyone pause before reaching for a transformer where a heuristic will do. ML isn't magic, and for a task with strong structural signals in the input, hand-crafted rules still win more often than the ML hype would suggest.

Citations

  1. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131

  2. Jan Pomikalek: Removing Boilerplate and Duplicate Content from Web Corpora. PhD dissertation, Masaryk University, 2011

  3. Jina AI: ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON. Retrieved March 27, 2026

  4. Jina AI: ReaderLM-v2 Model Card. Hugging Face. Retrieved March 27, 2026

  5. Jina AI: ReaderLM-v2: Small Language Model for HTML to Markdown and JSON. arXiv:2503.01151, 2025

  6. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023

  7. Trafilatura: Documentation. Retrieved March 27, 2026

  8. Ollama: reader-lm - heavy hallucinations?. GitHub issue, 2024

  9. Trafilatura: PyPI package page. Retrieved March 27, 2026

Updated: March 25, 2026