Web content extraction for LLMs
Feed a raw web page into an LLM and watch what happens.
The model receives 223,000 tokens of HTML — scripts, stylesheets, navigation menus, ad containers, tracking pixels, cookie consent banners, footer links, JSON-LD schema blocks — and somewhere in the middle, the 1,800 tokens of article text you actually wanted. According to HTTP Archive data, the median web page weighs 870KB of HTML source1. That's roughly 890,000 characters. Most of it is noise.
This isn't an abstract problem. Every token you send to an LLM costs money, consumes context window space, and — critically — dilutes the signal your model is trying to work with. If you're building a RAG pipeline, the quality of your extracted text determines the quality of your retrieval. If you're assembling training data, noise in your corpus becomes noise in your model's weights.
Content extraction is the process of pulling the actual article text out of that HTML mess. For a primer on how extractors work at a mechanical level — DOM tree pruning, block scoring, fallback chains — see the content extraction fundamentals. This guide focuses specifically on the LLM angle: why extraction matters for AI workloads, what methods exist, and how to pick the right one.
The token waste problem
Consider what happens to a typical news article at each stage of extraction.
The numbers are striking. Raw HTML tokenizes to roughly 223,000 tokens for a page whose actual content is about 1,800 tokens. That's a 99.2% waste rate. Even after stripping all JavaScript and CSS (which removes about 96% of the bytes), you still have ~9,000 tokens of markup — navigation bars, sidebars, related article links, footer legalese — surrounding the content1.
Why does this matter beyond cost?
Context window pollution. When a retrieval system stuffs raw or poorly-cleaned HTML into an LLM's context, the model has to figure out what's content and what's chrome. It sometimes gets confused. Navigation labels bleed into answers. Boilerplate text from cookie banners shows up in summaries. I've seen RAG systems confidently cite sidebar "Related Articles" text as if it were the main content, because from the model's perspective, it's all just tokens in the window.
Retrieval degradation. Vector embeddings generated from noisy text produce noisy retrieval. If your chunk includes three paragraphs of article text and two paragraphs of "Recommended For You" links, the embedding captures a muddled signal. Your cosine similarity scores drop. The wrong chunks get retrieved. The downstream answer quality tanks — and you might not even notice, because the system still produces fluent text.
Cost at scale. At GPT-4o pricing of $2.50 per million input tokens, a single raw page costs about $0.56 to process. After extraction, the same page costs about $0.00452. The difference is negligible for a single query, but for a production pipeline processing 10,000 pages a month, it's the difference between a $5,600 bill and a $45 one.
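The arithmetic behind those figures is worth writing down. A minimal sketch, using the token counts and pricing quoted above (`page_cost` is an illustrative helper, not a real API):

```python
# Back-of-envelope input-cost check, assuming GPT-4o pricing of
# $2.50 per million input tokens and the token counts quoted above.
PRICE_PER_M_TOKENS = 2.50

def page_cost(tokens: int) -> float:
    """Input cost in dollars for a single page of the given token count."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

raw = page_cost(223_000)   # raw HTML: ~$0.5575 per page
clean = page_cost(1_800)   # extracted text: ~$0.0045 per page

# At 10,000 pages a month, this is the $5,600-vs-$45 gap.
print(f"raw: ${raw * 10_000:,.0f}/mo, extracted: ${clean * 10_000:,.0f}/mo")
```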
The spectrum of extraction approaches
Not all extraction is created equal. The methods range from dead simple to surprisingly sophisticated, and the right choice depends on your constraints.
Regex stripping
The cheapest possible approach: strip all HTML tags with a regular expression, maybe collapse whitespace. Something like re.sub(r'<[^>]+>', '', html) gets you there in a single line.
It's fast — sub-millisecond — and it's terrible. You lose all structure. Navigation text mashes into article text. Script content that wasn't in <script> tags (inline event handlers, template literals) bleeds through. Table data becomes an unreadable string. There's no way to distinguish a <p> in the main article from a <p> in the sidebar.
That said, I've seen teams use this as a first pass in latency-sensitive agent loops where the LLM just needs a rough sense of what's on a page. If your use case is "does this page mention topic X?" and you can tolerate false positives, regex stripping might be enough.
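A concrete sketch of that first-pass approach, with one refinement the later agent section also mentions: drop the elements that are almost never content before stripping tags. The list of "noise" tags here is an assumption about what counts as obvious chrome, and regex fundamentally cannot parse HTML reliably — which is the point of this section:

```python
import re

# Remove whole elements that are rarely content, then strip remaining
# tags and collapse whitespace. A sketch, not a robust parser.
NOISE = re.compile(
    r'<(script|style|nav|footer|aside)\b[^>]*>.*?</\1>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r'<[^>]+>')

def strip_html(html: str) -> str:
    html = NOISE.sub(' ', html)          # drop noise elements and their contents
    text = TAG.sub(' ', html)            # strip all remaining tags
    return re.sub(r'\s+', ' ', text).strip()

strip_html('<nav>Home</nav><p>Hello <b>world</b></p><script>x()</script>')
# → 'Hello world'
```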
DOM heuristics
This is where the real work happens, and it's what most production systems should be using.
DOM-based heuristic extractors parse the HTML into a tree, prune elements that are statistically unlikely to be content (<nav>, <footer>, <aside>, known ad classes), then score remaining nodes on signals like text density and link density. High text-to-markup ratio and low link density? Probably article content. High link density? Probably navigation.
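The link-density signal is simple enough to sketch with the standard library. This is a toy scorer in the spirit of those heuristics — not what Readability or Trafilatura actually ship — and `LinkDensity` is an illustrative name:

```python
from html.parser import HTMLParser

# Link density: characters inside <a> tags divided by all text characters.
# High values suggest navigation; low values suggest article content.
class LinkDensity(HTMLParser):
    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._in_link:
            self.linked += n

def link_density(fragment: str) -> float:
    parser = LinkDensity()
    parser.feed(fragment)
    return parser.linked / parser.total if parser.total else 0.0

link_density('<p>A long paragraph of article text with no links.</p>')           # → 0.0
link_density('<li><a href="/">Home</a></li><li><a href="/about">About</a></li>')  # → 1.0
```

Real extractors combine this with text density, element type, class-name hints, and more, but the core intuition is exactly this ratio.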
The major tools in this category:
- Trafilatura — Python library with a multi-algorithm fallback chain (its own heuristics, then readability-lxml, then jusText). Achieved F1 of 0.958 on the ScrapingHub benchmark and best mean F1 across eight datasets in the SIGIR 2023 evaluation3. Outputs plain text, markdown, HTML, XML, or TEI-XML.
- Readability — Mozilla's algorithm, originally built for Firefox Reader View. Scores DOM nodes by character count, comma density, and class name heuristics. Had the highest median F1 in the SIGIR 2023 study (0.970), though its mean was lower than Trafilatura's3. Available as readability-lxml in Python, @mozilla/readability in JavaScript.
- newspaper4k — Python library focused on news articles. Extracts text, authors, publish dates, images. Successor to newspaper3k, which sat unmaintained for years.
- jusText — Boilerplate removal by stop-word frequency analysis. Originally from Masaryk University4. Works well as a component (Trafilatura uses it as a fallback) but less effective standalone on complex layouts.
The SIGIR 2023 benchmark by Bevendorff et al. — which tested 14 extractors across eight datasets — found that heuristic approaches "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad"3. That result surprised a lot of people, but it makes sense. Web pages are wildly heterogeneous. A model trained on news articles generalizes poorly to forums, government sites, or e-commerce pages.
ML-powered extraction
Despite the SIGIR findings, there's a growing category of ML-based extractors designed specifically for LLM pipelines.
Jina ReaderLM-v2 is a 1.5B-parameter model based on Qwen2.5-1.5B-Instruct, trained on a million HTML documents to convert raw HTML directly into clean markdown or structured JSON5. It handles up to 512K tokens of combined input and output, supports 29 languages, and generates complex elements like nested lists, tables, and LaTeX equations. The trade-off is speed: the IndexLM benchmark clocked ReaderLM-v2 at 97.5 seconds per page, compared to 1.4 seconds for their own index-based model1.
That latency makes ReaderLM-v2 impractical for large-scale crawling, but potentially useful for small batches where extraction quality on complex layouts matters more than throughput.
IndexLM takes a different approach entirely. Instead of generating output token by token, it partitions HTML into structure-aware segments and predicts which segment indices contain the relevant content1. This decouples extraction latency from content length — a significant advantage when processing pages with hundreds of thousands of tokens. Their 4B model achieved a 53% token reduction over raw HTML while maintaining extraction quality.
Hybrid and LLM-native tools
A new class of tools has emerged that combines crawling, rendering, and extraction into a single pipeline aimed at LLM consumption.
Crawl4AI is an open-source Python crawler that outputs clean markdown without requiring external API calls6. It launched in mid-2024 and hit 58,000+ GitHub stars within a year — a sign of how much demand exists for this exact workflow. It handles JavaScript rendering via Playwright, extracts content, and formats the output specifically for RAG ingestion or direct LLM prompting. Recent versions added adaptive crawling where the system learns reliable selectors over time.
Firecrawl offers similar functionality as an API service, converting web pages to markdown or structured JSON with JavaScript rendering, proxy management, and configurable extraction7. It's less of a library and more of a managed service — you send a URL, you get clean markdown back.
Both of these tools essentially bundle a headless browser with a content extractor, which solves the JavaScript-rendered content problem that DOM-only extractors like Trafilatura can't handle on their own.
Choosing an extraction method
The right extraction method depends on three variables: what you're building, how much data you're processing, and how much latency you can tolerate.
RAG pipelines
For RAG, extraction quality directly determines retrieval quality. You want clean, well-chunked text with preserved heading structure — headings help semantic chunking algorithms find natural break points.
The default choice is a heuristic extractor outputting markdown. Trafilatura with markdown output preserves headings and lists while stripping boilerplate, and it runs in single-digit milliseconds per page on pre-fetched HTML. For a RAG pipeline processing a few thousand pages, this is the sweet spot.
One interesting exception: the HtmlRAG paper (accepted at WWW 2025) found that keeping pruned, cleaned HTML — not plain text — as the retrieval format actually produced better answers on six QA benchmarks8. Their argument is that HTML structure (tables, heading hierarchy, list nesting) encodes semantic information that plain text loses. They proposed a block-tree pruning strategy that keeps the structure while cutting the noise. It's worth testing if your RAG sources are table-heavy or structurally complex.
Training and fine-tuning data
Scale matters here. If you're processing millions of pages for a training corpus, you need throughput above all. Trafilatura with fast=True (which skips the fallback chain) processes pages in low single-digit milliseconds. The F1 drops slightly, but at corpus scale, the occasional extraction error is noise in the aggregate.
For smaller, curated datasets where every document matters, an ML extractor like ReaderLM-v2 might justify the latency cost. Its ability to preserve complex formatting — code blocks, mathematical notation, table structure — can be valuable for specialized domains.
Common Crawl already provides pre-extracted text in its WET format, derived from over 2 billion pages per monthly crawl9. If your training data comes from Common Crawl, extraction is already done — though the quality of their extraction pipeline may not match a purpose-built one.
Real-time agents
AI agents that browse the web need extraction too, but they operate under tighter latency constraints. A user waiting for an agent to read a web page and answer a question doesn't want to wait 97 seconds for ReaderLM-v2 to process the HTML.
For agent use cases, Crawl4AI or a similar pipeline tool makes sense — it handles rendering and extraction in a single step. If you're building on top of existing infrastructure and just need quick text, regex stripping with some smart heuristics (remove <script>, <style>, <nav>, <footer> before stripping tags) can get you surprisingly far at near-zero latency.
Extraction quality and downstream LLM performance
Does extraction quality actually affect LLM output quality? The answer is yes, measurably — but the relationship isn't always linear.
The Sandia National Laboratories evaluation (2024) tested six content extraction libraries and found significant variance in extraction quality across page types10. Trafilatura scored highest in mean F1, but no extractor dominated every page category. News articles were easy; forum threads and e-commerce pages were hard.
Here's what I find underappreciated: the extraction failure modes matter more than the average F1. A RAG system that occasionally returns a perfect answer and occasionally returns garbage (because it retrieved a chunk full of sidebar text) is worse, from a user trust perspective, than one that returns consistently decent answers. Precision matters more than recall for most LLM applications. You'd rather miss a paragraph of content than include a paragraph of navigation links.
This is why Trafilatura's favor_precision=True flag is relevant for LLM pipelines — it applies more aggressive filtering, which may drop some legitimate content but dramatically reduces noise in the output.
A practical token budget comparison
Here's a concrete example of how extraction choice affects an LLM workflow. Say you're building a RAG system that retrieves 5 source pages per query.
| Extraction method | Tokens per page | 5 pages total | Context % used (128K window) | Monthly cost (1,000 queries) |
|---|---|---|---|---|
| None (raw HTML) | ~223,000 | 1,115,000 | 871% (won't fit) | $2,787 |
| Regex strip | ~12,000 | 60,000 | 47% | $150 |
| Heuristic (Trafilatura) | ~2,000 | 10,000 | 7.8% | $25 |
| ML (ReaderLM-v2) | ~1,900 | 9,500 | 7.4% | $24 |
Raw HTML literally doesn't fit in the context window for a 5-page retrieval. Regex stripping gets you under the limit but burns half your context on noise. Heuristic extraction leaves 92% of your context window available for the system prompt, conversation history, and generation.
The cost difference between heuristic and ML extraction is negligible. The quality difference depends on your source pages. For most web content, heuristic extraction is enough. For pages with complex tables, nested lists, or embedded code — ML extraction may preserve more useful structure.
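The budget math behind the table is a few lines of arithmetic. A sketch, assuming 5 retrieved pages per query, a 128K-token window, $2.50 per million input tokens, and 1,000 queries a month:

```python
# Reproduce the token-budget comparison: context usage and monthly cost
# for a given per-page token count after extraction.
WINDOW = 128_000
PRICE_PER_M_TOKENS = 2.50

def budget(tokens_per_page: int, pages: int = 5, queries: int = 1_000):
    total = tokens_per_page * pages                       # tokens per query
    pct = 100 * total / WINDOW                            # % of context window
    monthly = total / 1_000_000 * PRICE_PER_M_TOKENS * queries
    return total, pct, monthly

budget(223_000)  # raw HTML: ~871% of the window (won't fit), ~$2,787/month
budget(2_000)    # heuristic extraction: ~7.8% of the window, $25/month
```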
The pipeline in practice
A typical extraction pipeline for LLM workloads looks like this:
Fetch — HTTP request or headless browser (Playwright, Puppeteer) for JS-rendered pages. This is where Crawl4AI or Firecrawl help, because they bundle this step with extraction.
Extract — Run the fetched HTML through your chosen extractor. If you're using Trafilatura, that's a single function call: trafilatura.extract(html, output_format='markdown').
Chunk — Split the extracted text into chunks for embedding. Heading-aware chunking (splitting at ## boundaries in markdown output) tends to produce more semantically coherent chunks than fixed-size splitting.
Embed and index — Generate vector embeddings and store them in your vector database. Clean input text here means cleaner embeddings, which means better retrieval.
The extraction step takes single-digit milliseconds for heuristic tools — it's rarely the bottleneck. The bottleneck is usually the fetch step (network latency, JavaScript rendering) or the embed step (API calls to embedding models).
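The heading-aware chunking mentioned in the chunk step can be sketched in a few lines, assuming markdown input with `##` section headings. Production chunkers also enforce token-length limits; this only handles the split:

```python
import re

# Split extracted markdown at ## heading boundaries so each chunk
# covers one section. A sketch of heading-aware chunking.
def chunk_markdown(md: str) -> list[str]:
    parts = re.split(r'(?m)^(?=## )', md)   # zero-width split before each ## line
    return [p.strip() for p in parts if p.strip()]

doc = "# Title\nIntro text.\n## Setup\nSteps here.\n## Usage\nMore text."
chunk_markdown(doc)
# → ['# Title\nIntro text.', '## Setup\nSteps here.', '## Usage\nMore text.']
```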
One thing teams often get wrong: they optimize their embedding model, their chunking strategy, their reranking — and never look at what the extractor is actually producing. Run a manual spot check on 50 extracted pages. If your extractor is including navigation text, cookie banners, or "Related Articles" blocks, no amount of downstream optimization will fix the retrieval quality problem.
What's next for extraction
The field is moving in a few directions simultaneously.
ML-based extraction will likely catch up to heuristic methods on quality while becoming faster. IndexLM's approach of predicting content indices rather than generating text is architecturally promising — it sidesteps the latency problem that makes ReaderLM-v2 impractical at scale.
The HtmlRAG finding — that structured HTML can outperform plain text for RAG — may shift the default away from "extract to plain text" toward "extract to pruned HTML." This would require changes to how we chunk and embed content, but the potential quality gains are real.
Multilingual extraction is still an open problem. A 2025 SIGIR paper by Bournonville et al. published the first multilingual extraction benchmark and found that most extractors struggle outside English-language web content11. As LLM applications expand globally, this gap will need closing.
And the sheer volume of web content keeps growing. Common Crawl captures over 2 billion pages per monthly crawl9, and the extraction quality on that corpus directly shapes what models trained on it can do. The extraction step is invisible to most users, but it's one of the most consequential preprocessing decisions in the entire LLM stack.
Citations
1. An Index-based Approach for Efficient and Effective Web Content Extraction. arXiv:2512.06641, December 2025
2. OpenAI: API Pricing. Retrieved March 27, 2026
3. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023
4. Jan Pomikalek: Removing Boilerplate and Duplicate Content from Web Corpora. PhD dissertation, Masaryk University, 2011
5. Jina AI: ReaderLM-v2. Hugging Face model card. Retrieved March 27, 2026
6. Crawl4AI: GitHub repository. Retrieved March 27, 2026
7. Firecrawl: Documentation. Retrieved March 27, 2026
8. Tian et al.: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. Proceedings of WWW 2025
9. Common Crawl Foundation: Common Crawl. Retrieved March 27, 2026
10. Sandia National Laboratories: An Evaluation of Main Content Extraction Libraries. SAND2024-10208, August 2024
11. Bournonville et al.: Multilingual Benchmarking of Main Content Extractors. SIGIR 2025
Updated: March 23, 2026