What LLMs actually see: how HTML preprocessing impacts AI response quality
Take a news article — say, 1,200 words of reporting on a data breach. View source. The HTML weighs 870KB. That's roughly 223,000 tokens if you feed it to GPT-4 as-is [1]. The article text itself? Maybe 1,600 tokens. The rest is <nav> elements, inline CSS, ad container divs, tracking scripts, cookie consent markup, JSON-LD schema blocks, footer links to pages nobody visits, and a truly impressive number of data- attributes that exist solely for analytics.
You're paying for all of it. And the model is reading all of it.
This matters more than most AI developers realize. The preprocessing step — how you transform raw HTML before it enters the context window — has a measurable effect on response quality, factual accuracy, and hallucination rates. I'd argue it often matters more than which model you pick.
The token allocation problem
The median web page's HTML source (including CSS and JS) is about 870KB — roughly 223K tokens [1]. Strip out the JavaScript and CSS (which account for about 96% of bytes on many pages), and the remaining HTML still contains approximately 9K tokens. The actual article text? Typically under 2K tokens.
That's a signal-to-noise ratio of about 1:100 for raw HTML, or about 1:4 even after basic cleanup.
With a 128K context window, raw HTML lets you fit maybe 1.8 web pages. Extract the text first and you can fit 60. That's not a marginal improvement — it changes what's architecturally possible in a RAG pipeline.
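The arithmetic is worth making explicit. A minimal sketch, taking ~70K tokens for a heavy page's raw HTML (the per-page figure this article returns to later) and ~2K for the extracted text — both rough assumptions, not measurements:

```python
# Back-of-envelope context budgeting with the article's rough figures.
WINDOW = 128_000     # context window, tokens
RAW_PAGE = 70_000    # raw HTML of a heavy page, tokens (assumption)
EXTRACTED = 2_000    # extracted article text, tokens (assumption)

def pages_that_fit(window: int, tokens_per_page: int) -> float:
    """How many pages of a given size fit in the context window."""
    return window / tokens_per_page

print(f"raw HTML:  {pages_that_fit(WINDOW, RAW_PAGE):.1f} pages")   # 1.8
print(f"extracted: {pages_that_fit(WINDOW, EXTRACTED):.0f} pages")  # 64
```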
What goes wrong with noisy input
Here's a real example I ran into. I fed Claude the raw HTML of a Washington Post article about semiconductor tariffs and asked it to summarize the key policy changes. The response included a reference to "Sign in to comment" being a notable policy feature, and it mentioned "Most Read in Business" as if it were a section of the article. Navigation labels from the sidebar leaked into the summary.
This isn't a hallucination in the traditional sense — the model found those words in the input and treated them as content. It can't reliably distinguish text inside a <nav> element from text inside a <p>, because once everything is flattened into a token stream, they look the same.
The problem gets worse with the lost-in-the-middle effect. Liu et al. showed that language model performance drops by over 30% when relevant information sits in the middle of a long context, versus at the beginning or end [2]. Raw HTML almost guarantees that the actual article text lands in the middle, sandwiched between header markup and footer boilerplate. You're hitting the worst case by default.
Ad text is particularly insidious. A financial news page might have ads for investment platforms scattered through the DOM. Ask the model to extract investment advice from the article and it'll happily blend the journalist's reporting with "Open a free trading account today — 0% commission on US stocks." The model doesn't know that's an ad. It's just text in the context.
Four levels of extraction
Not all preprocessing is equal. There's a spectrum, and where you land on it affects everything downstream.
Raw HTML — shove the whole page into the prompt. This is what happens when developers skip the preprocessing step entirely, or when they use a naive fetch() and dump the response body into a template. You're burning 80%+ of your token budget on markup the model can't use productively.
Tag stripping — text_content() or a regex like /<[^>]*>/g. You get all the visible text, but "visible" includes navigation labels, sidebar widgets, related article titles, ad copy, and footer legalese. The model sees "Home About Products Contact Us The semiconductor shortage that started in 2020..." as one unbroken stream.
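A minimal sketch of why this fails — the regex drops the markup but keeps every visible string, so navigation labels fuse with the prose (the HTML snippet is a made-up miniature of the pattern):

```python
import re

html = (
    "<nav><a href='/'>Home</a> <a href='/about'>About</a> "
    "<a href='/products'>Products</a> <a href='/contact'>Contact Us</a></nav>"
    "<article><p>The semiconductor shortage that started in 2020 "
    "reshaped supply chains.</p></article>"
)

# Naive tag stripping: delete anything that looks like a tag, keep the rest.
text = re.sub(r"<[^>]*>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(text)
# Home About Products Contact Us The semiconductor shortage that started
# in 2020 reshaped supply chains.
```

The model receives the nav labels and the article sentence as one undifferentiated stream — exactly the failure mode described above.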
Heuristic extraction — tools like Trafilatura, Mozilla Readability, and newspaper4k. These parse the HTML into a DOM tree, score blocks by text density and link density, prune elements with class names like nav, sidebar, cookie-banner, and pick the highest-scoring contiguous region. The SIGIR 2023 benchmark tested 14 such tools across eight datasets and found that heuristic extractors "perform the best and are most robust across the board, whereas the performance of large neural models is surprisingly bad" [3]. Trafilatura achieved an F1 of 0.958 on the ScrapingHub article extraction benchmark [4].
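The core scoring idea fits in a few lines. This is a toy illustration of link-density scoring using only the standard library — the tag set and the scoring are my simplification, not any particular tool's algorithm:

```python
from html.parser import HTMLParser

class BlockScorer(HTMLParser):
    """Toy content scorer: per block, track total text length vs. the
    portion of that text that sits inside <a> tags (link density)."""

    BLOCK_TAGS = {"p", "div", "nav", "article", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: (tag, text_len, link_text_len)
        self.stack = []    # currently open blocks
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.stack.append([tag, 0, 0])
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if tag in self.BLOCK_TAGS and self.stack:
            self.blocks.append(tuple(self.stack.pop()))

    def handle_data(self, data):
        n = len(data.strip())
        if self.stack and n:
            self.stack[-1][1] += n
            if self.in_link:
                self.stack[-1][2] += n

def link_density(block):
    tag, text_len, link_text_len = block
    return link_text_len / text_len if text_len else 1.0

scorer = BlockScorer()
scorer.feed(
    "<nav><a href='/'>Home</a><a href='/about'>About</a></nav>"
    "<p>The shortage reshaped supply chains across three continents, "
    "and carmakers idled plants for months.</p>"
)
for block in scorer.blocks:
    print(block[0], round(link_density(block), 2))
# The nav block scores 1.0 (all its text is link text); the paragraph 0.0.
```

Real extractors add fallbacks, class-name heuristics, and block-merging on top, but the density signal is doing most of the work.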
ML-based extraction — Jina's ReaderLM-v2 is the current standout here: a 1.5B parameter model with 512K token context, trained specifically to convert HTML to clean Markdown [5]. It handles complex structures like nested tables, code blocks, and LaTeX equations that heuristic extractors sometimes mangle. The tradeoff is latency — running a separate model for extraction adds inference time that heuristic tools don't need.
The HtmlRAG counterpoint
Here's where it gets interesting. Tan et al. published a paper at WWW 2025 arguing that pruned HTML actually outperforms plain text for certain RAG tasks [6]. Their system, HtmlRAG, keeps HTML structure — headings, table markup, list formatting — but aggressively prunes noise through what they call block-tree-based pruning.
The idea is that converting HTML to plain text throws away structural information that helps the model understand document hierarchy. A <h2> followed by <p> tags communicates something different than a flat text block. Table structures encoded in <tr> and <td> convey relationships that disappear in plain text.
They tested across six QA datasets (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5) and found that HtmlRAG with a Phi-3.8B model matched or beat baselines across the board — 68.50% EM on ASQA versus 68.00% for the BGE baseline, for instance [6].
I think this finding is real but easy to over-apply. It works because their pruning is aggressive — they're not feeding raw HTML, they're feeding carefully cleaned HTML where the noise is gone but the structure remains. That's a different thing entirely from dumping a raw page into a prompt. The insight isn't "HTML is better than text" — it's "structure carries information, and you shouldn't throw it away if you can clean the noise without losing it."
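To make "clean the noise, keep the structure" concrete, here is the idea in miniature — my own regex-based simplification, not HtmlRAG's block-tree algorithm, and regexes are genuinely fragile on real-world HTML:

```python
import re

def prune_html(html: str) -> str:
    """Drop noise subtrees wholesale, strip attributes, keep structural tags."""
    # Remove entire noise subtrees: scripts, styles, nav, footers, asides.
    for tag in ("script", "style", "nav", "footer", "aside"):
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    # Strip attributes (class, id, data-*) but keep the opening tags themselves.
    html = re.sub(r"<(\w+)[^>]*>", r"<\1>", html)
    return re.sub(r"\s+", " ", html).strip()

page = (
    "<nav class='top'><a href='/'>Home</a></nav>"
    "<h2 class='hl' data-track='1'>Tariff changes</h2>"
    "<table><tr><td>Chips</td><td>25%</td></tr></table>"
    "<script>trackPageview();</script>"
)
print(prune_html(page))
# <h2>Tariff changes</h2><table><tr><td>Chips</td><td>25%</td></tr></table>
```

The output keeps the heading level and the table relationships — the information HtmlRAG argues is worth a few extra tokens — while the nav, script, and tracking attributes are gone.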
For most practical RAG applications, I'd still start with Trafilatura or a similar extractor and convert to Markdown (which preserves heading hierarchy through # / ## / ###). The HtmlRAG approach makes sense when you need table structures specifically, or when your documents have complex nested formatting that Markdown can't represent well.
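The heading-hierarchy part of that conversion is simple enough to sketch directly — a fragment of what a full HTML-to-Markdown converter does, stdlib only:

```python
import re

def headings_to_markdown(html: str) -> str:
    """Map <h1>..<h6> to #..###### so document hierarchy survives as text."""
    def repl(m):
        level, text = int(m.group(1)), m.group(2).strip()
        return "\n" + "#" * level + " " + text + "\n"
    # \1 backreference ensures the closing tag matches the opening level.
    return re.sub(r"<h([1-6])[^>]*>(.*?)</h\1>", repl, html, flags=re.S | re.I)

print(headings_to_markdown("<h1>Tariffs</h1><h2>Background</h2>"))
```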
The Firecrawl/Jina approach
A whole category of tools has emerged specifically for the HTML-to-LLM-context pipeline. Firecrawl spins up headless browsers and converts pages to structured Markdown. Jina Reader offers an API that turns any URL into clean Markdown or JSON — prepend r.jina.ai/ to a URL and you get extracted text back. Both target the same gap: developers building RAG systems need clean text and don't want to manage extraction infrastructure.
They're useful, but they're also black boxes. You don't control the extraction heuristics, you can't tune precision vs. recall, and you're dependent on an external service for a step that sits between your data and your model. For production systems where content extraction quality directly affects output quality, running your own extraction — even something as simple as Trafilatura — gives you more control.
What the benchmarks say
The Bevendorff et al. SIGIR 2023 study remains the most thorough comparison [3]. They tested 14 extractors across eight evaluation datasets and found:
| Approach | Best tool | Mean F1 | Median F1 | Notes |
|---|---|---|---|---|
| Heuristic | Trafilatura | 0.883 | — | Best overall mean |
| Heuristic | Readability | — | 0.970 | Best overall median |
| Neural | Web2Text | — | — | "Surprisingly weak" on heterogeneous pages |
| Baseline | Strip all tags | 0.738 | — | Just removing HTML tags |
No single tool won every dataset. That's honest and expected — web page structure varies wildly between news sites, forums, e-commerce, and government pages. Trafilatura's fallback chain (its own heuristic, then readability-lxml, then jusText) is specifically designed for this heterogeneity.
The fact that a simple tag-strip baseline hits 0.738 F1 tells you something: most of the work is removing the HTML itself. The remaining gap — from 0.738 to 0.958 — is where intelligent extraction earns its keep, separating navigation from content, ads from articles, sidebars from main text.
Practical recommendations
If you're building an LLM application that ingests web content — a RAG system, a research agent, a summarization pipeline — the extraction step deserves as much attention as your prompt engineering.
For batch processing of known page types (news, blog posts, documentation), Trafilatura with favor_precision=True gives you the cleanest text with minimal noise. It runs in single-digit milliseconds per page and doesn't need a GPU.
For pages with complex formatting — technical documentation with code blocks, research papers with tables and equations — an ML extractor like ReaderLM-v2, with its Markdown output, preserves more structure, at the cost of added latency [5].
For real-time web browsing agents, Firecrawl or Jina Reader handle the browser rendering and extraction in one step. Just know that you're trading control for convenience.
And if you're thinking about feeding HTML directly to a model because context windows are big enough now — they're not. A 128K window sounds generous until you realize a single web page can eat 70K tokens of it. Context windows are a budget. Extraction is how you spend that budget on signal instead of noise.
Citations
1. An Index-based Approach for Efficient and Effective Web Content Extraction. arXiv, December 2025
2. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang: Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 2024
3. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023
4. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131
5. Jina AI: ReaderLM-v2: Frontier Small Language Model for HTML to Markdown and JSON. Retrieved March 27, 2026
6. Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. Proceedings of the ACM Web Conference 2025
Updated: March 25, 2026