Why content extraction?

Open any web page in a browser and you see the article. View the source and you see something else entirely — a 150KB blob of nested <div>s, tracking scripts, navigation links, cookie consent modals, ad containers, sidebar widgets, and footer legalese. The text you actually came to read might account for a third of what's on the page. Maybe less.

Web content extraction is the process of pulling that main text out — the article body, the blog post, the product description — and throwing away everything else. It's sometimes called boilerplate removal or main content extraction, and it's been an active research problem since at least 2003, when Gupta et al. published the first formal treatment at WWW 2003 [1].

How it differs from scraping

People mix these up constantly.

Web scraping is about extracting specific structured data from pages — prices, product names, phone numbers, table rows. You typically know the page structure in advance and write selectors to target exactly the fields you want. The output is a dataset with defined columns.

Content extraction is different. You hand it an arbitrary HTML page — one you've never seen before — and it figures out which part is the "real" content. There's no predefined schema. The algorithm has to make an inferential judgment about what a human would consider the main text. That's a harder problem than it sounds, because there's no universal HTML convention for marking main content. The <article> tag has existed since HTML5, sure, but most sites either don't use it or use it inconsistently.

Web crawling is yet another thing — that's the process of discovering and fetching URLs by following links. Crawling finds pages; extraction processes them.

The boilerplate problem

Christian Kohlschütter's 2010 paper at WSDM coined the framing that stuck: boilerplate detection using shallow text features [2]. His insight was that you don't need to understand the page semantically. You just need two numbers per text block: text density (ratio of visible characters to markup) and link density (ratio of anchor text to total text).

Navigation menus are almost entirely links. Article paragraphs rarely are. That single observation gets you surprisingly far.
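
The link-density signal is simple enough to sketch in a few lines. This is a toy illustration of the idea, not boilerpipe's actual implementation — the function name and the example texts are invented for the demo:

```python
def link_density(block_text: str, anchor_text: str) -> float:
    """Fraction of a block's characters that sit inside <a> tags."""
    return len(anchor_text) / max(len(block_text), 1)

# A navigation menu: every character is anchor text.
nav = link_density("Home About Archive Contact", "Home About Archive Contact")

# An article paragraph with a single inline link.
para = link_density(
    "The study, published in the journal last week, surveyed 2,000 readers.",
    "published in the journal",
)

assert nav == 1.0
assert para < 0.4
```

A threshold between those two values already separates most menus from most paragraphs; the real algorithms combine this with block length and context.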

The CleanEval shared task in 2007 — part of the Web as Corpus workshop — had already formalized this as a competition [3]. Give algorithms 734 web pages, see which one extracts the cleanest text. That benchmark drove a wave of research through the late 2000s. Later, the SIGIR 2023 paper by Bevendorff et al. tested 14 extraction tools across eight combined datasets and found something counterintuitive: heuristic approaches still beat neural models on complex, heterogeneous pages [4]. Deep learning is great at many things, but web page structure varies so wildly across domains that models trained on one type of site generalize poorly to others.

How extractors actually work

Most content extractors follow roughly the same pipeline, give or take some steps:

Tree pruning — Parse the HTML into a DOM tree, then strip out elements that are almost never main content: <nav>, <footer>, <aside>, script tags, style blocks, elements with class names like sidebar, cookie-banner, social-share. This is crude but effective at removing obvious noise.
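
A minimal version of this pruning step, using only Python's standard-library HTML parser. Real extractors build a full DOM with lxml; the `Pruner` class here is a simplified stand-in that just skips text inside boilerplate subtrees:

```python
from html.parser import HTMLParser

PRUNE = {"nav", "footer", "aside", "script", "style"}

class Pruner(HTMLParser):
    """Collect visible text, skipping subtrees rooted at boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside pruned subtrees
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in PRUNE:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in PRUNE and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

p = Pruner()
p.feed("<nav>Home</nav><article><p>Real text.</p></article><footer>© 2026</footer>")
print(" ".join(p.chunks))  # -> "Real text."
```

Class-name filtering (sidebar, cookie-banner, and so on) would hook into `handle_starttag` by inspecting `attrs`; it's omitted here to keep the sketch short.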

Block scoring — Walk the remaining DOM nodes and score each one. Text-heavy paragraphs inside <article> or <main> elements score high. Short text fragments inside deeply nested <div>s with lots of links score low. The specific features vary by algorithm — Kohlschütter's boilerpipe uses word count and link density per block; Pomikalek's jusText uses stop-word frequency as a proxy for natural language [5]; Mozilla's Readability scores by character count and comma density.
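
A toy scoring function that blends the features just mentioned — word count, link density, and a bonus for semantic containers. The weights are invented for illustration; each real algorithm tunes its own:

```python
def score_block(text: str, anchor_chars: int, in_article: bool = False) -> float:
    """Higher scores mean 'more likely main content'."""
    words = len(text.split())
    density = anchor_chars / max(len(text), 1)  # link density of this block
    score = words * (1.0 - density)             # long, link-poor blocks win
    if in_article:
        score *= 1.5                            # hypothetical <article> bonus
    return score

menu = score_block("Home News Sports Opinion", anchor_chars=24)
body = score_block(
    "Researchers announced on Tuesday that the survey of two thousand "
    "readers had concluded.",
    anchor_chars=0,
    in_article=True,
)
assert body > menu  # the all-link menu block scores zero
```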

Candidate selection — Pick the highest-scoring contiguous region as the main content. Some tools use a single-pass greedy approach; others (like CETR by Weninger et al.) plot tag ratios as a histogram and use clustering to find the content region [6].
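
CETR's core signal is easy to compute, even if the full method (histogram smoothing plus clustering) is not. A simplified per-line version:

```python
import re

TAG = re.compile(r"<[^>]+>")

def tag_ratios(html: str) -> list:
    """Per-line ratio of visible characters to tag count — CETR's raw signal.

    Content-heavy lines have high ratios; markup-heavy boilerplate lines
    have low ones. The real algorithm smooths and clusters these values.
    """
    ratios = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text = len(TAG.sub("", line).strip())
        ratios.append(text / max(tags, 1))
    return ratios

page = (
    '<div><a href="/">Home</a><a href="/x">X</a></div>\n'
    "<p>A long paragraph of actual article text sits here.</p>"
)
r = tag_ratios(page)
assert r[1] > r[0]  # the content line has a far higher text-to-tag ratio
```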

Fallback — Good extractors have a plan B. Trafilatura, for instance, runs its own heuristic pipeline first, then falls back to readability-lxml, then to jusText, and picks the best result. That multi-algorithm approach is a big reason why it tops the benchmarks.
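
The cascade pattern itself is generic. A sketch with stand-in extractor functions — Trafilatura's real selection logic compares candidate results more carefully than "longest wins":

```python
def extract_with_fallbacks(html: str, extractors) -> str:
    """Try each extractor in order; keep the best (here: longest) result."""
    best = ""
    for extract in extractors:
        try:
            text = extract(html) or ""
        except Exception:
            continue  # a failed extractor just yields to the next one
        if len(text) > len(best):
            best = text
    return best

# Stand-in extractors for the demo.
def primary(html):
    return ""  # the main pipeline found nothing

def plan_b(html):
    return "Recovered article text."

result = extract_with_fallbacks("<html>...</html>", [primary, plan_b])
assert result == "Recovered article text."
```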

What it's used for

Content extraction sits at the start of a lot of pipelines that people don't think about:

The one getting all the attention right now is RAG — retrieval-augmented generation. If you're building an LLM application that answers questions based on web sources, you need clean text to chunk and embed. Feed raw HTML into your vector database and you'll get garbage retrieval. Extraction is the data quality step that makes RAG work.
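
In practice "clean text to chunk and embed" means something like the following, where chunking happens only after extraction. The window and overlap sizes are arbitrary for the example:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split extracted text into overlapping character windows for embedding."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

clean = ("word " * 300).strip()  # pretend this came out of an extractor
chunks = chunk_text(clean)
assert all(len(c) <= 500 for c in chunks)
assert chunks[0][-50:] == chunks[1][:50]  # consecutive chunks share the overlap
```

Run the same chunker over raw HTML and the windows fill up with markup and menu text — which is exactly the garbage-retrieval failure mode described above.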

Browser reading modes are probably the oldest consumer-facing application. Instapaper launched in January 2008 as the first read-later service, built around a content extractor [7]. Safari Reader Mode followed in 2010, Firefox Reader View in 2015 — both descendants of Arc90's Readability bookmarklet from 2009.

Corpus linguistics is where Trafilatura comes from. Researchers building text corpora from web sources need clean, structured output at scale — ideally in TEI-XML format for compatibility with humanities toolchains.

LLM training data at scale. Common Crawl — which has archived over 419 TiB of web data since 2011 — distributes pre-extracted plain text files (WET format) alongside raw HTML, specifically because downstream users need boilerplate-free text [8]. GPT-3's training set was built from a filtered version of Common Crawl.

And then there's the mundane but real stuff: SEO auditing (extracting competitor article text for analysis), content archiving (preserving the meaningful text of pages for digital preservation), accessibility tools (text-to-speech engines that need clean prose without nav clutter).

Tools in the field

The space has accumulated a decent set of libraries over the years. A few highlights:

boilerpipe (2010, Christian Kohlschütter) — the Java library that came out of the WSDM 2010 paper. Available as boilerpy3 in Python. Pioneered the shallow-text-features approach. Not actively maintained anymore, but the ideas influenced everything that followed [2].

Readability (2009, Arc90 Labs) — started as a browser bookmarklet, now maintained by Mozilla as readability.js. Powers Firefox Reader View. The Python port readability-lxml has high median accuracy but can struggle with non-article pages [4].

jusText (2011, Jan Pomikalek) — came out of a PhD dissertation at Masaryk University. Uses language-specific stop-word lists to distinguish natural-language text blocks from navigational fragments [5].

newspaper3k / newspaper4k (2014, Lucas Ou-Yang) — Python library focused on news articles. Includes article extraction plus metadata parsing (authors, publish dates, images). Has had long stretches without maintenance, though newspaper4k picked up development.

Trafilatura (2019, Adrien Barbaresi) — Python library that combines its own heuristic pipeline with readability and jusText as fallbacks. Best mean F1 score (0.883) in the SIGIR 2023 benchmark across eight datasets. Contextractor uses it as its extraction engine.

Why it's harder than it looks

You'd think that with two decades of research, this would be a solved problem. It's not, and there are real reasons for that.

Web pages have no obligation to follow any structural convention. A news site, a forum thread, an e-commerce product page, and a government PDF-turned-HTML all have wildly different DOM structures. An extractor that nails news articles might choke on forum posts.

JavaScript-rendered content is another headache. Single-page apps built with React or Vue load their content dynamically — the initial HTML might be an empty shell. Pure DOM-based extractors see nothing. You need a browser engine (Playwright, Puppeteer) to render the page first, then extract from the rendered DOM.

And multilingual content adds yet another layer. Stop-word based approaches like jusText need language-specific word lists. Text density heuristics work differently when the language uses logographic characters (Chinese, Japanese) versus alphabetic scripts.
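
jusText's stop-word signal can be sketched the same way. Here with a tiny hand-picked English list — the real library ships curated lists for dozens of languages, which is precisely why it needs them per language:

```python
STOPWORDS_EN = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def stopword_ratio(text: str, stopwords=STOPWORDS_EN) -> float:
    """Share of tokens that are function words — high for natural prose,
    near zero for navigation fragments."""
    words = text.lower().split()
    return sum(w in stopwords for w in words) / max(len(words), 1)

prose = stopword_ratio("It is a truth that the reader of a page wants the text.")
nav = stopword_ratio("Home Products Pricing Blog Careers")
assert prose > 0.4
assert nav == 0.0
```

Swap in a Chinese or Japanese page and this breaks twice over: the stop-word list is wrong, and whitespace tokenization doesn't even produce words.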

The SIGIR 2023 evaluation put this in perspective: the simple baseline of stripping all HTML tags and returning everything achieves an F1 of 0.738 [4]. That sounds decent until you realize the remaining 26% gap is where all the hard cases live — and those hard cases are often the ones that matter most.

Citations

  1. Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm: DOM-based Content Extraction of HTML Documents. Proceedings of WWW 2003, pp. 207-214

  2. Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl: Boilerplate Detection using Shallow Text Features. Proceedings of WSDM 2010

  3. Marco Baroni, Francis Chantree, Adam Kilgarriff, Serge Sharoff: CleanEval: a Competition for Cleaning Web Pages. Proceedings of the 3rd Web as Corpus Workshop, 2007

  4. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023

  5. Jan Pomikalek: Removing Boilerplate and Duplicate Content from Web Corpora. PhD dissertation, Masaryk University, 2011

  6. Tim Weninger, William Hsu, Jiawei Han: CETR — Content Extraction via Tag Ratios. Proceedings of WWW 2010, pp. 971-980

  7. Marco Arment: The first read-later service. marco.org, February 21, 2013

  8. Common Crawl Foundation: Common Crawl. Retrieved March 4, 2026

Updated: March 4, 2026