HTML explained

HyperText Markup Language (HTML) is the standard markup language for documents on the web. Every page you've ever loaded in a browser — whether it's a static blog post or a single-page app with 400KB of JavaScript — starts as HTML. The browser fetches it, parses it into a tree structure called the DOM, and renders what you see on screen.

That parsing step is exactly where content extraction gets interesting.

But before getting into how extraction tools read HTML, it helps to understand where the language came from and how it ended up the way it is. The history explains a lot about why modern HTML looks the way it does — and why scraping it is so hard.

18 tags at CERN

In March 1989, Tim Berners-Lee, a British physicist working at CERN in Geneva, submitted a proposal titled "Information Management: A Proposal" to his manager Mike Sendall [1]. The problem was mundane: CERN had thousands of researchers, experiments, and documents scattered across incompatible systems, and nobody could find anything. Sendall's handwritten note on the cover page — "Vague but exciting" — became one of the most famous marginalia in computing history.

By late 1990, Berners-Lee had built the first web browser (called WorldWideWeb, running on a NeXT computer), the first web server, and the first version of HTML. He published the first website on December 20, 1990 [2].

The initial HTML was almost comically simple. Near the end of 1991, Berners-Lee published a document called "HTML Tags" that listed 18 elements: <title>, <a>, <p>, <h1> through <h6>, <ul>, <ol>, <li>, <dl>, <dt>, <dd>, <address>, and a few others like <nextid> and <isindex> that didn't survive [3]. Not really a specification — just a list of what the markup could do.

Most of these tags came from SGML (Standard Generalized Markup Language), specifically from an in-house CERN documentation format called SGMLguid. The hyperlink tag <a> was the genuinely new part. Everything else was borrowed.

From informal drafts to actual standards

HTML spent its first few years without any formal specification. Different browsers implemented different things, and "works in Mosaic" was the closest thing to a standard.

HTML 2.0 (1995)

The first real standard came on November 24, 1995, when the IETF published HTML 2.0 as RFC 1866 [4]. Authored by Tim Berners-Lee and Dan Connolly, it was given the "2.0" version number to distinguish it from the earlier informal versions and drafts. The spec formalized what browsers already did circa mid-1994: forms and inline images with <img>. (Tables came later, in HTML 3.2.)

HTML 2.0 was already playing catch-up the day it shipped. Browsers had moved ahead of the spec, and features like file uploads and client-side image maps had to be added as supplemental RFCs.

HTML 3.2 (1997)

On January 14, 1997, the W3C published HTML 3.2 as a Recommendation [5]. This was the era of Netscape Navigator and the first browser wars. HTML 3.2 added tables, applets, text flow around images, and the infamous <font> tag. If you were building websites in the late 90s, you remember nested <table> layouts and <font face="Arial" color="#003366"> everywhere.

CSS existed by this point (CSS1 was published in December 1996), but Netscape 3 barely supported it. So everyone used <font> tags and spacer GIFs.

HTML 4.01 (1999)

W3C published HTML 4.0 on December 18, 1997, then a corrected HTML 4.01 on December 24, 1999 [6]. The big deal here was the separation of content from presentation — CSS was now the official way to style things, and HTML was supposed to focus on structure and semantics.

HTML 4.01 introduced three flavors: Strict (no presentational markup), Transitional (backward-compatible with older practices), and Frameset (for pages using frames). In practice, almost everyone used Transitional and kept writing presentational markup anyway.

The XHTML detour

Starting in 1998, the W3C pushed XHTML — an XML-based reformulation of HTML [7]. The idea was that HTML should follow XML's strict syntax rules: lowercase tags, quoted attributes, properly closed elements. <br> became <br/>. <img src=logo.png> became <img src="logo.png" />.

XHTML 1.0, published January 26, 2000, looked almost identical to HTML 4.01. The differences were mostly syntactic:

  • All tags had to be lowercase
  • All attributes had to be quoted
  • All elements had to be explicitly closed
  • Documents had to be well-formed XML

The web community was split. Some developers liked the strictness, since it made documents easier to parse and validate. Others found it impractical: real-world HTML was messy, browsers were forgiving, and nobody wanted their page to break because of a missing closing slash.

Then came XHTML 2.0, which the W3C planned as a ground-up rewrite that would break backward compatibility. That decision pushed things toward a breaking point.

The WHATWG fork

By 2004, web developers were frustrated. The W3C was focused on XHTML 2.0, which wouldn't be backward-compatible with existing web content. Meanwhile, browsers were implementing features that didn't exist in any spec — XMLHttpRequest for Ajax, <canvas> for drawing, things web apps actually needed.

At a W3C workshop in June 2004, representatives from Mozilla and Opera presented a proposal for evolving HTML incrementally rather than replacing it. The W3C membership voted the proposal down in favor of continuing with XML-based replacements [8].

Two days later, Apple, Mozilla, and Opera formed the Web Hypertext Application Technology Working Group (WHATWG) and started working on what would become HTML5 [8]. The founding principle was pragmatic: evolve HTML based on what browsers actually implemented and what web developers actually needed, rather than pursuing theoretical XML purity.

The W3C eventually came around. In 2007, they chartered a new working group to collaborate with WHATWG on HTML5. For a while, both organizations worked from the same specification. But by 2011, they'd diverged again — the W3C wanted to publish finished, versioned specifications, while WHATWG preferred a continuously updated "living standard."

This split persisted until May 28, 2019, when W3C and WHATWG signed a Memorandum of Understanding [9]. The agreement gave WHATWG control over the HTML specification as a living standard, with W3C taking periodic snapshots to publish as formal Recommendations. There's now one canonical HTML spec, maintained by WHATWG at html.spec.whatwg.org.

HTML5 and semantic elements

The W3C published HTML5 as a Recommendation on October 28, 2014 [10], though browsers had been implementing its features for years before that. HTML5 was a massive expansion of what the language could do.

The part that matters most for content extraction is the set of semantic elements: <article>, <section>, <nav>, <aside>, <header>, <footer>, <main>, <figure>, <figcaption>. These tags carry meaning about the role of their content, not just its visual appearance.

Before HTML5, a page structure typically looked like this:

<div id="header">...</div>
<div id="nav">...</div>
<div id="content">
  <div class="post">...</div>
  <div class="sidebar">...</div>
</div>
<div id="footer">...</div>

After HTML5:

<header>...</header>
<nav>...</nav>
<main>
  <article>...</article>
  <aside>...</aside>
</main>
<footer>...</footer>

Same visual result, but the second version tells you (and any program reading the HTML) what each section actually is. An <article> is self-contained content. A <nav> is navigation. An <aside> is tangential content. A machine can use these signals without guessing.

That's the theory, at least. In practice, a huge portion of the web still uses <div> for everything, with meaning conveyed only through class names and IDs — the infamous "div soup." React and other component frameworks generate particularly dense div hierarchies because each component typically wraps its output in a container element.
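When the semantic tags are there, targeting them really does take no guessing. A minimal sketch using Python's standard-library ElementTree (the sample page and names are invented for illustration; ElementTree requires well-formed markup, which is why real extractors use forgiving HTML parsers instead):

```python
import xml.etree.ElementTree as ET

page = """<body>
<header>Site name</header>
<nav><a href="/">Home</a></nav>
<main>
  <article><h1>Title</h1><p>Body text.</p></article>
  <aside>Related links</aside>
</main>
<footer>Copyright</footer>
</body>"""

root = ET.fromstring(page)

# <article> is a direct signal: no class-name heuristics required
article = root.find(".//article")
print("".join(article.itertext()))  # -> TitleBody text.
```

With div soup, that one-line `find` turns into the heuristic scoring described later in this article.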

How browsers parse HTML into the DOM

When a browser receives an HTML document, it doesn't just render the text top-to-bottom. It runs a complex parsing algorithm, specified in detail by the WHATWG HTML standard11, that converts the raw character stream into a tree data structure called the Document Object Model (DOM).

The parser works in two stages.

Tokenization reads the raw HTML character by character and produces tokens — start tags, end tags, attributes, text nodes, comments. The tokenizer is a state machine with roughly 80 distinct states [11]. When it encounters <, it enters a "tag open" state. When it reads a letter, it starts accumulating a tag name. When it hits >, it emits a start tag token. And so on.

Tree construction takes the stream of tokens and builds the DOM tree. This stage is surprisingly forgiving — it handles malformed HTML gracefully, inserting missing tags, closing unclosed elements, and rearranging misnested markup to produce a valid tree. That forgiveness is why <p>First<p>Second works (the parser auto-closes the first <p> when it sees the second one) and why browsers can render almost any garbage HTML you throw at them.
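The split between the two stages is easy to observe with Python's standard-library html.parser, which exposes the token stream but performs no tree construction (TokenLogger is an illustrative name, not part of any library). Notice that no end-tag token ever appears for the unclosed <p>; the auto-close happens later, during tree construction:

```python
from html.parser import HTMLParser

class TokenLogger(HTMLParser):
    """Record the token stream instead of building a tree."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))
    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))
    def handle_data(self, data):
        self.tokens.append(("text", data))

logger = TokenLogger()
logger.feed("<p>First<p>Second")
print(logger.tokens)
# -> [('start', 'p'), ('text', 'First'), ('start', 'p'), ('text', 'Second')]
```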

The resulting DOM is a tree where every node represents an element, text node, comment, or other markup construct. The <html> element is the root. <head> and <body> are its children. Everything nests from there.

document
  └── html
       ├── head
       │    ├── title
       │    └── meta
       └── body
            ├── header
            │    └── nav
            ├── main
            │    ├── article
            │    │    ├── h1
            │    │    └── p
            │    └── aside
            └── footer

This tree structure is what every extraction tool operates on. It isn't reading HTML as text — it's traversing a tree.

Why the DOM matters for extraction

Content extraction tools like Trafilatura don't work with raw HTML strings. They parse the HTML into a tree (Trafilatura uses Python's lxml library for this) and then walk that tree, making decisions about which branches contain actual content and which contain navigation, ads, boilerplate, or other noise.

The extraction algorithm follows a pattern that goes roughly like this:

Pruning — First, strip out elements that almost never contain useful content: <nav>, <footer>, <script>, <style>, known ad container classes, social sharing widgets. If the page uses semantic HTML, this step is straightforward — just skip elements with semantic tags that signal non-content regions.

Scoring — For the remaining nodes, calculate relevance scores based on text density (ratio of text to markup), link density (navigation blocks are mostly links), element type, and text length. A <p> with 200 characters of text and no links scores high. A <div> with five <a> tags and 30 characters of text scores low.

Selection — Pick the subtree with the highest aggregate score as the main content. This is where semantic elements help enormously — if the page has an <article> tag wrapping the main content, the extractor can focus its scoring within that subtree instead of evaluating the entire document.

When a page is div soup with no semantic signals, the extractor has to rely entirely on heuristic scoring. It usually works, but it's less reliable and more likely to include stray sidebar fragments or miss content that's structured unusually.
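The prune/score/select loop can be sketched in a few lines. This is not Trafilatura's actual algorithm — the scoring formula and every name below are invented for the example — and it uses standard-library ElementTree on well-formed markup where a real tool would use a forgiving parser like lxml:

```python
import xml.etree.ElementTree as ET

PRUNE = {"nav", "footer", "script", "style", "aside"}

def prune(el):
    """Drop subtrees that almost never hold main content."""
    for child in list(el):
        if child.tag in PRUNE:
            el.remove(child)
        else:
            prune(child)

def text_len(el):
    return len("".join(el.itertext()).strip())

def link_density(el):
    linked = sum(text_len(a) for a in el.iter("a"))
    return linked / (text_len(el) or 1)

def score(el):
    # Long text, few links -> high score; squaring punishes link-heavy blocks
    return text_len(el) * (1 - link_density(el)) ** 2

def extract_main(markup):
    root = ET.fromstring(markup)
    prune(root)
    candidates = [el for el in root.iter("div") if text_len(el)]
    return max(candidates, key=score)

sample = """<body>
<nav><a href="/">Home</a><a href="/about">About</a></nav>
<div class="content">
  <div class="post">This paragraph is the actual article text, long enough to dominate the scoring.</div>
  <div class="sidebar"><a href="/a">Link one</a><a href="/b">Link two</a></div>
</div>
<footer>Copyright 2026</footer>
</body>"""

print(extract_main(sample).get("class"))  # -> post
```

The sidebar survives pruning (it's just a <div>), but its link density of 1.0 drives its score to zero — which is exactly the kind of judgment call semantic markup would have made unnecessary.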

This is why semantic HTML matters beyond accessibility and SEO. It makes machine reading of web pages fundamentally easier. An <article> tag is a strong signal. A <div class="post-content-wrapper-inner"> is... a guess.

Raw HTML vs. cleaned HTML in extraction pipelines

There's an important distinction that trips people up when they think about HTML in extraction contexts.

Cleaned HTML is the output of an extraction process. A tool like Trafilatura reads a page, identifies the main content, strips the boilerplate, and returns the article text — optionally formatted as HTML with semantic tags preserved. The CSS is gone, the scripts are gone, the ads are gone. What's left is the content markup: headings, paragraphs, lists, links, tables, images. You can see how different content formats compare in practice.

Raw HTML is the original page source, exactly as the server delivered it — every <div>, every <script>, every inline style, every tracking pixel, all of it. Nothing removed, nothing processed.

Both have their uses. Cleaned HTML is what you feed to an LLM or index in a search engine. Raw HTML is what you archive.

Why raw HTML is worth saving

Extraction algorithms improve over time. Trafilatura's heuristics today are better than they were two years ago, and they'll be better again two years from now. If you only save the extracted output, you're locked into whatever the extractor's quality was at the time of processing.

Save the raw HTML, and you can re-extract later with improved settings or entirely different tools. You can debug extraction failures by inspecting what the extractor actually received. You can study how a page's markup changed over time.

For archival purposes, raw HTML is the only format that preserves the complete original document. Extracted Markdown or plain text is a lossy derivative — useful, but not reversible.

Researchers building web corpora care about this. If you're collecting data for NLP training, the ability to re-process your corpus with different extraction parameters is valuable. The Internet Archive stores raw HTML for exactly this reason.

How contextractor handles HTML

This is the part where contextractor works differently from what you might expect.

When you use --save html with contextractor, it saves the raw page source as-is. No extraction. No content identification. No boilerplate removal. Just the complete HTML document that the server returned (or that the headless browser rendered, if JavaScript rendering is enabled).

This is fundamentally different from every other format. When you --save markdown, contextractor runs Trafilatura's extraction pipeline — it identifies the main content, removes the noise, and outputs the result as Markdown. Same for --save text, --save json, or --save xml. Those are all extracted content in different serialization formats.

HTML save is not extracted content. It's the raw input that extraction operates on.

Think of it this way:

Format           What happens                  Output
--save text      Extraction via Trafilatura    Main content as plain text
--save markdown  Extraction via Trafilatura    Main content as Markdown
--save json      Extraction via Trafilatura    Main content + metadata as JSON
--save xml       Extraction via Trafilatura    Main content as XML
--save html      No extraction                 Raw page source, untouched

The HTML save option exists precisely for the reasons described in the previous section — archival, debugging, and re-extraction. You run a batch job over a thousand URLs, save the raw HTML alongside your extracted Markdown, and now you have a complete record. If the extraction botched a page, you can inspect the raw HTML to figure out why. If Trafilatura ships a better algorithm next month, you can re-run extraction on the saved HTML without re-fetching the pages.
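That re-extraction loop is simple enough to sketch. reextract below is a hypothetical helper, not part of contextractor; the extractor argument is any callable that maps an HTML string to extracted text, so you could pass in trafilatura.extract or anything else:

```python
from pathlib import Path

def reextract(archive_dir, extractor, out_dir):
    """Re-run extraction over previously saved raw HTML files.

    extractor: any callable mapping an HTML string to extracted
    text (or None on failure), e.g. trafilatura.extract.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in sorted(Path(archive_dir).glob("*.html")):
        text = extractor(page.read_text(encoding="utf-8"))
        if text:  # skip pages the extractor could not handle
            target = out / page.with_suffix(".md").name
            target.write_text(text, encoding="utf-8")
```

Because the raw HTML sits on disk, swapping in a better extractor next month is a one-argument change, with no re-fetching involved.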

One more thing: HTML save is CLI-only. It's available through contextractor's command-line interface with the --save html flag, but it's not exposed in the playground at contextractor.com. The playground is designed for previewing extraction results, and since HTML save bypasses extraction entirely, there's nothing to preview.

The HTML to Markdown article covers the conversion side — what happens when you take that raw HTML and actually extract content from it.

The living standard

HTML doesn't have version numbers anymore — not officially. Since the 2019 WHATWG/W3C agreement, the canonical specification is the HTML Living Standard, continuously updated at html.spec.whatwg.org [12]. When people say "HTML5," they're really talking about the living standard plus whatever features their target browsers support.

The living standard adds new elements and APIs through a proposal process managed by the WHATWG Steering Group, which consists of representatives from the four major browser engines: Apple (WebKit), Google (Blink), Mozilla (Gecko), and Microsoft (also Blink, since Edge switched in 2019) [8].

Recent additions include the <search> element (added in 2023, for wrapping search forms and related UI), the Popover API, and various improvements to form controls. The pace of change is slower than the early HTML5 days, but the specification is never "finished."

For content extraction, the living standard matters because new semantic elements keep showing up. Any extraction tool that hardcodes its understanding of HTML structure needs periodic updates. The <search> element, for instance, is a signal that the enclosed content is UI, not article text — similar to <nav>. An extractor that doesn't know about <search> might include search form markup in its output.

The web keeps evolving, and HTML evolves with it. For anyone building tools that read web pages — whether that's a browser, a screen reader, or a content extractor — understanding the structure of HTML isn't optional. It's the foundation everything else rests on.

Citations

  1. CERN: A short history of the Web. Retrieved April 14, 2026

  2. W3C: The original proposal of the WWW, HTMLized. Retrieved April 14, 2026

  3. W3C: HTML Tags. Retrieved April 14, 2026

  4. IETF RFC 1866: Hypertext Markup Language - 2.0. Retrieved April 14, 2026

  5. W3C: HTML 3.2 Reference Specification. Retrieved April 14, 2026

  6. W3C: HTML 4.01 Specification. Retrieved April 14, 2026

  7. W3C: XHTML 1.0: The Extensible HyperText Markup Language. Retrieved April 14, 2026

  8. WHATWG: FAQ. Retrieved April 14, 2026

  9. W3C: Memorandum of Understanding Between W3C and WHATWG. Retrieved April 14, 2026

  10. W3C: HTML5 — A vocabulary and associated APIs for HTML and XHTML. Retrieved April 14, 2026

  11. WHATWG: HTML Standard — Parsing HTML documents. Retrieved April 14, 2026

  12. WHATWG: HTML Standard. Retrieved April 14, 2026

Updated: April 14, 2026