HTML explained

HyperText Markup Language (HTML) is the standard markup language for documents on the web. Every page you've ever loaded in a browser — whether it's a static blog post or a single-page app with 400KB of JavaScript — starts as HTML. The browser fetches it, parses it into a tree structure called the DOM, and renders what you see on screen.

That parsing step is exactly where content extraction gets interesting.

But before getting into how extraction tools read HTML, it helps to understand where the language came from and how it ended up the way it is. The history explains a lot about why modern HTML looks the way it does — and why scraping it is so hard.

18 tags at CERN

In March 1989, Tim Berners-Lee, a British physicist working at CERN in Geneva, submitted a proposal titled "Information Management: A Proposal" to his manager Mike Sendall [1]. The problem was mundane: CERN had thousands of researchers, experiments, and documents scattered across incompatible systems, and nobody could find anything. Sendall's handwritten note on the cover page — "Vague but exciting" — became one of the most famous marginalia in computing history.

By late 1990, Berners-Lee had built the first web browser (called WorldWideWeb, running on a NeXT computer), the first web server, and the first version of HTML. He published the first website on December 20, 1990 [2].

The initial HTML was almost comically simple. Near the end of 1991, Berners-Lee published a document called "HTML Tags" that listed 18 elements: <title>, <a>, <p>, <h1> through <h6>, <ul>, <ol>, <li>, <dl>, <dt>, <dd>, <address>, and a few others like <nextid> and <isindex> that didn't survive [3]. Not really a specification — just a list of what the markup could do.

Most of these tags came from SGML (Standard Generalized Markup Language), specifically from an in-house CERN documentation format called SGMLguid. The hyperlink tag <a> was the genuinely new part. Everything else was borrowed.

From informal drafts to actual standards

HTML spent its first few years without any formal specification. Different browsers implemented different things, and "works in Mosaic" was the closest thing to a standard.

HTML 2.0 (1995)

The first real standard came on November 24, 1995, when the IETF published HTML 2.0 as RFC 1866 [4]. Authored by Tim Berners-Lee and Dan Connolly, it was given the "2.0" version number to distinguish it from the earlier informal versions and drafts. The spec formalized what browsers already did circa mid-1994: forms and inline images with <img>. (Tables came later, in HTML 3.2.)

HTML 2.0 was already playing catch-up the day it shipped. Browsers had moved ahead of the spec, and features like file uploads and client-side image maps had to be added as supplemental RFCs.

HTML 3.2 (1997)

On January 14, 1997, the W3C published HTML 3.2 as a Recommendation [5]. This was the era of Netscape Navigator and the first browser wars. HTML 3.2 added tables, applets, text flow around images, and the infamous <font> tag. If you were building websites in the late 90s, you remember nested <table> layouts and <font face="Arial" color="#003366"> everywhere.

CSS existed by this point (CSS1 was published in December 1996), but Netscape 3 barely supported it. So everyone used <font> tags and spacer GIFs.

HTML 4.01 (1999)

W3C published HTML 4.0 on December 18, 1997, then a corrected HTML 4.01 on December 24, 1999 [6]. The big deal here was the separation of content from presentation — CSS was now the official way to style things, and HTML was supposed to focus on structure and semantics.

HTML 4.01 introduced three flavors: Strict (no presentational markup), Transitional (backward-compatible with older practices), and Frameset (for pages using frames). In practice, almost everyone used Transitional and kept writing presentational markup anyway.

The XHTML detour

Starting in 1998, the W3C pushed XHTML — an XML-based reformulation of HTML [7]. The idea was that HTML should follow XML's strict syntax rules: lowercase tags, quoted attributes, properly closed elements. <br> became <br/>. <img src=logo.png> became <img src="logo.png" />.

XHTML 1.0, published January 26, 2000, looked almost identical to HTML 4.01. The differences were mostly syntactic:

  • All tags had to be lowercase
  • All attributes had to be quoted
  • All elements had to be explicitly closed
  • Documents had to be well-formed XML

The web community was split. Some developers liked the strictness, since it made documents easier to parse and validate. Others found it impractical: real-world HTML was messy, browsers were forgiving, and nobody wanted their page to break because of a missing closing slash.

Then came XHTML 2.0, which the W3C planned as a ground-up rewrite that would break backward compatibility. That decision pushed things toward a breaking point.

The WHATWG fork

By 2004, web developers were frustrated. The W3C was focused on XHTML 2.0, which wouldn't be backward-compatible with existing web content. Meanwhile, browsers were implementing features that didn't exist in any spec — XMLHttpRequest for Ajax, <canvas> for drawing, things web apps actually needed.

At a W3C workshop in June 2004, representatives from Mozilla and Opera presented a proposal for evolving HTML incrementally rather than replacing it. The W3C membership voted the proposal down in favor of continuing with XML-based replacements [8].

Two days later, Apple, Mozilla, and Opera formed the Web Hypertext Application Technology Working Group (WHATWG) and started working on what would become HTML5 [8]. The founding principle was pragmatic: evolve HTML based on what browsers actually implemented and what web developers actually needed, rather than pursuing theoretical XML purity.

The W3C eventually came around. In 2007, they chartered a new working group to collaborate with WHATWG on HTML5. For a while, both organizations worked from the same specification. But by 2011, they'd diverged again — the W3C wanted to publish finished, versioned specifications, while WHATWG preferred a continuously updated "living standard."

This split persisted until May 28, 2019, when W3C and WHATWG signed a Memorandum of Understanding [9]. The agreement gave WHATWG control over the HTML specification as a living standard, with W3C taking periodic snapshots to publish as formal Recommendations. There's now one canonical HTML spec, maintained by WHATWG at html.spec.whatwg.org.

HTML5 and semantic elements

The W3C published HTML5 as a Recommendation on October 28, 2014 [10], though browsers had been implementing its features for years before that. HTML5 was a massive expansion of what the language could do.

The part that matters most for content extraction is the set of semantic elements: <article>, <section>, <nav>, <aside>, <header>, <footer>, <main>, <figure>, <figcaption>. These tags carry meaning about the role of their content, not just its visual appearance.

Before HTML5, a page structure typically looked like this:

<div id="header">...</div>
<div id="nav">...</div>
<div id="content">
  <div class="post">...</div>
  <div class="sidebar">...</div>
</div>
<div id="footer">...</div>

After HTML5:

<header>...</header>
<nav>...</nav>
<main>
  <article>...</article>
  <aside>...</aside>
</main>
<footer>...</footer>

Same visual result, but the second version tells you (and any program reading the HTML) what each section actually is. An <article> is self-contained content. A <nav> is navigation. An <aside> is tangential content. A machine can use these signals without guessing.

That's the theory, at least. In practice, a huge portion of the web still uses <div> for everything, with meaning conveyed only through class names and IDs — the infamous "div soup." React and other component frameworks generate particularly dense div hierarchies because each component typically wraps its output in a container element.
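When the semantic tags are there, targeting them really does take no guessing. A minimal sketch using Python's standard-library ElementTree (the sample page and names are invented for illustration; ElementTree requires well-formed markup, which is why real extractors use forgiving HTML parsers instead):

```python
import xml.etree.ElementTree as ET

page = """<body>
<header>Site name</header>
<nav><a href="/">Home</a></nav>
<main>
  <article><h1>Title</h1><p>Body text.</p></article>
  <aside>Related links</aside>
</main>
<footer>Copyright</footer>
</body>"""

root = ET.fromstring(page)

# <article> is a direct signal: no class-name heuristics required
article = root.find(".//article")
print("".join(article.itertext()))  # -> TitleBody text.
```

With div soup, that one-line `find` turns into the heuristic scoring described later in this article.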

How browsers parse HTML into the DOM

When a browser receives an HTML document, it doesn't just render the text top-to-bottom. It runs a complex parsing algorithm, specified in detail by the WHATWG HTML standard11, that converts the raw character stream into a tree data structure called the Document Object Model (DOM).

The parser works in two stages.

Tokenization reads the raw HTML character by character and produces tokens — start tags, end tags, attributes, text nodes, comments. The tokenizer is a state machine with roughly 80 distinct states [11]. When it encounters <, it enters a "tag open" state. When it reads a letter, it starts accumulating a tag name. When it hits >, it emits a start tag token. And so on.

Tree construction takes the stream of tokens and builds the DOM tree. This stage is surprisingly forgiving — it handles malformed HTML gracefully, inserting missing tags, closing unclosed elements, and rearranging misnested markup to produce a valid tree. That forgiveness is why <p>First<p>Second works (the parser auto-closes the first <p> when it sees the second one) and why browsers can render almost any garbage HTML you throw at them.
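The split between the two stages is easy to observe with Python's standard-library html.parser, which exposes the token stream but performs no tree construction (TokenLogger is an illustrative name, not part of any library). Notice that no end-tag token ever appears for the unclosed <p>; the auto-close happens later, during tree construction:

```python
from html.parser import HTMLParser

class TokenLogger(HTMLParser):
    """Record the token stream instead of building a tree."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))
    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))
    def handle_data(self, data):
        self.tokens.append(("text", data))

logger = TokenLogger()
logger.feed("<p>First<p>Second")
print(logger.tokens)
# -> [('start', 'p'), ('text', 'First'), ('start', 'p'), ('text', 'Second')]
```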

The resulting DOM is a tree where every node represents an element, text node, comment, or other markup construct. The <html> element is the root. <head> and <body> are its children. Everything nests from there.

document
  └── html
       ├── head
       │    ├── title
       │    └── meta
       └── body
            ├── header
            │    └── nav
            ├── main
            │    ├── article
            │    │    ├── h1
            │    │    └── p
            │    └── aside
            └── footer

This tree structure is what every extraction tool operates on. It isn't reading HTML as text — it's traversing a tree.

Why the DOM matters for extraction

Content extraction tools like Trafilatura don't work with raw HTML strings. They parse the HTML into a tree (Trafilatura uses Python's lxml library for this) and then walk that tree, making decisions about which branches contain actual content and which contain navigation, ads, boilerplate, or other noise.

The extraction algorithm follows a pattern that goes roughly like this:

Pruning — First, strip out elements that almost never contain useful content: <nav>, <footer>, <script>, <style>, known ad container classes, social sharing widgets. If the page uses semantic HTML, this step is straightforward — just skip elements with semantic tags that signal non-content regions.

Scoring — For the remaining nodes, calculate relevance scores based on text density (ratio of text to markup), link density (navigation blocks are mostly links), element type, and text length. A <p> with 200 characters of text and no links scores high. A <div> with five <a> tags and 30 characters of text scores low.

Selection — Pick the subtree with the highest aggregate score as the main content. This is where semantic elements help enormously — if the page has an <article> tag wrapping the main content, the extractor can focus its scoring within that subtree instead of evaluating the entire document.

When a page is div soup with no semantic signals, the extractor has to rely entirely on heuristic scoring. It usually works, but it's less reliable and more likely to include stray sidebar fragments or miss content that's structured unusually.
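The prune/score/select loop can be sketched in a few lines. This is not Trafilatura's actual algorithm — the scoring formula and every name below are invented for the example — and it uses standard-library ElementTree on well-formed markup where a real tool would use a forgiving parser like lxml:

```python
import xml.etree.ElementTree as ET

PRUNE = {"nav", "footer", "script", "style", "aside"}

def prune(el):
    """Drop subtrees that almost never hold main content."""
    for child in list(el):
        if child.tag in PRUNE:
            el.remove(child)
        else:
            prune(child)

def text_len(el):
    return len("".join(el.itertext()).strip())

def link_density(el):
    linked = sum(text_len(a) for a in el.iter("a"))
    return linked / (text_len(el) or 1)

def score(el):
    # Long text, few links -> high score; squaring punishes link-heavy blocks
    return text_len(el) * (1 - link_density(el)) ** 2

def extract_main(markup):
    root = ET.fromstring(markup)
    prune(root)
    candidates = [el for el in root.iter("div") if text_len(el)]
    return max(candidates, key=score)

sample = """<body>
<nav><a href="/">Home</a><a href="/about">About</a></nav>
<div class="content">
  <div class="post">This paragraph is the actual article text, long enough to dominate the scoring.</div>
  <div class="sidebar"><a href="/a">Link one</a><a href="/b">Link two</a></div>
</div>
<footer>Copyright 2026</footer>
</body>"""

print(extract_main(sample).get("class"))  # -> post
```

The sidebar survives pruning (it's just a <div>), but its link density of 1.0 drives its score to zero — which is exactly the kind of judgment call semantic markup would have made unnecessary.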

This is why semantic HTML matters beyond accessibility and SEO. It makes machine reading of web pages fundamentally easier. An <article> tag is a strong signal. A <div class="post-content-wrapper-inner"> is... a guess.

Raw HTML vs. cleaned HTML in extraction pipelines

There's an important distinction that trips people up when they think about HTML in extraction contexts.

Cleaned HTML is the output of an extraction process. A tool like Trafilatura reads a page, identifies the main content, strips the boilerplate, and returns the article text — optionally formatted as HTML with semantic tags preserved. The CSS is gone, the scripts are gone, the ads are gone. What's left is the content markup: headings, paragraphs, lists, links, tables, images. You can see how different content formats compare in practice.

Raw HTML is the original page source, exactly as the server delivered it — every <div>, every <script>, every inline style, every tracking pixel, all of it. Nothing removed, nothing processed.

Both have their uses. Cleaned HTML is what you feed to an LLM or index in a search engine. Raw HTML is what you archive.

Why raw HTML is worth saving

Extraction algorithms improve over time. Trafilatura's heuristics today are better than they were two years ago, and they'll be better again two years from now. If you only save the extracted output, you're locked into whatever the extractor's quality was at the time of processing.

Save the raw HTML, and you can re-extract later with improved settings or entirely different tools. You can debug extraction failures by inspecting what the extractor actually received. You can study how a page's markup changed over time.

For archival purposes, raw HTML is the only format that preserves the complete original document. Extracted Markdown or plain text is a lossy derivative — useful, but not reversible.

Researchers building web corpora care about this. If you're collecting data for NLP training, the ability to re-process your corpus with different extraction parameters is valuable. The Internet Archive stores raw HTML for exactly this reason.

How contextractor handles HTML

This is the part where contextractor works differently from what you might expect.

When you use --save html with contextractor, it saves the raw page source as-is. No extraction. No content identification. No boilerplate removal. Just the complete HTML document that the server returned (or that the headless browser rendered, if JavaScript rendering is enabled).

This is fundamentally different from every other format. When you --save markdown, contextractor runs Trafilatura's extraction pipeline — it identifies the main content, removes the noise, and outputs the result as Markdown. Same for --save text, --save json, or --save xml. Those are all extracted content in different serialization formats.

HTML save is not extracted content. It's the raw input that extraction operates on.

Think of it this way:

Format           What happens                  Output
--save text      Extraction via Trafilatura    Main content as plain text
--save markdown  Extraction via Trafilatura    Main content as Markdown
--save json      Extraction via Trafilatura    Main content + metadata as JSON
--save xml       Extraction via Trafilatura    Main content as XML
--save html      No extraction                 Raw page source, untouched

The HTML save option exists precisely for the reasons described in the previous section — archival, debugging, and re-extraction. You run a batch job over a thousand URLs, save the raw HTML alongside your extracted Markdown, and now you have a complete record. If the extraction botched a page, you can inspect the raw HTML to figure out why. If Trafilatura ships a better algorithm next month, you can re-run extraction on the saved HTML without re-fetching the pages.
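That re-extraction loop is simple enough to sketch. reextract below is a hypothetical helper, not part of contextractor; the extractor argument is any callable that maps an HTML string to extracted text, so you could pass in trafilatura.extract or anything else:

```python
from pathlib import Path

def reextract(archive_dir, extractor, out_dir):
    """Re-run extraction over previously saved raw HTML files.

    extractor: any callable mapping an HTML string to extracted
    text (or None on failure), e.g. trafilatura.extract.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in sorted(Path(archive_dir).glob("*.html")):
        text = extractor(page.read_text(encoding="utf-8"))
        if text:  # skip pages the extractor could not handle
            target = out / page.with_suffix(".md").name
            target.write_text(text, encoding="utf-8")
```

Because the raw HTML sits on disk, swapping in a better extractor next month is a one-argument change, with no re-fetching involved.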

One more thing: HTML save is CLI-only. It's available through contextractor's command-line interface with the --save html flag, but it's not exposed in the playground at contextractor.com. The playground is designed for previewing extraction results, and since HTML save bypasses extraction entirely, there's nothing to preview.

The HTML to Markdown article covers the conversion side — what happens when you take that raw HTML and actually extract content from it.

The living standard

HTML doesn't have version numbers anymore — not officially. Since the 2019 WHATWG/W3C agreement, the canonical specification is the HTML Living Standard, continuously updated at html.spec.whatwg.org [12]. When people say "HTML5," they're really talking about the living standard plus whatever features their target browsers support.

The living standard adds new elements and APIs through a proposal process managed by the WHATWG Steering Group, which consists of representatives from the four major browser engines: Apple (WebKit), Google (Blink), Mozilla (Gecko), and Microsoft (also Blink, since Edge switched in 2019) [8].

Recent additions include the <search> element (added in 2023, for wrapping search forms and related UI), the Popover API, and various improvements to form controls. The pace of change is slower than the early HTML5 days, but the specification is never "finished."

For content extraction, the living standard matters because new semantic elements keep showing up. Any extraction tool that hardcodes its understanding of HTML structure needs periodic updates. The <search> element, for instance, is a signal that the enclosed content is UI, not article text — similar to <nav>. An extractor that doesn't know about <search> might include search form markup in its output.

The web keeps evolving, and HTML evolves with it. For anyone building tools that read web pages — whether that's a browser, a screen reader, or a content extractor — understanding the structure of HTML isn't optional. It's the foundation everything else rests on.

Citations

  1. CERN: A short history of the Web. Retrieved April 14, 2026

  2. W3C: The original proposal of the WWW, HTMLized. Retrieved April 14, 2026

  3. W3C: HTML Tags. Retrieved April 14, 2026

  4. IETF RFC 1866: Hypertext Markup Language - 2.0. Retrieved April 14, 2026

  5. W3C: HTML 3.2 Reference Specification. Retrieved April 14, 2026

  6. W3C: HTML 4.01 Specification. Retrieved April 14, 2026

  7. W3C: XHTML 1.0: The Extensible HyperText Markup Language. Retrieved April 14, 2026

  8. WHATWG: FAQ. Retrieved April 14, 2026

  9. W3C: Memorandum of Understanding Between W3C and WHATWG. Retrieved April 14, 2026

  10. W3C: HTML5 — A vocabulary and associated APIs for HTML and XHTML. Retrieved April 14, 2026

  11. WHATWG: HTML Standard — Parsing HTML documents. Retrieved April 14, 2026

  12. WHATWG: HTML Standard. Retrieved April 14, 2026

Updated: April 14, 2026