XML explained
XML (Extensible Markup Language) is a markup language for encoding documents and data in a format that's both human-readable and machine-parseable. The W3C published the XML 1.0 specification as a Recommendation on February 10, 1998 [1], and while the technology landscape around it has shifted dramatically since then, XML itself hasn't changed much. The fifth edition of XML 1.0, published in 2008, isn't even a new version — it's just accumulated errata fixes [2].
That stability is either XML's greatest strength or its biggest handicap, depending on who you ask.
The SGML backstory
XML didn't come out of nowhere. Its parent is SGML (Standard Generalized Markup Language), and to understand XML's design decisions, you need to know a bit about where SGML came from.
In 1969, Charles Goldfarb, Edward Mosher, and Raymond Lorie at IBM developed GML — Generalized Markup Language — named, with no attempt at subtlety, after their own initials [3]. GML introduced the idea of separating document content from its presentation: instead of embedding formatting instructions directly into text (the way word processors did), you'd mark up the structure — headings, paragraphs, lists — and let a separate process handle how those structures got rendered.
The idea took hold. In 1978, an ANSI committee chaired by Goldfarb started working on a standard that would generalize GML's approach, and after eight years of drafting and revision, ISO published SGML as ISO 8879:1986 [3].
SGML was powerful. It could describe almost any document structure you could imagine. It was also sprawling, complex, and expensive to implement. A conforming SGML parser was a serious piece of software — the kind of thing you'd find in publishing houses and government agencies, not on someone's desktop. By the mid-1990s, the web was exploding, and it needed something lighter.
From SGML to XML: the W3C working group
Jon Bosak at Sun Microsystems saw the gap. In 1996, he organized an XML working group at the W3C with one specific mandate: create a subset of SGML that was simple enough for the web but expressive enough to actually be useful [4].
The core design decisions came fast — between August and November 1996, the group produced the first working draft. Tim Bray (Textuality and Netscape), Jean Paoli (Microsoft), and C. M. Sperberg-McQueen (W3C) served as editors. The working group included people from Adobe, ArborText, Hewlett-Packard, Microsoft, Netscape, Sun Microsystems, and a dozen other organizations [5].
The result was ten design goals, stated right there in section 1.1 of the spec [2]:
- XML shall be straightforwardly usable over the Internet
- XML shall support a wide variety of applications
- XML shall be compatible with SGML
- It shall be easy to write programs which process XML documents
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero
- XML documents should be human-legible and reasonably clear
- The XML design should be prepared quickly
- The design of XML shall be formal and concise
- XML documents shall be easy to create
- Terseness in XML markup is of minimal importance
That last one is worth pausing on. The people who designed XML explicitly decided they didn't care about keeping it short. They cared about clarity, formality, and zero ambiguity. This single decision explains a lot about why XML looks the way it does — and why, years later, developers would gravitate toward JSON's more compact syntax for web APIs.
When XML 1.0 was published on February 10, 1998, Bosak called it "a key technical advance enabling secure electronic commerce and a new generation of distributed applications." Sperberg-McQueen said it was "a great step forward toward the original goals of the World Wide Web" [5]. The W3C Plenary later honored Bosak by reserving the XML name xml:Father for him in perpetuity [6] — which is a very XML way to express gratitude.
XML syntax: the basics
If you've ever seen HTML, you can read XML. But XML is stricter — a lot stricter.
An XML document has elements (the building blocks), attributes (metadata on elements), and text content. Here's a minimal example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<book isbn="978-0-596-00420-7">
  <title>XML in a Nutshell</title>
  <author>Elliotte Rusty Harold</author>
  <year>2004</year>
</book>
```
The <?xml ... ?> line is the XML declaration — it tells the parser which version of XML this is and what character encoding to expect. It's optional but universally recommended.
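Reading a document like this takes a few lines in most languages; a sketch using Python's standard-library `xml.etree.ElementTree` (the choice of library is ours — any conforming parser exposes the same element/attribute/text model):

```python
import xml.etree.ElementTree as ET

# The minimal book document from above.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<book isbn="978-0-596-00420-7">
  <title>XML in a Nutshell</title>
  <author>Elliotte Rusty Harold</author>
  <year>2004</year>
</book>"""

root = ET.fromstring(doc)        # parse into an element tree
print(root.tag)                  # element name: book
print(root.get("isbn"))         # attribute access: 978-0-596-00420-7
print(root.find("title").text)  # child element text: XML in a Nutshell
```

The three building blocks map directly onto the API: elements become tree nodes, attributes are key-value lookups on a node, and text content hangs off each element.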
Well-formedness vs. validity
This is a distinction that trips up a lot of people.
A well-formed XML document follows the syntactic rules: every opening tag has a closing tag (or uses self-closing syntax like <br/>), elements are properly nested, attribute values are quoted, and there's exactly one root element. If any of these rules are broken, the parser must reject the document. Not "try to fix it" — reject it outright. This is the exact opposite of how HTML parsers behave, where browsers will happily render a <p> that's never closed.
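That draconian behavior is easy to demonstrate; a sketch with Python's standard-library parser, fed an unclosed tag:

```python
import xml.etree.ElementTree as ET

# An unclosed <p>: an HTML browser would render this anyway, but an
# XML parser is required to reject it as not well-formed.
broken = "<root><p>unclosed paragraph</root>"

try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError as err:
    well_formed = False
    print(err)  # a "mismatched tag" error pointing at the offending position

print(well_formed)  # False
```

The parser doesn't guess at a repair; the document is simply refused, which is exactly what the spec demands.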
A valid XML document is well-formed and conforms to a schema that defines what elements and attributes are allowed, in what order, and with what constraints. The original schema language was the DTD (Document Type Definition), inherited directly from SGML. DTDs can describe document structure, but they can't express data types — you can say "this element contains text" but you can't say "this element contains an integer between 1 and 100."
That's where XML Schema (XSD) came in. Published as a W3C Recommendation in May 2001 [7], XSD added data typing, namespace support, and a much richer constraint vocabulary. The irony: XSD documents are themselves XML, which makes them powerful but also notoriously verbose.
Namespaces
Namespaces solve the name-collision problem. If you're mixing elements from different XML vocabularies — say, XHTML and SVG in the same document — you need a way to tell them apart. A <table> in XHTML is a data grid; a <table> in a furniture catalog has legs.
Namespaces were standardized in January 1999 [8] as an addendum to XML 1.0. They use URI-based identifiers (which don't need to resolve to anything — they're just unique strings) and prefix syntax:
```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg">
  <body>
    <svg:svg width="100" height="100">
      <svg:circle cx="50" cy="50" r="40"/>
    </svg:svg>
  </body>
</html>
```
Namespaces are conceptually simple but syntactically noisy. They add a lot of characters to documents, and the URI-as-identifier convention confuses people who try to actually visit those URLs.
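Parsers make the URI-as-real-identifier point concrete: once parsed, the prefixes disappear and every name is the URI plus the local name. A sketch with Python's `xml.etree.ElementTree`, which spells that out in "Clark notation" (`{uri}localname`):

```python
import xml.etree.ElementTree as ET

doc = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg">
  <body>
    <svg:svg width="100" height="100">
      <svg:circle cx="50" cy="50" r="40"/>
    </svg:svg>
  </body>
</html>"""

root = ET.fromstring(doc)
# Prefixes are expanded away; only the namespace URI identifies the vocabulary.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# A prefix map lets queries use readable prefixes; these prefixes are
# local to the query and need not match the ones in the document.
ns = {"svg": "http://www.w3.org/2000/svg"}
circle = root.find(".//svg:circle", ns)
print(circle.get("r"))  # 40
```

This is why two documents can use different prefixes for the same vocabulary and still mean the same thing — the prefix is just local shorthand for the URI.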
The ecosystem that grew around XML
XML by itself is just a syntax. What made it genuinely useful in the 2000s was the ecosystem of standards built on top of it.
XSLT (XSL Transformations) is a declarative language for transforming XML documents into other XML documents, HTML, or plain text. It's a functional programming language masquerading as XML markup, which takes a while to get used to. XSLT 1.0 became a W3C Recommendation in November 1999 [9], and XSLT 2.0 followed in January 2007 alongside XPath 2.0 and XQuery 1.0.
XPath is the query language for addressing parts of an XML document. Think CSS selectors but for XML, with the expressiveness of a real path language. XPath 1.0 was published in November 1999 [10], and version 2.0 expanded it with a type system based on XML Schema.
XQuery is SQL for XML — a full query language for extracting and restructuring data from XML documents or databases. It shares its data model and type system with XPath 2.0, which keeps things consistent (if you already know one, you're halfway to knowing the other).
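You don't need a full XPath engine to get a feel for path expressions; Python's standard-library `ElementTree` implements a small, documented subset of XPath 1.0 (paths, wildcards, and simple predicates). A sketch on a made-up two-book document:

```python
import xml.etree.ElementTree as ET

# Illustrative sample data, not from any real catalog.
doc = """<library>
  <book year="1998"><title>XML spec notes</title></book>
  <book year="2004"><title>XML in a Nutshell</title></book>
</library>"""

root = ET.fromstring(doc)

# A simple path: every <title> under a <book> child.
titles = [t.text for t in root.findall("./book/title")]

# A predicate on an attribute, XPath-style.
recent = root.findall(".//book[@year='2004']")

print(titles)                        # ['XML spec notes', 'XML in a Nutshell']
print(recent[0].find("title").text)  # XML in a Nutshell
```

For the full XPath 1.0 (or 2.0) axis and function vocabulary you'd reach for a library like lxml, but the addressing idea is the same.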
For parsing, two models emerged:
DOM (Document Object Model) loads the entire XML document into memory as a tree of nodes. You can navigate anywhere, read any element, modify the tree, and serialize it back. The trade-off is memory: a 500 MB XML file needs at least 500 MB of RAM, and in practice several times more once the text is expanded into node objects.
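A minimal DOM round trip, sketched with Python's standard-library `xml.dom.minidom` (parse, navigate, mutate, serialize):

```python
from xml.dom.minidom import parseString

# DOM: the whole document becomes an in-memory node tree.
dom = parseString("<book><title>XML in a Nutshell</title></book>")

# Navigate to a node and read its text child.
title = dom.getElementsByTagName("title")[0]
print(title.firstChild.data)  # XML in a Nutshell

# Mutate the tree, then serialize it back to markup.
title.firstChild.data = "Effective XML"
print(dom.documentElement.toxml())  # <book><title>Effective XML</title></book>
```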
SAX (Simple API for XML) is event-driven. The parser reads the document sequentially and fires callbacks — "start element," "end element," "text content" — as it encounters them. SAX uses almost no memory regardless of document size, but you can't jump around or look backward. You get one pass [11].
Most modern languages offer both models, plus newer approaches like pull parsing (StAX in Java) and streaming APIs. Python's lxml, which Trafilatura uses internally, gives you both tree-based and event-based access.
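The streaming trade-off is easy to see with ElementTree's `iterparse`, which sits between the two models: it fires start/end events like SAX but hands you real elements, and clearing each element after use keeps memory flat. A sketch over a synthetic in-memory "large" document (the data is invented for the demo):

```python
import io
import xml.etree.ElementTree as ET

# A synthetic large document: 10,000 <item> records.
xml_bytes = b"<items>" + b"<item>42</item>" * 10_000 + b"</items>"

total = 0
# iterparse reads incrementally and fires an "end" event as each
# element is completed -- one pass, SAX-style.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "item":
        total += int(elem.text)
        elem.clear()  # discard the element's content to keep memory flat

print(total)  # 420000
```

The same pattern scales to multi-gigabyte files, because at any moment only the element currently being finished is held in memory.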
XML's golden age: 2000-2010
Between roughly 2000 and 2010, XML was everywhere.
SOAP web services used XML for both the message format and the service description (WSDL). Enterprise Java was drowning in XML configuration files. A typical J2EE application might have web.xml, ejb-jar.xml, application.xml, struts-config.xml, and half a dozen more, all in different schemas. (Developers eventually rebelled against this — the annotation-over-configuration movement in Java was partly a reaction to XML fatigue.)
RSS became the standard for content syndication. Dave Winer published RSS 2.0 in September 2002 [12], and at its peak, every blog, news site, and podcast had an RSS feed. The competing Atom format, standardized as RFC 4287 in 2005 [13], was also XML. Google Reader, the most popular feed reader, shut down in 2013, and RSS never fully recovered as a mainstream technology — though it's still very much alive among a certain kind of user.
SVG (Scalable Vector Graphics) is XML for vector images. W3C published SVG 1.0 in September 2001 [14], and SVG 1.1 followed in 2003. Browser support was inconsistent for years — IE required a plugin until IE9 — but SVG eventually became the default format for icons, logos, and scalable illustrations on the web.
XHTML was the attempt to reformulate HTML as well-formed XML. W3C published XHTML 1.0 as a Recommendation in January 2000, and for a few years, web standards advocates pushed everyone to write strict XHTML. The idea was seductive: if your web pages were valid XML, you could process them with XSLT, validate them with schemas, and mix in other XML vocabularies. In practice, a single unclosed tag would cause the parser to reject the entire page — which is great for data interchange but terrible for web pages authored by humans.
Office Open XML (OOXML) was standardized by Ecma International as ECMA-376 in 2006 and later as ISO/IEC 29500 [15]. When you create a .docx file in Microsoft Word, you're actually creating a ZIP archive full of XML files describing the document's content, styles, relationships, and metadata. It's XML all the way down.
The JSON takeover
Around 2010, something shifted.
REST APIs started replacing SOAP services, and JSON — a data format derived from JavaScript object notation — turned out to be a much better fit for what most APIs actually do: send and receive structured data.
The reasons were pragmatic, not ideological. JSON is less verbose. A simple key-value pair in JSON is {"name": "value"} — in XML, it's <name>value</name> or worse, <entry key="name" value="value"/>. JSON maps directly to native data structures in JavaScript, Python, Ruby, and most modern languages. You don't need a special parser; JSON.parse() is built into every browser.
XML has real advantages over JSON for document-oriented use cases — mixed content (text interleaved with markup), attributes, namespaces, schema validation, XSLT transformations. But most web APIs don't send documents; they send records, lists, and nested objects. For that kind of data, JSON's lighter syntax wins.
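The difference shows up immediately in code; a sketch decoding the same record both ways with Python's standard library (the record itself is invented for the example):

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "Jane Smith", "articles": 3}'
xml_text = "<author><name>Jane Smith</name><articles>3</articles></author>"

# JSON maps straight onto native data structures, types included.
record = json.loads(json_text)
print(record["articles"] + 1)  # 4 -- already an int

# XML gives you a tree of elements; every leaf is a string until you
# convert it yourself (or bring a schema that knows the types).
root = ET.fromstring(xml_text)
print(int(root.find("articles").text) + 1)  # 4 -- explicit conversion
```

For record-shaped data, that one-to-one mapping is most of JSON's appeal.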
The shift was fast. Twitter dropped its XML API response format in 2013. New APIs from startups almost universally chose JSON. By 2015, I'd estimate that fewer than 10% of new public web APIs offered an XML option at all.
But "JSON won the API wars" doesn't mean "XML is dead." Not even close.
Where XML still dominates
Plenty of domains never moved to JSON, because JSON isn't actually a good fit for them.
Document formats — The .docx, .xlsx, and .pptx files from Microsoft Office are XML inside ZIP archives (OOXML). LibreOffice's .odt, .ods, and .odp use ODF (Open Document Format), which is also XML-based. EPUB e-books contain XHTML content files and XML metadata. You interact with these formats every day, even if you never see the XML.
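You can see the "XML inside a ZIP" pattern without opening Word: any .docx opens with Python's `zipfile` module. A sketch that builds a minimal stand-in archive in memory (the part name `word/document.xml` matches OOXML's layout, but the content here is deliberately simplified, not the real WordprocessingML schema):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Stand-in for a .docx: a ZIP archive containing an XML part. Real OOXML
# files hold many parts (styles, relationships, metadata) laid out this way.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml",
                "<document><body><p>Hello from OOXML</p></body></document>")

# Reading it back works the same on a real .docx (with the real schema).
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    root = ET.fromstring(zf.read("word/document.xml"))

print(names)                       # ['word/document.xml']
print(root.find("./body/p").text)  # Hello from OOXML
```

Rename a real .docx to .zip and extract it, and you'll find the same structure: a tree of XML parts.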
Build systems and configuration — Maven's pom.xml is the central configuration file for one of the most widely used Java build systems [16]. Every Android app has an AndroidManifest.xml declaring its components and permissions [17]. .csproj files in .NET, Info.plist files on macOS, SVG files on the web — all XML.
Publishing and technical documentation — DITA (Darwin Information Typing Architecture) is an OASIS standard for authoring modular technical content, and it's XML through and through. Scientific data formats like MathML and CML (Chemical Markup Language) are XML-based. Government procurement systems, healthcare data exchange (HL7 CDA), financial reporting (XBRL) — these industries adopted XML standards in the 2000s and have no compelling reason to rewrite everything in JSON.
Data interchange in enterprise — SOAP isn't glamorous, but it's still running in banks, insurance companies, and government systems. Migrating a SOAP service to REST+JSON is expensive and risky, and if the existing service works, there's no business case for the migration.
The pattern is clear: XML thrives where documents have complex structure, where schemas and validation matter, where content is mixed (text with inline markup), and where backwards compatibility with existing systems is more important than developer convenience.
XML in content extraction
This is where XML intersects with what contextractor does.
When you extract content from a web page — strip the navigation, sidebars, ads, cookie banners — you need to store the result somewhere, in some format. Plain text is the simplest option but throws away all structure. Markdown preserves some hierarchy (headings, lists) with minimal overhead. HTML keeps everything but includes a lot of noise if you're not careful.
XML sits at the structured end of the spectrum. It can encode both the extracted content and its metadata — the page title, author, publication date, source URL — in a single self-describing document.
Trafilatura's custom XML output
Trafilatura, the extraction engine that contextractor uses, defines its own XML schema for extracted content. It's not a standard like TEI or DocBook — it's a practical schema designed for web content extraction [18].
The root element is <doc>, and it carries metadata as attributes:
```xml
<doc sitename="Example News"
     title="Climate Report 2026"
     author="Jane Smith"
     date="2026-03-15"
     url="https://example.com/climate-2026"
     hostname="example.com"
     description="Annual climate assessment"
     categories="environment; science"
     language="en">
  <main>
    <head rend="h1">Climate Report 2026</head>
    <p>The global average temperature rose by...</p>
    <head rend="h2">Regional Findings</head>
    <list>
      <item>Arctic: 2.1°C above baseline</item>
      <item>Tropics: 0.8°C above baseline</item>
    </list>
    <table>
      <row>
        <cell>Region</cell>
        <cell>Change</cell>
      </row>
    </table>
  </main>
  <comments>
    <p>User comment content appears here...</p>
  </comments>
</doc>
```
The <main> element holds the article body, and <comments> holds any user-generated comments found on the page. Inside these containers, you get semantic elements: <head> for headings (with a rend attribute indicating level), <p> for paragraphs, <list> and <item> for lists, <table>, <row>, and <cell> for tabular data, <quote> for block quotes, <ref> for links, <hi> for highlighted text, and <code> for code blocks.
This is more structured than Markdown and less verbose than full XML-TEI. It's a middle ground — enough structure to be machine-processable, compact enough to be practical.
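Because the schema is plain XML, downstream processing is ordinary tree work; a sketch with the standard library on a trimmed-down `<doc>` document (the sample values are illustrative):

```python
import xml.etree.ElementTree as ET

# A trimmed version of Trafilatura's <doc> output format.
doc = """<doc title="Climate Report 2026" author="Jane Smith" date="2026-03-15">
  <main>
    <head rend="h1">Climate Report 2026</head>
    <p>The global average temperature rose by...</p>
  </main>
  <comments>
    <p>User comment content appears here...</p>
  </comments>
</doc>"""

root = ET.fromstring(doc)
# Metadata lives in attributes on the root element...
meta = {k: root.get(k) for k in ("title", "author", "date")}
# ...and the article body is the <main> subtree; <comments> stays separate.
paragraphs = [p.text for p in root.find("main").iter("p")]

print(meta["author"])  # Jane Smith
print(paragraphs[0])   # The global average temperature rose by...
```

Keeping the body and the comments in separate containers means a pipeline can drop user-generated content with a single subtree selection.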
How contextractor outputs XML
In contextractor, you can get XML output in several ways:
CLI — use the --xml flag or --output-format xml:
```shell
trafilatura --xml -u "https://example.com/article"
```
Python API — set output_format="xml":
```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded, output_format="xml")
```
Apify actor — the extracted XML can be saved to the key-value store using saveExtractedXmlToKeyValueStore.
Web playground — select "XML" from the output format options.
You can also control what gets included: --no-comments strips the <comments> section, --no-tables removes table extraction, --formatting preserves bold and italic markup, and --links includes hyperlink targets in the output.
For a comparison of all available formats — plain text, Markdown, HTML, JSON, XML, XML-TEI, and CSV — see the content formats for LLMs overview.
XML vs. XML-TEI
Both are XML, but they serve different purposes. Trafilatura's custom XML (the <doc> schema) is pragmatic — it gives you structured content with metadata in a compact format. XML-TEI follows the Text Encoding Initiative standard, an academic vocabulary for encoding texts that's been developed since 1987 by the TEI Consortium. TEI output includes a full <teiHeader> with bibliographic metadata, uses TEI-compliant element names, and can be validated against the TEI schema.
If you're building an extraction pipeline for an NLP project or feeding content into an LLM, the custom XML format is usually what you want — it's smaller and simpler. If you're working in digital humanities, corpus linguistics, or any field where TEI is the expected interchange format, use XML-TEI. The JSON format is typically the best choice when you need the metadata in a machine-readable structure but don't care about preserving inline markup within the text body.
The "XML is dead" narrative
People have been saying XML is dead since roughly 2012. That says more about the bubble most web developers live in than about XML's actual usage.
What really happened is that XML lost its position as the default format for everything. In the early 2000s, XML was the answer regardless of the question — configuration files, data interchange, web services, document storage, even programming (remember Ant build scripts?). That was never sustainable. JSON turned out to be better for data interchange over HTTP. YAML turned out to be more readable for configuration files (debatable, but the market decided). TOML found its niche. Protocol Buffers and MessagePack handle binary serialization.
But XML kept the domains where it genuinely excels: complex documents with mixed content, validated data interchange where schemas matter, and any system where a twenty-year-old standard can't be replaced just because something newer exists. Every Word document, every Android app, every Maven project, every EPUB book, every SVG icon on every website — that's all XML.
The format isn't going anywhere. It just stopped being the only tool in the box.
Citations
1. W3C: The World Wide Web Consortium Issues XML 1.0 as a W3C Recommendation. Retrieved April 14, 2026.
2. W3C: Extensible Markup Language (XML) 1.0 (Fifth Edition). Retrieved April 14, 2026.
3. Library of Congress: Standard Generalized Markup Language (SGML), ISO 8879:1986. Retrieved April 14, 2026.
4. W3C: XML Development History. Retrieved April 14, 2026.
5. W3C: The World Wide Web Consortium Issues XML 1.0 as a W3C Recommendation. Retrieved April 14, 2026.
6. School of Information Science, University of Pittsburgh: Jon Bosak — Hall of Fame. Retrieved April 14, 2026.
7. W3C: W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. Retrieved April 14, 2026.
8. W3C: Namespaces in XML 1.0 (Third Edition). Retrieved April 14, 2026.
9. W3C: XSL Transformations (XSLT) Version 1.0. Retrieved April 14, 2026.
10. W3C: XML Path Language (XPath) Version 1.0. Retrieved April 14, 2026.
11. SAX Project: SAX — Simple API for XML. Retrieved April 14, 2026.
12. RSS Advisory Board: RSS 2.0 Specification. Retrieved April 14, 2026.
13. IETF: RFC 4287 — The Atom Syndication Format. Retrieved April 14, 2026.
14. W3C: Scalable Vector Graphics (SVG) 1.0 Specification. Retrieved April 14, 2026.
15. Library of Congress: OOXML Format Family — ISO/IEC 29500 and ECMA 376. Retrieved April 14, 2026.
16. Apache Maven: POM Reference. Retrieved April 14, 2026.
17. Android Developers: App manifest overview. Retrieved April 14, 2026.
18. Trafilatura: On the command-line — Trafilatura documentation. Retrieved April 14, 2026.
Updated: April 14, 2026