XML-TEI explained

XML-TEI is an XML vocabulary defined by the Text Encoding Initiative (TEI) for encoding texts in the humanities, social sciences, and linguistics. If you've built NLP pipelines or worked with web extraction tools, you've probably seen output format options like "plain text," "JSON," "Markdown" -- and then this one mysterious option called "XML-TEI" or "xmltei" that nobody in a typical engineering team picks. It's there for a reason, though. The TEI standard has been around since the late 1980s, predating both XML itself and the World Wide Web, and it remains the dominant encoding scheme for scholarly text corpora.

Contextractor supports XML-TEI as one of its output formats, because the extraction engine underneath -- Trafilatura -- was built by a corpus linguist who needed exactly this kind of structured scholarly output [1].

What TEI actually is

The TEI isn't a file format in the way most developers think of file formats. It's a set of guidelines -- a huge, modular specification that describes how to represent texts and their metadata in XML [2]. The full TEI P5 specification defines 588 elements organized into modules like "core," "header," "textstructure," "drama," "transcription," and many others [3]. You don't use all 588 for every document. Projects pick the modules and elements relevant to their domain, often creating a custom schema (called an ODD -- "One Document Does it all") that constrains the full TEI to just what they need.

Think of it as XML Schema on steroids, specifically designed for texts rather than arbitrary data interchange.

A minimal TEI document looks something like this:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Article Title Here</title>
      </titleStmt>
      <publicationStmt>
        <p>Published by contextractor</p>
      </publicationStmt>
      <sourceDesc>
        <p>Extracted from https://example.com/article</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="entry">
        <p>The actual extracted content goes here.</p>
      </div>
    </body>
  </text>
</TEI>

Even the minimal version is verbose. The <teiHeader> alone requires <fileDesc>, which in turn requires three child elements: <titleStmt>, <publicationStmt>, and <sourceDesc> [4]. That's the price of structured bibliographic metadata -- you can't just slap some text into a file and call it TEI.
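Generating that skeleton by hand is error-prone, so corpus scripts usually build it programmatically. Here is a minimal sketch using Python's standard-library ElementTree; the minimal_tei helper is hypothetical, but the element names and nesting come straight from the guidelines:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def minimal_tei(title: str, source_url: str, body_text: str) -> str:
    """Build the smallest valid-looking TEI skeleton: a <teiHeader> with the
    three mandatory <fileDesc> children, plus a one-paragraph body."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")

    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title

    pub_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}publicationStmt")
    ET.SubElement(pub_stmt, f"{{{TEI_NS}}}p").text = "Unpublished draft"

    source_desc = ET.SubElement(file_desc, f"{{{TEI_NS}}}sourceDesc")
    ET.SubElement(source_desc, f"{{{TEI_NS}}}p").text = f"Extracted from {source_url}"

    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    div = ET.SubElement(body, f"{{{TEI_NS}}}div", {"type": "entry"})
    ET.SubElement(div, f"{{{TEI_NS}}}p").text = body_text

    return ET.tostring(tei, encoding="unicode")
```

Well-formedness is the easy part; whether the result validates against a TEI schema still depends on getting the required children in the required order, which is exactly what the builder above hard-codes.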

Where it came from

The Poughkeepsie meeting (1987)

In November 1987, thirty-two scholars gathered at Vassar College in Poughkeepsie, New York, for a two-day conference about a problem that was annoying everyone in computational humanities: there was no standard way to encode texts for research [5]. Every project used its own format. Sharing texts between institutions meant writing custom conversion scripts every time. The meeting was sponsored by the Association for Computers and the Humanities (ACH) and funded by the U.S. National Endowment for the Humanities.

That meeting produced the Poughkeepsie Principles -- a set of design goals stating that the community needed a standard format for data interchange in humanities research, with encoding conventions suited for various applications, minimal baseline requirements, and compatibility with existing standards [6]. The actual development work was then organized by three sponsoring bodies: the ACH, the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC).

Guidelines versions: P1 through P5

Work moved fast, at least by academic standards. A first draft, TEI P1, shipped in June 1990 [5]. "P" stands for "Proposal" -- the early versions were explicitly framed as drafts for community review, not finished specs.

TEI P2 followed between 1990 and 1993, developed by fifteen working groups that produced detailed recommendations for everything from verse encoding to dictionaries to spoken language transcription.

TEI P3, released in May 1994, was the first version people actually used in production. It was an SGML application -- XML didn't exist yet (the W3C published the XML 1.0 recommendation in February 1998). P3 became the de facto standard for digital text encoding in the humanities, and projects built on it for nearly a decade.

TEI P4 (June 2002) was essentially P3 re-expressed in XML. No major structural changes -- just the SGML-to-XML migration that the rest of the world had already gone through [7].

TEI P5, released in November 2007, was the real overhaul. It introduced new modules for manuscript description, character encoding, graphics, and standoff annotation. It's still the current major version -- as of February 2026, the latest release is P5 version 4.11.0 [8]. The TEI Consortium publishes updates roughly every six months.

The TEI Consortium

For the first decade, the TEI was a project, not an organization. In January 1999, the University of Virginia and the University of Bergen proposed creating a formal membership body -- the TEI Consortium -- to maintain and develop the guidelines long-term [9]. Brown University and Oxford University joined as host institutions. Incorporation completed in December 2000, and the first board took office in January 2001.

The Consortium is still the governing body today, funded by institutional memberships and individual subscribers. It's a non-profit, academically independent organization -- which matters because it means the TEI standard isn't controlled by any single vendor or university.

The teiHeader: metadata done properly

The <teiHeader> is what separates TEI from ordinary XML markup. It's a structured bibliographic description of the document that answers questions like: who wrote this? When? Where was it published? What's the source material? Under what license?

The header's backbone is <fileDesc>, which is mandatory and contains three required children [4]:

  • <titleStmt> -- the title of the work, plus author, editor, funder, and other responsibility statements
  • <publicationStmt> -- who published or distributed the electronic text, under what conditions, and with what identifiers
  • <sourceDesc> -- where the text came from (a print edition, a manuscript, a website, or "born digital")

Beyond <fileDesc>, the header can also include:

  • <encodingDesc> -- how the text was encoded, what tools were used, what editorial principles were followed
  • <profileDesc> -- non-bibliographic information like language, subject keywords, and abstract
  • <revisionDesc> -- a change log for the document

When Trafilatura generates TEI output, it fills the <teiHeader> with whatever metadata it managed to extract from the web page: title, author, publication date, categories, tags, license information, and the source URL [10]. The <encodingDesc> element even records that Trafilatura was the application that produced the output.

This is genuinely useful if you're building a corpus. The metadata travels with the text, in a standardized structure that any TEI-aware tool can parse.
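Reading that metadata back out takes nothing more than namespace-aware lookups. Here is a sketch with the standard library, assuming the header fields shown in the examples on this page; the header_metadata helper itself is hypothetical:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def header_metadata(tei_xml: str) -> dict:
    """Pull basic bibliographic fields out of a TEI document's <teiHeader>.
    Fields that are absent come back as None."""
    root = ET.fromstring(tei_xml)
    file_desc = root.find(f"{TEI}teiHeader/{TEI}fileDesc")

    def text_of(path: str):
        el = file_desc.find(path)
        return el.text if el is not None else None

    return {
        "title": text_of(f"{TEI}titleStmt/{TEI}title"),
        "author": text_of(f"{TEI}titleStmt/{TEI}author"),
        "publisher": text_of(f"{TEI}publicationStmt/{TEI}publisher"),
        "source": text_of(f"{TEI}sourceDesc/{TEI}bibl/{TEI}ref"),
    }
```

The same four lookups work on any TEI P5 document with a conventional <fileDesc>, which is the point of standardized headers: one parser, many corpora.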

Body elements: semantic text markup

Inside <text><body>, TEI gives you elements that map loosely to HTML but carry different semantics. The key ones Trafilatura uses [10]:

  • <p> -- paragraph (same as HTML, but in the TEI namespace)
  • <head> -- heading (not the HTML <head> element -- this is a section heading inside the body)
  • <list> and <item> -- lists and list items
  • <table>, <row>, <cell> -- tables (similar to HTML's <table>, <tr>, <td>)
  • <ref> -- a reference or link (analogous to HTML <a>, with target attribute instead of href)
  • <div> -- a text division, typically with a type attribute like type="entry" or type="comments"
  • <hi> -- highlighted text (bold, italic -- the specific rendering is indicated by rend or rendition attributes)
  • <quote> -- a quotation
  • <code> -- code fragment
  • <graphic> -- image reference

Trafilatura's output keeps only a handful of attributes: rend, rendition, role, target, and type. That's it. No class, no id, no style -- presentational concerns are deliberately stripped away.
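That small element inventory makes the body easy to post-process. As a sketch, here is a hypothetical body_to_markdown helper that flattens the most common elements -- including the <ab type="header"> headings Trafilatura emits in place of <head> -- into Markdown lines:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def body_to_markdown(tei_xml: str) -> str:
    """Flatten common TEI body elements into Markdown lines.
    Handles <head> (and Trafilatura's <ab type="header">), <p>,
    and <list>/<item>; tables, <hi>, and <ref> are left out."""
    body = ET.fromstring(tei_xml).find(f"{TEI}text/{TEI}body")
    lines = []
    for el in body.iter():
        text = "".join(el.itertext()).strip()
        if el.tag == f"{TEI}head" or (
            el.tag == f"{TEI}ab" and el.get("type") == "header"
        ):
            lines.append(f"## {text}")
        elif el.tag == f"{TEI}p":
            lines.append(text)
        elif el.tag == f"{TEI}item":
            lines.append(f"- {text}")
    return "\n".join(lines)
```

A real converter would also handle nesting, tables, and inline <hi> rendering, but the flat version shows how little translation the core elements need.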

One quirk worth noting: Trafilatura's validation step converts <head> elements (section headings) into <ab type="header"> to pass strict TEI validation, since <head> has placement constraints in the TEI schema that extracted web content doesn't always respect [10].

Digital humanities: who uses this and why

If you're a web developer, you might wonder who actually needs all this structure. The answer is: a lot of people, mostly in academia.

Literary corpora -- The Deutsches Textarchiv (German Text Archive), the European Literary Text Collection, the DraCor drama corpus, TextGrid's digital library, and the French Théâtre Classique project all use TEI as their encoding format [11]. When you're building a searchable collection of thousands of literary texts spanning centuries, you need consistent metadata and structural markup. TEI provides that.

Historical document encoding -- Manuscripts, letters, inscriptions, early printed books. The TEI's manuscript description module (msDesc) handles physical descriptions (parchment type, binding, condition) alongside textual content. Projects like the Women Writers Project at Brown University (later Northeastern) have been encoding early modern women's texts in TEI since the early 1990s [12].

Corpus linguistics -- Building large text collections for linguistic research. Adrien Barbaresi, the creator of Trafilatura, works at the Berlin-Brandenburg Academy of Sciences on the DWDS and ZDL digital lexicography projects -- both of which involve constructing web corpora in TEI format [1]. That's literally why Trafilatura has TEI output: the developer needed it for his day job.

Digital editions -- Scholarly editions of texts that need to represent editorial interventions, variant readings, annotations, and apparatus. TEI handles all of this with dedicated modules.

The common thread is that these projects treat text as data -- not just something to display on screen, but something to query, analyze, compare, and preserve for decades.

Why TEI for web extraction

Most web extraction use cases don't need TEI. If you're feeding extracted content into an LLM, plain text or Markdown is cheaper and more practical. But there are specific scenarios where TEI earns its token cost.

Building scholarly corpora from web sources. If your downstream tools expect TEI -- and many digital humanities tools do (TXM, Voyant Tools, eXist-db, TEI Publisher) -- then extracting directly into TEI saves a conversion step. Trafilatura's output won't be as richly annotated as hand-encoded TEI, but it gives you a valid starting point with correct header structure.

Metadata preservation. The <teiHeader> carries bibliographic metadata in a standardized format. Title, author, date, categories, tags, source URL, license -- all structured and parseable. JSON does this too, but TEI's header is a standard that humanities tools already understand.

Validation. You can check whether the output actually conforms to the TEI schema. Try that with Markdown.

Archival use. TEI documents are self-describing. Twenty years from now, any TEI-aware parser can still read them. Plain text files lose their metadata; JSON schemas drift; Markdown has no formal spec (CommonMark exists, but adoption is uneven). TEI has institutional backing and backward compatibility going back to 1994.

TEI validation in Trafilatura

Trafilatura includes a built-in TEI validator. In Python:

from trafilatura import fetch_url, extract

downloaded = fetch_url("https://example.com/article")
result = extract(downloaded, output_format="xmltei", tei_validation=True)

The tei_validation=True flag tells Trafilatura to check the generated XML against the TEI schema before returning it [13]. If validation fails, the function returns None rather than invalid TEI. You can also validate after the fact:

from trafilatura.xml import validate_tei

# validate_tei expects a parsed LXML element rather than a raw string
is_valid = validate_tei(tei_document)

This returns True for valid documents; otherwise it reports the first validation error [13].

On the command line, use --xmltei with --validate-tei:

trafilatura --xmltei --validate-tei --URL "https://example.com/article"

The validation catches structural problems -- missing required elements, elements in wrong positions, invalid attributes. It won't catch semantic issues like wrong metadata values, but it guarantees the output is at least well-formed TEI that other tools can parse without choking.

The token cost problem

Here's the honest trade-off: TEI is expensive in tokens.

The format comparison breaks this down in detail, but the short version is that XML-TEI nearly doubles the token count compared to plain text. A 1,500-word article that tokenizes to roughly 2,000 tokens as plain text runs to about 3,800 tokens as TEI. That overhead comes from two sources:

The <teiHeader> block itself takes 200-400 tokens depending on how much metadata Trafilatura extracted. That's pure overhead -- metadata that exists once per document regardless of content length.

Then the body markup adds tokens for every structural element. <p> and </p> around each paragraph, <head> tags, <item> tags in lists, <cell> and <row> in tables. TEI element names are spelled out in full -- no shorthand, no compression.

For AI pipelines, this overhead usually isn't worth it. If you're chunking documents for RAG, the TEI structure gets broken across chunk boundaries anyway, and the header metadata is only in the first chunk. Markdown gives you heading hierarchy and list structure at roughly 10% overhead. JSON gives you metadata in a machine-parseable wrapper at about 40% overhead.

TEI's token cost makes sense when the downstream consumer is a TEI-aware tool, not an LLM.
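You can get a rough feel for the overhead on your own documents by comparing the serialized length against the bare text content. Character ratios only approximate token ratios, and the markup_overhead helper below is a hypothetical sketch, not a tokenizer:

```python
import xml.etree.ElementTree as ET


def markup_overhead(tei_xml: str) -> float:
    """Ratio of the full serialized document length to the length of its
    bare text content -- a crude proxy for TEI's token overhead."""
    root = ET.fromstring(tei_xml)
    text_chars = len("".join(root.itertext()))
    return len(tei_xml) / max(text_chars, 1)
```

Expect ratios well above 2.0 for short documents, since the <teiHeader> is a fixed cost that dominates when there is little body text, and ratios that fall toward the markup-per-paragraph floor as documents grow.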

How contextractor outputs XML-TEI

Contextractor exposes Trafilatura's TEI format through all its interfaces.

Python / Trafilatura directly:

result = extract(downloaded, output_format="xmltei")

Command line (Trafilatura CLI):

trafilatura --xmltei --URL "https://example.com"

To process a directory of saved HTML files into a TEI corpus:

trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments

Apify actor: Set saveExtractedXmlTeiToKeyValueStore to true in the actor input. The extracted TEI document gets saved to the key-value store, and the dataset output includes a URL to retrieve it.

The file extension for TEI output is .xml -- not .tei.xml as you might expect. Trafilatura doesn't distinguish between its custom XML schema output and TEI output at the file extension level, which is mildly annoying when you have both in the same directory.

What Trafilatura's TEI output actually looks like

Here's a realistic example of what you get when extracting a web article with output_format="xmltei":

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">How CSS Container Queries Work</title>
        <author>Sarah Chen</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Web Dev Blog</publisher>
        <availability>
          <licence>CC BY 4.0</licence>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <ref type="URL">https://example.com/css-container-queries</ref>
        </bibl>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <abstract>
        <p>An introduction to CSS container queries and how they
        differ from media queries.</p>
      </abstract>
      <textClass>
        <keywords>
          <term>CSS</term>
          <term>container queries</term>
          <term>responsive design</term>
        </keywords>
      </textClass>
    </profileDesc>
    <encodingDesc>
      <appInfo>
        <application ident="trafilatura" version="2.0.0">
          <label>Trafilatura</label>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
  <text>
    <body>
      <div type="entry">
        <ab type="header">Container queries vs. media queries</ab>
        <p>Media queries respond to the viewport.
        Container queries respond to a parent element.</p>
        <p>That distinction sounds small, but it changes how you
        architect component-based layouts.</p>
        <list>
          <item>Media queries: global viewport width</item>
          <item>Container queries: local container width</item>
        </list>
      </div>
    </body>
  </text>
</TEI>

Notice the <ab type="header"> instead of <head> for the section heading -- that's the validation workaround mentioned earlier. The <profileDesc> contains keywords extracted from the page's meta tags or article tags. The <encodingDesc> records Trafilatura as the producing application.

This output is valid TEI P5. You can load it into eXist-db, process it with Saxon, index it with TXM, or feed it to any other tool in the TEI ecosystem without modification.

When to pick TEI over other formats

Pick TEI when:

  • Your downstream toolchain expects it (corpus analysis tools, digital edition platforms, XML databases)
  • You need self-documenting files with embedded, standardized metadata
  • You're building a text corpus for humanities or linguistic research
  • Archival longevity matters -- you want files that are readable in 20 years without depending on any specific software
  • You need validation -- the ability to programmatically verify that your output conforms to a schema

Pick something else when:

  • You're feeding text to an LLM (use plain text or Markdown)
  • Token cost matters (TEI nearly doubles your count)
  • You need metadata but not body structure (use JSON)
  • Your team doesn't know XML and doesn't want to learn it

TEI is a specialized format for specialized needs. That doesn't make it niche in the dismissive sense -- it's the dominant standard in its domain, backed by nearly four decades of development and an active consortium. But if you're building a chatbot or a RAG pipeline, you don't need it, and using it anyway just wastes context window space.

Citations

  1. Barbaresi, A.: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL 2021: System Demonstrations.

  2. TEI Consortium: TEI: Text Encoding Initiative. Retrieved April 14, 2026.

  3. TEI Consortium: Appendix C Elements. TEI P5 Guidelines. Retrieved April 14, 2026.

  4. TEI Consortium: 2 The TEI Header. TEI P5 Guidelines. Retrieved April 14, 2026.

  5. TEI Consortium: History. Retrieved April 14, 2026.

  6. TEI Consortium: Poughkeepsie Principles. Retrieved April 14, 2026.

  7. TEI Consortium: iv. About These Guidelines. TEI P5 Guidelines. Retrieved April 14, 2026.

  8. TEI Consortium: P5 Guidelines. Retrieved April 14, 2026.

  9. TEI Consortium: A Consortium Proposal for TEI. Retrieved April 14, 2026.

  10. Barbaresi, A.: trafilatura/xml.py. GitHub. Retrieved April 14, 2026.

  11. CLS INFRA: Corpus Building for Literary History. Survey of Methods in Computational Literary Studies. Retrieved April 14, 2026.

  12. Northeastern University: WWP History. Women Writers Project. Retrieved April 14, 2026.

  13. TEI Consortium: Trafilatura. TEI Wiki. Retrieved April 14, 2026.

Updated: April 14, 2026