Plain text explained

Plain text is text with no formatting, no structure, and no metadata attached. No bold, no headings, no font sizes — just a sequence of characters. When you strip a web page down to its content and throw away the HTML tags, the CSS, the JavaScript, and the cookie banners, what remains is plain text.

That sounds trivially simple, and in a way it is. But the definition hides decades of engineering decisions about which characters are available, how they're stored as bytes, and what counts as a "character" in the first place. The story of plain text is really the story of character encoding — and that story is messier than you'd expect.

128 characters should be enough for anyone

The modern notion of plain text starts with ASCII — the American Standard Code for Information Interchange. Work on the standard began on October 6, 1960, when the American Standards Association (ASA) convened its X3.2 subcommittee, and the first version was published as ASA X3.4 on June 17, 1963 [1].

ASCII is a 7-bit encoding. Seven bits give you 128 possible values (2^7), and the committee assigned meanings to all of them: 33 control characters (things like carriage return, line feed, bell) and 95 printable characters. Those 95 printable characters covered uppercase and lowercase Latin letters, digits 0-9, punctuation, and a handful of symbols like @, #, and &.
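The split between control and printable characters is easy to verify: control characters occupy values 0-31 plus 127 (DEL), and the printable range is 32-126. A quick Python sketch:

```python
# ASCII assigns meanings to all 128 values of a 7-bit unit.
control = [c for c in range(128) if c < 32 or c == 127]    # CR, LF, bell, DEL, ...
printable = [c for c in range(128) if 32 <= c <= 126]      # letters, digits, punctuation

assert len(control) == 33
assert len(printable) == 95

# A deliberate design touch: upper- and lowercase letters differ by a single bit.
assert ord("A") == 0x41 and ord("a") == 0x61
assert ord("a") - ord("A") == 0x20
```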

The 7-bit constraint wasn't arbitrary — it reflected the hardware reality of the 1960s. Telecommunications equipment worked with 7-bit data units, and the extra bit in an 8-bit byte was often used for parity checking (error detection during transmission) [2]. The entire Internet protocol stack was built on this assumption. RFC 20, published in 1969, established ASCII as the standard format for network interchange, and that decision echoed through every protocol that followed [2].

There's one thing the original 1963 version didn't include: lowercase letters. Those were added in the 1967 revision (X3.4-1967). It's a funny detail — the first ASCII standard only had uppercase, which makes it look oddly like a COBOL program.

The code page chaos

128 characters work fine if you only speak English. The moment you need an accented e (é), a German u-umlaut (ü), or a yen sign (¥), you're out of luck.

Hardware manufacturers noticed this problem almost immediately. The IBM PC, released in August 1981, shipped with Code Page 437 — an 8-bit character set that used all 256 values of a full byte [3]. The lower 128 values matched ASCII; the upper 128 added accented Latin characters, Greek letters, box-drawing symbols, and some mathematical notation. The box-drawing characters (those single and double lines for making tables in a terminal) became iconic — if you used DOS in the 1980s or early 90s, you know exactly what they look like.

The problem was that IBM's solution wasn't the only one. Different manufacturers and different regions created their own 8-bit extensions. ISO 8859-1 (Latin-1) covered Western European languages. ISO 8859-2 handled Central European ones. Windows had Windows-1252, which was almost-but-not-quite identical to ISO 8859-1 (Microsoft filled in the gap between code points 128-159 with characters like curly quotes and the euro sign, where ISO 8859-1 had undefined control codes). The Soviet Union had KOI8-R for Russian. Japan had Shift_JIS and EUC-JP. China had GB2312 and later GBK.

Each of these was its own little universe. A file encoded in Windows-1252 and opened as ISO 8859-2 would turn accented French characters into garbled Polish ones. The Japanese encodings were multi-byte — some characters took one byte, others took two — which made string processing a minefield. You couldn't even reliably count the number of characters in a string without knowing the exact encoding.

This era, roughly from 1981 to the mid-2000s, is what I'd call the encoding wars. Every operating system, every application, every email client had to guess which encoding a particular piece of text was using. They guessed wrong constantly. The result was mojibake — a Japanese term for garbled text that became the universal name for the problem, which is fitting given how badly Japanese text suffered from encoding mismatches.
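The mechanics of mojibake are easy to reproduce: encode text with one 8-bit code page, decode it with another, and the accents mutate. A sketch (the exact garbled output depends on the code-page tables, which Python's codecs implement faithfully):

```python
# A file written in Windows-1252 but read as ISO 8859-2: byte 0xE8 is
# "è" in 1252 but "č" in Latin-2, so French comes out vaguely Czech.
assert "crème".encode("cp1252").decode("iso8859-2") == "crčme"

# The classic web-era mojibake: UTF-8 bytes misread as Windows-1252.
# "é" is C3 A9 in UTF-8; 1252 maps C3 to "Ã" and A9 to "©".
assert "é".encode("utf-8").decode("cp1252") == "Ã©"
```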

Unicode: one table to rule them all

The idea behind Unicode is simple: assign a unique number to every character in every writing system on Earth. One master table instead of hundreds of competing ones.

The project started in the late 1980s at Xerox, where Joe Becker began investigating the feasibility of a universal character set. He coined the name "Unicode" in December 1987 — a blend of "universal" and "code." By February 1988, Apple engineers Lee Collins and Mark Davis had drafted a principles document that laid out the core architecture [4]. The Unicode Consortium was officially incorporated on January 3, 1991, in California, and the first volume of The Unicode Standard, Version 1.0, was published in October of that year [5].

That first version encoded 7,161 characters. As of Unicode 16.0, released September 10, 2024, the standard defines 154,998 characters covering 168 scripts [6]. It includes everything from Latin and Cyrillic to Egyptian hieroglyphs, Braille patterns, musical notation, and — yes — emoji.

A common misconception is that Unicode is an encoding. It isn't. Unicode is a character set — a table that maps abstract characters to numbers (called code points). The code point for the Latin letter "A" is U+0041. The code point for the Japanese hiragana character "あ" is U+3042. The code point for the face-with-bags-under-eyes emoji is U+1FAE9. How those code points get stored as actual bytes in a file — that's the job of an encoding form, and Unicode defines three of them: UTF-8, UTF-16, and UTF-32.
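The distinction shows up clearly in code: the same three code points produce three different byte layouts depending on the encoding form chosen. A sketch using the characters above:

```python
s = "A\u3042\U0001FAE9"   # U+0041, U+3042, U+1FAE9 — three code points, one string

# Code points are encoding-independent...
assert [hex(ord(c)) for c in s] == ["0x41", "0x3042", "0x1fae9"]

# ...but the byte counts differ per encoding form:
assert len(s.encode("utf-8")) == 1 + 3 + 4      # variable width, 1-4 bytes
assert len(s.encode("utf-16-le")) == 2 + 2 + 4  # 16-bit units, surrogate pair for U+1FAE9
assert len(s.encode("utf-32-le")) == 4 + 4 + 4  # fixed 4 bytes per code point
```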

UTF-8: designed on a placemat

The story of UTF-8 is one of the best origin stories in computing.

In September 1992, Ken Thompson and Rob Pike were working on Plan 9, the research operating system at Bell Labs. They had been using the original UTF encoding from ISO 10646 to add Unicode support, but they weren't happy with it. Then they got a phone call from IBM representatives at an X/Open committee meeting, who wanted them to review an alternative encoding proposal called FSS/UTF [7].

Thompson and Pike went to a diner in New Jersey that evening. Over dinner, Thompson sketched out a new encoding scheme on a placemat. The design used variable-width sequences of 1 to 4 bytes per character, was fully backwards-compatible with ASCII (any valid ASCII text is also valid UTF-8), and had a self-synchronizing property that let you pick up a byte stream in the middle and find the start of the next character within a few bytes [7].

That night, Thompson wrote the encoding and decoding routines. Pike modified the C library and graphics code. By the end of the week, Plan 9 was running — and only running — what would become known as UTF-8. They sent their design back to the X/Open committee, who accepted it [7].

The whole thing took about a week from placemat to production system. Pike later said he wished they'd kept the placemat.
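The bit layout from that placemat is compact enough to sketch in a few lines. This is not the Plan 9 source, just the same bit pattern, with error checks omitted:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 bit layout (sketch, no validation)."""
    if cp < 0x80:                       # 0xxxxxxx — ASCII passes through unchanged
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,      # 11110xxx + three continuation bytes
                  0x80 | cp >> 12 & 0x3F, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(0x41) == b"A"                     # ASCII compatibility
assert utf8_encode(0x3042) == "あ".encode("utf-8")   # three bytes: E3 81 82

# The self-synchronizing property: every continuation byte starts 10xxxxxx,
# so a reader dropped mid-stream skips at most three bytes to find a
# character boundary.
```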

Why UTF-8 won

UTF-8 has three properties that made it unstoppable:

Backwards compatibility with ASCII — Any ASCII text file (code points 0-127) is byte-for-byte identical in UTF-8. A program that only understands ASCII will read a UTF-8 file just fine, as long as the content happens to be pure ASCII. This meant existing systems didn't need to be rewritten to start supporting UTF-8; they could adopt it incrementally.

No byte-order issues — UTF-8 works one byte at a time, so there's no question about whether the first byte of a pair is the "big end" or "little end." UTF-16 and UTF-32 have this problem, which is why they need a Byte Order Mark (BOM) to signal endianness [8]. UTF-8 doesn't need a BOM at all, though some software (notably Windows Notepad, for years) insists on adding one anyway.

Compact for Latin text — English and most Western European text uses 1 byte per character in UTF-8, the same as ASCII. Cyrillic, Greek, Arabic, and Hebrew use 2 bytes. CJK characters (Chinese, Japanese, Korean) use 3 bytes. Emoji and rare scripts use 4. This variable-width approach means UTF-8 is storage-efficient for the text that makes up the bulk of the web [8].

The numbers speak for themselves. As of early 2026, UTF-8 is used by 98.9% of all websites surveyed by W3Techs, and 99.9% of the top 1,000 sites [9]. The second most common encoding is ISO 8859-1, at 1.0%. UTF-8 crossed the 50% mark around 2010 and has been essentially the only encoding that matters on the web since roughly 2016.

UTF-16 and UTF-32

UTF-16 uses one or two 16-bit code units per character. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF, which covers most commonly used characters) take 2 bytes. Characters outside it — including emoji and historic scripts — require a surrogate pair of 4 bytes [8].

UTF-16 is the native string encoding in Java, JavaScript, and Windows. If you've ever seen a JavaScript string with \uD83D\uDE00 instead of a smiley face, that's a UTF-16 surrogate pair leaking through the abstraction.
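The surrogate-pair arithmetic is simple: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves placed in the high (D800-DBFF) and low (DC00-DFFF) surrogate ranges. A sketch:

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    cp -= 0x10000                        # 20 bits remain after the subtraction
    return 0xD800 | cp >> 10, 0xDC00 | cp & 0x3FF

# U+1F600 (grinning face) yields exactly the pair from the JavaScript example:
assert surrogate_pair(0x1F600) == (0xD83D, 0xDE00)
```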

UTF-32 uses exactly 4 bytes for every character, no exceptions. It's simple to index (character N is always at byte offset N*4) but wasteful — English text takes four times the space it would in UTF-8. It sees some use in internal processing on Unix systems, but you'll rarely encounter UTF-32 files in the wild.

On the web, neither UTF-16 nor UTF-32 has any meaningful share. They're used internally by runtimes and operating systems, but the wire format is UTF-8.

The MIME type and other practicalities

When plain text travels over HTTP or sits on disk, it carries a MIME type: text/plain. RFC 2046 defines it as the simplest subtype of "text," intended to be displayed as-is with no interpretation of embedded formatting [10]. If no character set is specified, the historical default is US-ASCII, though in practice modern systems assume UTF-8.

A .txt file extension is the convention for plain text files on disk. It tells the operating system to open the file in a text editor rather than trying to parse it as a document format. The extension itself carries no encoding information — a .txt file could be ASCII, UTF-8, UTF-16, or Windows-1252. The actual encoding has to be detected or declared separately.

Line endings

One of the most persistent annoyances in plain text is the line ending problem. It traces back to physical teletypes, where moving to the next line required two operations: a carriage return (CR, moving the print head back to the left margin) and a line feed (LF, advancing the paper by one line) [10].

Different operating systems inherited different conventions:

  • Windows: CR+LF (\r\n, bytes 0x0D 0x0A)
  • Unix/Linux/macOS: LF only (\n, byte 0x0A)
  • Classic Mac OS (pre-OS X): CR only (\r, byte 0x0D)

Unix adopted LF-only because the Multics operating system recognized that two characters for a newline was wasteful and moved the carriage-return logic into device drivers. Unix followed Multics, Linux followed Unix, and macOS switched from CR to LF when Apple moved to a Unix base with OS X in 2001.

This split causes real problems. Open a Windows text file in a bare-bones Unix editor and you get ^M characters at the end of every line. Open a Unix file in old versions of Notepad and the entire file shows up as one long line. Git has an entire configuration option (core.autocrlf) dedicated to managing this mismatch.
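In Python, str.splitlines recognizes all three historical conventions at once, which makes it a safe way to normalize text of unknown origin:

```python
# One string containing all three line-ending conventions:
mixed = "windows\r\nunix\nclassic mac\rlast"

# splitlines splits on \r\n, \n, and bare \r alike:
assert mixed.splitlines() == ["windows", "unix", "classic mac", "last"]

# To rewrite with Unix endings, rejoin and add a trailing newline:
normalized = "\n".join(mixed.splitlines()) + "\n"
assert "\r" not in normalized
```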

The BOM nuisance

The Byte Order Mark (BOM) is the Unicode character U+FEFF, which in UTF-8 encodes as the three bytes EF BB BF. For UTF-16, the BOM serves an actual purpose: it signals whether the byte order is big-endian or little-endian. For UTF-8, it's meaningless — UTF-8 has no byte-order ambiguity.

The Unicode Standard permits a BOM in UTF-8 but doesn't recommend it [8]. The IETF goes further and says protocols that always use UTF-8 "SHOULD forbid use of U+FEFF as a signature."

And yet, Windows software has historically added a UTF-8 BOM to every file it saves. This breaks Unix shell scripts (the #! shebang line doesn't work if there are invisible bytes before it), confuses some CSS and JavaScript parsers, and generally creates the kind of silent, hard-to-debug problems that waste hours.
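Stripping a stray BOM is straightforward; Python even ships a codec for it. A sketch of the shebang problem and the fix:

```python
import codecs

# A shell script saved by a BOM-adding editor: three invisible bytes
# now sit in front of the #! line, so the kernel won't recognize it.
data = codecs.BOM_UTF8 + b'#!/bin/sh\necho "hello"\n'
assert data[:3] == b"\xef\xbb\xbf"

# The "utf-8-sig" codec strips a leading BOM if one is present:
text = data.decode("utf-8-sig")
assert text.startswith("#!")
```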

Plain text in data pipelines

Strip away the encoding history, and plain text has a remarkable property: zero structural overhead. Every character, every token carries semantic meaning. There are no tags to parse, no attributes to skip, no delimiters to account for.

This makes plain text the optimal format for several classes of data processing:

Embedding pipelines — Models like OpenAI's text-embedding-3-large or Cohere's embed-v3 don't understand HTML tags or Markdown syntax. They treat <h2> as a sequence of characters, not as a heading marker. For vector search and similarity matching, plain text means every token contributes to the semantic representation. The format comparison article breaks down the token cost differences in detail.

Classification and sentiment analysis — If you're routing documents by topic or scoring them for tone, formatting carries no useful signal. The ## before a heading and the ** around a bold word add tokens without adding meaning.

Token efficiency — In a plain text extraction of a typical web article, there's roughly zero markup overhead. Markdown adds about 10% more tokens (for ##, -, [text](url) link syntax). HTML adds around 50%. XML-TEI can nearly double the token count. When you're stuffing multiple retrieved documents into an LLM context window, those percentages add up fast.
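The overhead is easy to eyeball with character counts — a crude stand-in for tokens (real tokenizers split text differently, but markup inflation shows up the same way). The snippets below are illustrative, not measured corpus data:

```python
plain = "Plain text explained\nPlain text is text with no formatting."
md = "## Plain text explained\n\nPlain text is text with **no** formatting."
html = "<h2>Plain text explained</h2>\n<p>Plain text is text with <b>no</b> formatting.</p>"

def overhead(s: str) -> float:
    """Fractional size increase relative to the plain-text version."""
    return len(s) / len(plain) - 1

# Markdown adds a little; HTML adds substantially more:
assert 0 < overhead(md) < overhead(html)
```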

That said, plain text isn't always the right choice. For retrieval-augmented generation (RAG) where the model needs to understand document structure — which parts are headings, which are lists, where one section ends and another begins — Markdown preserves that structure at a modest token cost. And for pipelines that need metadata (title, author, publication date), JSON wraps the text in a structured envelope that's machine-parseable without regex.

The choice depends on what happens downstream. For pure embedding and classification, plain text wins. For anything structural, you pay a small token tax for Markdown or HTML.

How Contextractor extracts plain text

Contextractor extracts web content using Trafilatura, a Python library that strips boilerplate, navigation, ads, and cookie banners from web pages, leaving the main content. By default, Trafilatura's output is plain text — it's the simplest format and doesn't require any format flag [11].

Under the hood, Trafilatura uses heuristic scoring (text density, link density) with a fallback chain of extraction algorithms. The best result wins. When the output format is txt, headings lose their hierarchy, links lose their URLs, and tables flatten into space-separated values. What remains is a clean sequence of paragraphs — just the words, in reading order.

CLI

With the Contextractor CLI (installed via pip install contextractor), use the --save text flag:

contextractor https://example.com --save text

This writes a .txt file to the output directory. You can combine multiple formats in one run:

contextractor https://example.com --save text,markdown,json

Apify actor

On the Apify platform, the saveExtractedTextToKeyValueStore input field controls plain text output. Set it to true and the extracted text gets stored as a separate key-value store entry alongside whatever other formats you've enabled.

Playground

In the Contextractor web playground, check the "Plain text" checkbox under output formats before running the extraction. The extracted text appears in the output panel and can be copied or downloaded as a .txt file.

Plain text isn't going away

There's something almost paradoxical about the trajectory here. We went from 128 ASCII characters on 7-bit hardware, through decades of incompatible 8-bit extensions, to a universal standard with nearly 155,000 characters encoded across three different byte representations — and yet the simplest, most useful form of all that complexity is still just: characters in a row, no formatting, no markup.

Plain text predates HTML by three decades. It predates the web, the Internet, Unix, and the C programming language. It'll outlast most of the formats we use today. A .txt file from 1970 is still readable on any modern system, assuming you know the encoding (and if it's ASCII, you don't even need to think about that). Try saying the same about a WordPerfect document or a Flash animation.

For content extraction, plain text is the lowest common denominator in the best possible sense. Every downstream tool can consume it. Every programming language can parse it (there's nothing to parse). Every LLM can process it with zero wasted tokens. It's not always the best format — sometimes you need structure, sometimes you need metadata — but it's always a safe starting point.

Citations

  1. IEEE History Center: American Standard Code for Information Interchange ASCII, 1963. Retrieved April 14, 2026.
  2. RFC 20: ASCII format for Network Interchange. Retrieved April 14, 2026.
  3. IBM: Code Page 437. Retrieved April 14, 2026.
  4. Unicode Consortium: History of Unicode. Retrieved April 14, 2026.
  5. Unicode Consortium: Version One Chronology. Retrieved April 14, 2026.
  6. Unicode Consortium: Unicode 16.0.0. Retrieved April 14, 2026.
  7. Rob Pike: UTF-8 history. Retrieved April 14, 2026.
  8. Unicode Consortium: FAQ — UTF-8, UTF-16, UTF-32 & BOM. Retrieved April 14, 2026.
  9. W3Techs: Usage Statistics of UTF-8 for Websites. Retrieved April 14, 2026.
  10. RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. Retrieved April 14, 2026.
  11. Trafilatura: Python Usage Documentation. Retrieved April 14, 2026.

Updated: April 14, 2026