Markdown Explained — Lightweight Text Formatting

Markdown is a lightweight markup language that uses plain-text formatting conventions — hash signs for headings, asterisks for emphasis, square brackets for links — to produce structured documents that can be converted to HTML and other formats. It was created in 2004 by John Gruber, with substantial design input from Aaron Swartz, and released as a Perl script on Gruber's blog Daring Fireball¹.

The original pitch was simple: a format that's readable before conversion. You should be able to open a Markdown file in Notepad and understand the structure without any rendering. That property separates it from HTML, where <h2>Section title</h2> is parseable by machines but ugly for humans, and from WYSIWYG editors, whose saved files are opaque binary or verbose XML rather than readable text.

Twenty-two years later, Markdown is everywhere. GitHub, Reddit, Stack Overflow, Discord, Slack, Notion, Obsidian, Jira, Confluence — the list keeps growing. But the reason Markdown matters right now, in 2026, has less to do with blogging and more to do with large language models. It turns out the format that was designed to be readable by humans is also remarkably efficient for machines.

The origin story

John Gruber was a blogger running Daring Fireball, a site about Apple and technology. Aaron Swartz was a 17-year-old programmer who had already co-authored the RSS 1.0 specification and would later co-found Reddit². They collaborated on the Markdown syntax during early 2004, with Swartz serving as what Gruber called the "sole beta-tester", though his contributions went well beyond testing. Swartz wrote html2text, a Python tool that converted HTML back into Markdown, which effectively validated the format's round-trip capability¹.

Gruber announced Markdown on March 15, 2004, alongside Markdown.pl — a Perl script that converted Markdown to HTML using a chain of regular expression substitutions. The last update to Markdown.pl was version 1.0.1, released on December 17, 2004¹. That's it. The canonical implementation hasn't been touched in over two decades.

Gruber described Markdown as "a text-to-HTML conversion tool for web writers" and released it under a BSD-style open source license. He deliberately kept the syntax description informal — more of a guide than a specification. That decision would cause problems later.

Fragmentation

Markdown.pl was a reference implementation, not a rigorous standard. Gruber's syntax description left many edge cases unspecified. What happens when you nest a blockquote inside a list inside another list? How many spaces constitute indentation for a sublist? Can you have blank lines between list items without breaking the list? Gruber's document didn't say, or said something ambiguous, or the Perl script did one thing while the description implied another.

So every new implementation — and there were dozens — made its own choices.

PHP Markdown, Python-Markdown, Pandoc's Markdown parser, GitHub's fork, Reddit's parser, Stack Overflow's parser — they all handled edge cases differently. A document that rendered correctly on GitHub might break on Stack Overflow. Nested lists were a particular disaster: Markdown.pl itself was inconsistent, requiring two spaces of indentation for a first-level sublist but three spaces for a sublist nested one level deeper³.

By 2012, the situation was bad enough that Jeff Atwood — co-founder of Stack Overflow and Discourse — wrote a public blog post calling on Gruber to support a standardization effort. The community had been "accidentally fragmenting Markdown while popularizing it," as Atwood put it⁴.

Gruber didn't respond. Not to the blog post, not to the email Atwood's group sent in November 2012, and not to the follow-up email in August 2014 with a link to their draft specification. They took the silence as implicit approval and launched the project on September 3, 2014, under the name "Standard Markdown."

This time, Gruber responded within hours. He called the name "infuriating" and demanded three things: rename the project, shut down the domain, and apologize⁴. Two days later, the project was renamed to CommonMark.

CommonMark: the specification Markdown never had

CommonMark is a formal specification of Markdown syntax written by John MacFarlane, a philosophy professor at UC Berkeley who also created Pandoc³. MacFarlane had written Markdown parsers in multiple languages and knew better than most where the ambiguities lay. The spec, currently at version 0.31.2 (January 2024), contains over 600 examples that serve as conformance tests³.

Where Gruber's description said "indent by 4 spaces or 1 tab" and left it at that, CommonMark spells out exactly what happens with 1, 2, 3, or 5 spaces, with tabs in different positions, with mixed tabs and spaces, in the context of lists, blockquotes, and code blocks. It's the difference between a recipe that says "cook until done" and one that gives you a temperature and a timer.
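
To make the tab handling concrete, here is a minimal Python sketch of how leading indentation can be measured in columns. It assumes the tab-stop-of-4 rule from the CommonMark spec and deliberately ignores context: a real parser also has to know whether a line sits inside a list item or continues a paragraph. The function names are illustrative and not taken from any library.

```python
def indent_width(line: str, tab_stop: int = 4) -> int:
    """Count leading indentation in columns, advancing tabs to the next tab stop."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # A tab moves the column to the next multiple of tab_stop.
            column += tab_stop - (column % tab_stop)
        else:
            break
    return column


def is_indented_code_candidate(line: str) -> bool:
    """True if a non-blank line has at least four columns of indentation."""
    return line.strip() != "" and indent_width(line) >= 4


# Mixed tabs and spaces: the same visual indentation can be typed several ways.
for sample in ["   x", "    x", "\tx", "  \tx", " \t x"]:
    print(repr(sample), indent_width(sample), is_indented_code_candidate(sample))
```

Three spaces fall short of the threshold, while four spaces, a single tab, and two spaces followed by a tab all land on column 4.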

The effort involved people from GitHub, GitLab, Reddit, Stack Exchange, and Discourse. Reference implementations exist in C (cmark) and JavaScript (commonmark.js), both written by MacFarlane³. Most modern Markdown parsers — including the ones used by GitHub, VS Code, and Obsidian — are now CommonMark-based or CommonMark-compatible.

Gruber never endorsed CommonMark. His position has consistently been that Markdown's ambiguities are features, not bugs — they give implementers freedom to make sensible choices for their context. There's a philosophical argument in there about whether a format designed for human readability should be nailed down to the last whitespace character. But in practice, the inconsistencies were causing real problems for real users, and CommonMark fixed that.

GitHub Flavored Markdown

GitHub started using Markdown for README files, issues, and comments early on, and quickly ran into the limitations of the original syntax. Tables, task lists, strikethrough text, fenced code blocks with language-specific syntax highlighting — none of these existed in Gruber's Markdown. So GitHub added them.

GitHub Flavored Markdown (GFM) is formally defined as a strict superset of CommonMark⁵. That means every valid CommonMark document is also valid GFM, but GFM adds several extensions:

Tables use pipe characters and hyphens to define rows and columns:

| Language | Typing     | First appeared |
|----------|------------|----------------|
| Python   | Dynamic    | 1991           |
| Rust     | Static     | 2010           |
| Go       | Static     | 2009           |
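
Because the syntax is this regular, generating a table from data takes only a few lines. The following is a rough Python sketch, not a library API: `to_gfm_table` is a hypothetical helper, and it assumes every row has the same number of cells as the header.

```python
def to_gfm_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render headers and rows as a pipe table with padded columns."""
    # Escape literal pipes so they are not read as column separators.
    table = [[str(cell).replace("|", "\\|") for cell in row] for row in [headers, *rows]]
    # Each column is as wide as its longest cell, with a minimum of three.
    widths = [max(3, *(len(row[i]) for row in table)) for i in range(len(headers))]

    def fmt(cells: list[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    delimiter = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(table[0]), delimiter, *(fmt(r) for r in table[1:])])


print(to_gfm_table(
    ["Language", "Typing", "First appeared"],
    [["Python", "Dynamic", "1991"], ["Rust", "Static", "2010"], ["Go", "Static", "2009"]],
))
```

The padding is purely cosmetic. Renderers only need the pipes and the delimiter row, but aligned columns keep the source readable, which is the point of the format.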

Task lists add checkboxes to list items:

- [x] Parse HTML
- [x] Extract content
- [ ] Convert to Markdown
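
Task lists are also easy to read back programmatically. This is a small Python sketch that counts completed items; the regular expression is a simplification that only handles bullet markers (not ordered lists or nested content), and `parse_tasks` is a made-up name.

```python
import re

# A bullet marker, a space, then "[ ]" or "[x]", then the item text.
TASK_ITEM = re.compile(r"^\s*[-*+]\s+\[( |x|X)\]\s+(.*)$")


def parse_tasks(markdown: str) -> list[tuple[bool, str]]:
    """Return (checked, text) for every task list item found."""
    tasks = []
    for line in markdown.splitlines():
        match = TASK_ITEM.match(line)
        if match:
            tasks.append((match.group(1) != " ", match.group(2)))
    return tasks


doc = "\n".join([
    "- [x] Parse HTML",
    "- [x] Extract content",
    "- [ ] Convert to Markdown",
])
tasks = parse_tasks(doc)
done = sum(1 for checked, _ in tasks if checked)
print(f"{done}/{len(tasks)} done")  # prints "2/3 done"
```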

Strikethrough uses double tildes: ~~deleted text~~.

Fenced code blocks with backtick syntax and optional language identifiers for syntax highlighting were popularized by GFM (though CommonMark later incorporated them too).

Autolinks automatically turn URLs into clickable links without requiring angle brackets or explicit link syntax.

GFM also defines a disallowed raw HTML extension that filters certain HTML tags for security reasons — you can't embed <script>, <textarea>, or <style> tags in GFM content on GitHub.com⁵.
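
As an illustration of how such a filter can work, here is a Python sketch that neutralizes the opening angle bracket of disallowed tags. The tag list and the replace-with-`&lt;` behavior follow my reading of the tagfilter section of the GFM spec and should be checked against it; the regular expression is my own approximation, not GitHub's implementation.

```python
import re

# Tag names assumed from the GFM spec's "Disallowed Raw HTML" extension.
DISALLOWED = (
    "title", "textarea", "style", "xmp", "iframe",
    "noembed", "noframes", "script", "plaintext",
)

# Match "<" only when it starts an opening or closing tag with a disallowed name.
_TAG_START = re.compile(
    r"<(?=/?(?:%s)(?:[\s/>]|$))" % "|".join(DISALLOWED),
    re.IGNORECASE,
)


def tagfilter(html: str) -> str:
    """Replace the leading '<' of disallowed tags with the entity '&lt;'."""
    return _TAG_START.sub("&lt;", html)


print(tagfilter("<strong>ok</strong> <script>alert(1)</script> <styled>"))
```

Only the leading bracket changes, so the tag shows up as literal text instead of being interpreted by the browser. Tags that merely share a prefix with a disallowed name, such as the made-up `<styled>`, pass through untouched.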

The current GFM spec is version 0.29-gfm, published April 6, 2019, and is maintained by GitHub⁵. Its influence extends far beyond GitHub — when people say "Markdown" today, they usually mean something close to GFM with tables and fenced code blocks.

The syntax, quickly

If you've used any modern text editor or documentation platform, you've probably written Markdown without knowing the formal syntax. The essentials:

Headings use hash characters. One # for H1, two ## for H2, up to six ###### for H6. There's also an older "setext" style where you underline text with = for H1 or - for H2, but almost nobody uses it anymore.

Emphasis uses asterisks or underscores. Single * or _ for italic, double ** or __ for bold. Most people stick with asterisks.

Links come in two flavors. Inline: [visible text](url). Reference-style: [visible text][ref] with [ref]: url defined elsewhere in the document. Reference-style links make long documents more readable, though inline links dominate in practice.
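
The two link styles carry the same information, which a short Python sketch can demonstrate by rewriting reference-style links as inline ones. It is a regex-based approximation with a hypothetical function name: it ignores link titles, code spans, and most edge cases a real parser handles.

```python
import re

# "[label]: url" on its own line.
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$", re.MULTILINE)
# "[text][label]" or the collapsed form "[text][]".
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def inline_reference_links(markdown: str) -> str:
    """Rewrite reference-style links as inline links and drop the definitions."""
    # Labels are matched case-insensitively.
    refs = {label.lower(): url for label, url in DEFINITION.findall(markdown)}
    body = DEFINITION.sub("", markdown)

    def replace(match: re.Match) -> str:
        text = match.group(1)
        label = match.group(2) or text  # "[text][]" reuses the text as the label
        url = refs.get(label.lower())
        return f"[{text}]({url})" if url else match.group(0)

    return REFERENCE.sub(replace, body).strip()


doc = (
    "See the [spec][cm] and [Pandoc][].\n\n"
    "[cm]: https://spec.commonmark.org/\n"
    "[pandoc]: https://pandoc.org/\n"
)
print(inline_reference_links(doc))
```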

Images are identical to links but with a leading exclamation mark: ![alt text](image-url).

Code uses backticks. Inline code gets single backticks: `variable`. Code blocks get triple backticks (or tildes) on their own lines, optionally followed by a language identifier:

```python
result = extract(html, output_format="markdown")
```
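
The language identifier is also what makes fenced blocks easy to pull out of a document by machine. Below is a line-based Python sketch with a hypothetical function name. It accepts a closing fence that uses the same character and is at least as long as the opening one, silently drops an unclosed block, and does not handle fences nested inside lists or blockquotes.

```python
import re


def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each fenced code block."""
    blocks, fence, lang, buffer = [], None, "", []
    for line in markdown.splitlines():
        stripped = line.strip()
        if fence is None:
            # Opening fence: three or more backticks or tildes, then an optional language.
            opening = re.match(r"^(`{3,}|~{3,})\s*(\S*)", stripped)
            if opening:
                fence, lang, buffer = opening.group(1), opening.group(2), []
        elif stripped.startswith(fence) and set(stripped) == {fence[0]}:
            # Closing fence: only fence characters, at least as many as the opening.
            blocks.append((lang, "\n".join(buffer)))
            fence = None
        else:
            buffer.append(line)
    return blocks


tick = "`" * 3  # built at runtime so this example contains no literal fence line
doc = "\n".join(["Intro", tick + "python", "print('hi')", tick, "", "~~~", "plain", "~~~"])
print(extract_code_blocks(doc))
```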

Blockquotes use > at the start of each line, mirroring email quoting conventions.

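Lists use -, *, or + for unordered items and numbers followed by a period for ordered items. In Gruber's original Markdown the actual numbers don't matter, because renderers emit an HTML ordered list that numbers itself regardless of what you type¹. CommonMark keeps that behavior with one exception: the first number sets where the list starts.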

That's essentially the whole language. You can write a complete technical document with just these elements. The simplicity is deliberate.

Why LLMs love Markdown

Here's where things get interesting for anyone working with language models.

A large fraction of high-quality LLM training data comes from sources that use Markdown natively: GitHub repositories (READMEs, documentation, issues, pull requests), Stack Overflow posts, technical blogs built on static site generators, and documentation platforms like ReadTheDocs and GitBook. Models like GPT-4, Claude, and Llama have processed billions of tokens of Markdown during training, which means they've internalized the format's conventions deeply⁶.

When an LLM sees ## Methodology in its context window, it doesn't just recognize two hash characters — it understands that what follows is a section heading, that it's subordinate to the preceding # heading, and that it governs the paragraphs below it until the next heading of equal or higher level. The model learned this from seeing millions of examples. Try getting the same structural understanding from raw HTML tags buried in noise.
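
The scoping rule described here (a heading governs everything until the next heading of equal or higher level) is simple enough to express as a stack. This Python sketch uses an illustrative function name, only looks at `#`-style headings, and does not skip headings that appear inside code fences.

```python
import re

# One to six hashes, whitespace, the title, and an optional closing run of hashes.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def heading_paths(markdown: str) -> list[list[str]]:
    """For each heading, return the chain of enclosing headings ending with itself."""
    stack: list[tuple[int, str]] = []  # (level, title) of sections still open
    paths = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # A new heading closes every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths


doc = "# Paper\n\n## Methodology\n\n### Sampling\n\n## Results\n"
for path in heading_paths(doc):
    print(" > ".join(path))
```

In this example "Results" ends up directly under "Paper": it closes both "Sampling" and "Methodology", which is the behavior the paragraph above describes.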

Token efficiency is the other factor. HTML is verbose by nature. A simple heading costs you <h2>, the text, </h2>, plus likely some class attributes and wrapper divs, which can easily add 20 or more tokens of overhead per heading. In Markdown, the same heading is ## plus the text: one or two tokens of overhead, depending on the tokenizer. Across a typical article, converting HTML to Markdown reduces token count by anywhere from 30% to 70%, depending on how tag-heavy the original page was⁷.
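
A rough way to see the gap is to count markup pieces for the same heading in both formats. The Python sketch below uses a crude proxy (every word and every punctuation mark counts as one token) rather than a real model tokenizer, so the absolute numbers are only indicative. The sample HTML is invented for the comparison.

```python
import re


def rough_tokens(text: str) -> int:
    """Crude token proxy: each word and each punctuation mark counts as one."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def overhead_saved(html: str, markdown: str) -> float:
    """Fraction of the HTML token count that disappears after conversion."""
    return 1 - rough_tokens(markdown) / rough_tokens(html)


html = '<div class="section"><h2 class="article-heading">Methodology</h2></div>'
markdown = "## Methodology"

print(rough_tokens(html), rough_tokens(markdown))
print(f"{overhead_saved(html, markdown):.0%} fewer tokens by this proxy")
```

A single heading exaggerates the effect, because body text costs the same in both formats. Over a whole article the saving depends on the ratio of markup to prose.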

For retrieval-augmented generation (RAG) pipelines, this is more than an optimization; it's a practical necessity. If you're stuffing 10 retrieved chunks into a prompt, Markdown's low overhead means more actual content per context window.
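
The effect on a fixed budget can be sketched in a few lines of Python. The packing strategy (take ranked chunks in order until one no longer fits), the whitespace word count used as a stand-in for tokens, and the sample chunks are all assumptions made for illustration.

```python
def pack_chunks(chunks: list[str], budget: int, count=lambda s: len(s.split())) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected


# The same ten retrieved passages, once as Markdown and once as HTML.
md_chunks = [f"## Section {i}\n\nSome retrieved text." for i in range(10)]
html_chunks = [
    f'<div class="chunk"> <h2 class="title"> Section {i} </h2> <p> Some retrieved text. </p> </div>'
    for i in range(10)
]

budget = 40
print(len(pack_chunks(md_chunks, budget)), "Markdown chunks fit")
print(len(pack_chunks(html_chunks, budget)), "HTML chunks fit")
```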

The `/llms.txt` proposal

In September 2024, Jeremy Howard — co-founder of Answer.AI and creator of the fast.ai deep learning library — published a proposal for a new convention: /llms.txt⁸.

The idea is straightforward. Websites that want to be LLM-friendly place a file at /llms.txt (analogous to /robots.txt) containing a curated, Markdown-formatted overview of the site's content. The file includes an H1 title, an optional summary in a blockquote, descriptive sections, and H2-delimited lists of links to Markdown versions of key pages⁹.
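
Because the layout is this constrained, a file can be read with a few string checks. Here is a minimal Python sketch of such a reader. The function name and the sample content are invented, the optional note after the link (`: details`) reflects my understanding of the proposal, and only the first blockquote line is kept as the summary.

```python
import re

# "- [title](url)" optionally followed by ": note".
LINK = re.compile(r"^[-*]\s+\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text: str) -> dict:
    """Extract the title, summary, and per-section link lists from an llms.txt file."""
    result = {"title": None, "summary": None, "sections": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            section = line[3:].strip()
            result["sections"][section] = []
        elif section is not None:
            match = LINK.match(line)
            if match:
                result["sections"][section].append(
                    {"title": match.group(1), "url": match.group(2), "note": match.group(3) or ""}
                )
    return result


sample = "\n".join([
    "# Example Project",
    "",
    "> A short summary.",
    "",
    "## Docs",
    "",
    "- [Quick start](https://example.com/start.md): First steps",
    "- [API](https://example.com/api.md)",
])
parsed = parse_llms_txt(sample)
print(parsed["title"], list(parsed["sections"]))
```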

The choice of Markdown as the format wasn't accidental. Howard's rationale was that Markdown is "the most widely and easily understood format for language models"⁸. The format is simultaneously human-readable, LLM-readable, and machine-parseable using standard tools — you can process an /llms.txt file with regex, a Markdown parser, or just read it with your eyes.

The proposal also suggests that sites make individual pages available as Markdown by appending .md to the URL. So example.com/docs/getting-started would also be available at example.com/docs/getting-started.md.
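
Deriving the Markdown URL is a one-line transformation, with one wrinkle for directory-style URLs. The Python sketch below appends `.md` to the path while preserving any query string. For paths that end in a slash it appends `index.html.md`, which is how I read the proposal, so treat that branch as an assumption. `markdown_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Return the URL where a Markdown version of the page would be expected."""
    parts = urlsplit(page_url)
    path = parts.path
    if path == "" or path.endswith("/"):
        # Assumed rule for URLs without a file name.
        path = (path or "/") + "index.html.md"
    elif not path.endswith(".md"):
        path += ".md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(markdown_url("https://example.com/docs/getting-started"))
# prints "https://example.com/docs/getting-started.md"
```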

Adoption has been growing. Anthropic added llms.txt and llms-full.txt to Claude's documentation by November 2024⁹. Documentation platforms like Mintlify and ReadTheDocs added built-in support. The proposal effectively codified what many developers were already doing informally — providing Markdown as the preferred format for machine consumption.

Markdown in content extraction

Content extraction — pulling the actual article text out of a web page while discarding navigation, ads, footers, and boilerplate — is the problem that tools like Trafilatura solve. And when you're extracting content for downstream NLP or LLM use, the output format matters.

Plain text is the simplest output: just the words, no structure. Good for embeddings, bad for anything that needs to understand document hierarchy. JSON wraps the extracted text with metadata fields (author, date, title), which is great for structured pipelines but adds token overhead from the field delimiters. HTML preserves the most structure but brings the most noise.

Markdown sits in the middle — and for most LLM-adjacent use cases, it's the sweet spot. You keep the heading hierarchy, the list structure, the links, and the tables, but the formatting overhead is negligible. A ## heading costs almost nothing compared to <h2 class="article-heading">.

That difference shows up in the format comparison: Markdown adds only about 10% token overhead versus plain text, while HTML adds 50% or more. For content extraction pipelines feeding into RAG systems, that efficiency gap means you can fit substantially more content into each LLM call.

The HTML to Markdown conversion article covers the specific tools and approaches for turning web pages into clean Markdown — from rule-based converters like Turndown to full extraction engines.

How contextractor handles Markdown

Contextractor extracts with the Rust port of Trafilatura — the original Python algorithm reimplemented in Rust and called from TypeScript through a napi-rs binding, so there's no Python runtime involved. Markdown is the default output format. When you extract content without specifying a format, you get Markdown.

The extractor preserves text formatting when the output format is Markdown — headings, bold/italic emphasis, lists, and links all survive the extraction process and appear as proper Markdown syntax in the output. Metadata (title, author, date, source URL) is tracked separately and can be emitted alongside the Markdown for pipelines that need provenance.

Contextractor exposes Markdown output through both its interfaces:

Apify actor — The save array controls which formats are written and where, using format-destination tokens (destination is dataset or kvs). The default is ["markdown-kvs"], so Markdown is saved to the key-value store out of the box:

{
  "startUrls": [{ "url": "https://example.com/" }],
  "save": ["markdown-kvs"]
}

Web playground — The Markdown checkbox in the output format selector is checked by default. The extracted content appears in the output panel with .md extension.
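
For readers scripting the actor input, the format-destination tokens in the `save` array can be checked before a run. This Python sketch is not Contextractor's own validation code; `parse_save_token` is a hypothetical helper. It only verifies the destination part, since `dataset` and `kvs` are the two destinations named above, and leaves the format name unchecked.

```python
DESTINATIONS = {"dataset", "kvs"}


def parse_save_token(token: str) -> tuple[str, str]:
    """Split a token such as 'markdown-kvs' into (format, destination)."""
    fmt, sep, destination = token.rpartition("-")
    if not sep or not fmt or destination not in DESTINATIONS:
        raise ValueError(f"expected <format>-<dataset|kvs>, got {token!r}")
    return fmt, destination


actor_input = {
    "startUrls": [{"url": "https://example.com/"}],
    "save": ["markdown-kvs"],
}
print([parse_save_token(token) for token in actor_input["save"]])
# prints [('markdown', 'kvs')]
```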

Making Markdown the default wasn't a controversial decision. For most users extracting web content — whether for feeding LLMs, building knowledge bases, archiving articles, or populating documentation — Markdown gives the best balance of structure preservation and compactness. If you're doing embeddings and don't need structure, switch to plain text. If you need machine-parseable metadata, use JSON. But start with Markdown, and you'll rarely need to change.

What Markdown doesn't do

Markdown has real limits. Complex tables with merged cells, column spans, or nested tables don't work, because Markdown's pipe-delimited table syntax is strictly grid-based. Mathematical notation requires extensions (LaTeX via $ delimiters, supported by some parsers but not part of CommonMark or GFM). Footnotes are likewise an extension: GitHub and several other tools render them, but neither the CommonMark spec nor the GFM spec defines them, so support is not universal. And there's no native way to do things like colored text, custom layouts, or embedded interactive widgets.

For content extraction specifically, the main limitation is that Markdown can't represent everything that exists in HTML. If a page uses <details>/<summary> elements, complex forms, or semantic HTML5 sectioning elements like <aside> and <figure> with captions, those nuances get flattened or lost in the Markdown conversion.

But for the purpose of preserving the content of a web page — the text, its structure, and its references — Markdown captures nearly everything that matters and discards nearly everything that doesn't.

A format that aged well

Most technologies from 2004 are either dead, unrecognizable, or maintained by a single person out of stubbornness. Markdown is none of those. Its survival isn't just luck — the core design decision of using plain-text characters that visually resemble their rendered output turned out to be more durable than anyone expected.

The ## heading looks like a heading even without rendering. A - item list looks like a list. A > quote looks indented. These aren't arbitrary syntax choices; they're visual metaphors that work in any context — a terminal, an email, a code review, a chat message, or an LLM's context window.

That visual self-evidence is probably why Markdown won the format wars for LLM input. Not because someone decided it should be the standard, but because it was already everywhere in the training data, already readable by models, and already compact enough to be practical. The /llms.txt proposal just made official what was already true.

Gruber built something for bloggers in 2004. Swartz helped shape it at 17. MacFarlane gave it a proper spec a decade later. Howard picked it as the format for AI-readable web content another decade after that. None of them were solving the same problem, but they all reached for the same tool.

Citations

1. John Gruber: Markdown. Daring Fireball. Retrieved April 14, 2026.
2. Aaron Swartz: Aaron Swartz's A Programmable Web. Retrieved April 14, 2026.
3. John MacFarlane: CommonMark Spec. Version 0.31.2. Retrieved April 14, 2026.
4. Jeff Atwood: Standard Markdown is now Common Markdown. Coding Horror, September 2014. Retrieved April 14, 2026.
5. GitHub: GitHub Flavored Markdown Spec. Version 0.29-gfm. Retrieved April 14, 2026.
6. Why LLMs Love Markdown — The Best Format for AI Processing. Craft Markdown. Retrieved April 14, 2026.
7. How to Reduce LLM Token Usage: A Practical Engineering Guide. Web2MD. Retrieved April 14, 2026.
8. Jeremy Howard: /llms.txt — a proposal to provide information to help LLMs use websites. Answer.AI, September 3, 2024. Retrieved April 14, 2026.
9. The /llms.txt file. Retrieved April 14, 2026.

Updated: July 5, 2026