JSON explained

JSON (JavaScript Object Notation) is a text-based data interchange format built on two structures: key-value pairs and ordered lists. It's derived from a subset of JavaScript — specifically, the object literal syntax defined in ECMAScript 3rd Edition (December 1999) — but it isn't tied to JavaScript at all [1]. Parsers exist in practically every language you can think of, and quite a few you can't.

The format is small enough to fit on the back of a business card. Six types, a handful of punctuation rules, and that's it. Which is probably why it won.

A format discovered, not invented

Douglas Crockford is very particular about the phrasing. He didn't invent JSON — he discovered it. "I found it, I named it, I described how it was useful," he's said in multiple talks [2]. The distinction matters to him because the syntax was already sitting there in JavaScript, and had been since 1999. Anyone could have pulled it out and given it a name.

The actual discovery happened in April 2001. Crockford and Chip Morningstar were building single-page web applications at a company called State Software — doing AJAX before anyone called it AJAX. They needed to pass data from the server to the browser after the initial page load, and the options were terrible. Internet Explorer supported a primitive form of XMLHttpRequest, but Netscape 4 didn't. Their workaround was to return an HTML document containing a <script> tag that executed a JavaScript object literal and passed it to a callback function [3].

That first message was just JavaScript.

There's a fun detail in the story: one of their early messages used do as a key name. Since do is a reserved word in JavaScript, the parser choked on it. Crockford's fix was to require all keys to be quoted strings — which is why JSON keys always need double quotes, even though JavaScript object literals don't [3]. A pragmatic workaround to a parser bug became a permanent rule in a global standard. That kind of accident shapes more of computing than anyone likes to admit.

In 2002, Crockford bought the domain json.org and posted the grammar along with a reference parser implementation. The site still looks more or less the same today — a single page with railroad diagrams and a list of implementations in dozens of languages [4].

The standardization journey

JSON's path from a webpage to an Internet Standard took sixteen years and four specifications, which is either impressively fast or absurdly slow depending on your standards body of reference.

RFC 4627 (July 2006) — Crockford himself wrote the original specification. It was published as "Informational," not as a standard, which meant it described existing practice but didn't carry normative weight. The document was deliberately short — around seven pages — matching the format's philosophy of minimalism [5].

ECMA-404 (October 2013) — Ecma International published JSON as a formal standard. This one was even shorter than RFC 4627. The committee's explicit goal was to standardize the syntax without adding anything to it. No extensions, no optional features, no versioning. JSON was already everywhere, and the standard was essentially a snapshot of what json.org had described since 2002 [6].

RFC 7159 (March 2014) — Tim Bray edited a revision that replaced RFC 4627 and fixed several ambiguities. The biggest practical change: RFC 4627 had required that a JSON text must be an object or array at the top level. RFC 7159 relaxed this — a standalone string, number, boolean, or null became a valid JSON text [7]. Most parsers had already allowed it.

RFC 8259 (December 2017) — The current specification, also edited by Tim Bray, elevated JSON to a full Internet Standard (STD 90). The key addition: JSON exchanged between systems that aren't part of a closed ecosystem must be encoded as UTF-8. Not "should," not "may" — must. The spec also added stronger language about duplicate object keys (the behavior is "unpredictable" if names aren't unique) and noted that IEEE 754 double-precision is the practical baseline for number interoperability [8]. RFC 8259 references ECMA-404 normatively, and the two organizations committed to keeping the documents aligned going forward.
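The UTF-8 rule is easy to honor in practice. A minimal Python sketch of the round trip (the payload field names here are illustrative, not from any spec):

```python
import json

# ensure_ascii=False emits non-ASCII characters directly instead of
# \uXXXX escapes; encoding the resulting string as UTF-8 gives exactly
# the byte form RFC 8259 requires for open interchange.
payload = {"name": "José", "city": "Zürich"}
wire = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# The receiving side reverses it: bytes -> str -> objects.
assert json.loads(wire.decode("utf-8")) == payload
assert "José" in wire.decode("utf-8")  # characters survived, unescaped
```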

Four specs, sixteen years, and the format itself barely changed. The grammar Crockford put on json.org in 2002 is functionally identical to the Internet Standard from 2017.

Six types, zero surprises

JSON supports exactly six data types:

  • String — double-quoted Unicode text with backslash escaping
  • Number — integer or floating-point, no hex, no leading zeros, no NaN or Infinity
  • Boolean — true or false, lowercase only
  • Null — null, lowercase
  • Object — unordered set of key-value pairs wrapped in {}
  • Array — ordered sequence of values wrapped in []

That's the complete type system. It's intentionally missing several things that cause endless arguments.

No date type. Dates get shoved into strings, usually as ISO 8601 ("2026-04-14T10:30:00Z"), but there's no formal convention. Some APIs use Unix timestamps as numbers. Some use "April 14, 2026". You find out which one when something breaks in production.
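Because the date arrives as a plain string, the consumer has to know the convention and parse it explicitly. A small Python sketch, assuming an ISO 8601 timestamp in a hypothetical "at" field:

```python
import json
from datetime import datetime

# Nothing in JSON marks "at" as a date -- the ISO 8601 convention
# exists only by agreement between producer and consumer.
doc = json.loads('{"event": "deploy", "at": "2026-04-14T10:30:00+00:00"}')
when = datetime.fromisoformat(doc["at"])  # str -> timezone-aware datetime

assert when.year == 2026 and when.tzinfo is not None
```

(Note that `datetime.fromisoformat` only accepts the trailing "Z" shorthand from Python 3.11 onward, which is why the explicit "+00:00" offset is used here.)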

No comments. Crockford removed them deliberately. People had started using comments to embed parsing directives — instructions that only specific parsers understood — which would have destroyed interoperability. His take was that if you need comments, you're probably using JSON as a configuration format, and configuration formats should be a different thing [9]. The ecosystem mostly disagreed: VS Code uses JSONC (JSON with comments), TypeScript's tsconfig.json supports comments, and JSON5 added both comments and trailing commas. But the official spec remains comment-free.

No trailing commas. Every trailing comma your editor strips from a JSON file is the spec doing its job. JavaScript allows them; JSON does not.
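Both rules are easy to see against a strict parser such as Python's json module:

```python
import json

def parses(text):
    """Return True if text is valid JSON under a strict parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

assert parses('[1, 2, 3]')
assert not parses('[1, 2, 3,]')      # trailing comma: fine in JS, not in JSON
assert not parses("{'a': 1}")        # single-quoted strings: not in the grammar
assert not parses('{"a": 1} // x')   # comments: also not in the grammar
```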

No undefined. JavaScript's undefined has no JSON equivalent. Serializing an object property with an undefined value silently drops it. This occasionally causes bugs that are maddening to track down.

No binary data. If you need to embed binary content, you Base64-encode it into a string. It works, but it inflates the payload by roughly 33%.
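The 33% figure falls straight out of the encoding: Base64 spends four characters for every three bytes. A quick sketch (the "filename"/"data" envelope is an illustrative convention, not a standard):

```python
import base64
import json
import os

# Base64 maps every 3 input bytes to 4 output characters: ~33% larger.
blob = os.urandom(3000)
encoded = base64.b64encode(blob).decode("ascii")
doc = json.dumps({"filename": "image.png", "data": encoded})

assert len(encoded) == 4000  # 3000 bytes -> 4000 Base64 characters
# Round trip: the original bytes come back intact.
assert base64.b64decode(json.loads(doc)["data"]) == blob
```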

The format war with XML

The mid-2000s JSON-versus-XML debate was one of those arguments that felt enormous at the time and now looks inevitable in hindsight.

In February 2005, Jesse James Garrett published an essay coining the term "AJAX" — Asynchronous JavaScript and XML. He pointed to Gmail, Google Maps, and Flickr as examples. In a follow-up Q&A, Garrett himself noted that "there's no reason you couldn't accomplish the same effects using a technology like JavaScript Object Notation" instead of XML [10]. The "X" in AJAX was already optional before the term caught on.

In December 2005, Yahoo began offering some of its web services in JSON. Developers noticed that parsing JSON in the browser was trivially easy (just call eval() on it — more on why that was a terrible idea shortly), while parsing XML required walking a DOM tree. A ten-line XML response needed forty lines of DOM traversal code. The same data in JSON was just... an object. You could dot-notation your way to any field.

Dave Winer, a prominent XML advocate, reacted in 2006 with the memorable line: "Who did this travesty? Let's find a tree and string them up. Now." Crockford's response: "The good thing about reinventing the wheel is that you can get a round one." [3]

By 2013, Twitter had dropped its XML API entirely, serving only JSON [3]. The war was over.

Where XML still wins: document markup (XML's original purpose), namespaces (JSON has nothing comparable), schemas with mixed content (text interspersed with markup), and industries where XML is deeply embedded — XBRL for financial reporting, HL7 FHIR for healthcare (which actually uses both JSON and XML), SVG for vector graphics, SOAP in legacy enterprise systems. For moving data between services, though, JSON is the default. It has been for over a decade.

Parsing JSON: from eval() to JSON.parse()

The early history of JSON parsing in browsers is a case study in convenience trumping security.

Before browsers had native JSON support, the standard way to parse JSON in JavaScript was eval(). You'd receive a string from the server, wrap it in parentheses (because {} is a statement block in JavaScript but ({}) is an expression), and evaluate it. Fast, easy, and completely insane — because eval() executes arbitrary JavaScript. A malicious server (or a man-in-the-middle) could send back deleteEverything() instead of {"name": "Alice"} and the browser would happily run it [11].

Crockford wrote json2.js, a polyfill that used eval() guarded by regex checks to reject obviously dangerous input. Even that approach had bypass vulnerabilities — a proof-of-concept exploit was reported in 2008 that could trigger arbitrary code execution in Firefox 2 [11].

Native JSON.parse() arrived with ECMAScript 5 in December 2009 [12]. Unlike eval(), it only parses JSON syntax — no function calls, no assignments, no prototype pollution through parsing alone. Browser support rolled out quickly: IE8, Firefox 3.5, Chrome 3, Safari 4. The eval-based approach died fast once a safe alternative existed.

On the server side, JSON parsing is fast because the grammar is trivial. Six types, no ambiguity, no schema required. A recursive descent parser for JSON runs a few hundred lines of code in most languages. But "fast" and "safe" aren't the same thing. JSON bombs — deeply nested arrays like [[[[[[[[...]]]]]]]] or extremely long strings — can exhaust memory or stack depth. The Kubernetes API server had a CVE in 2019 (CVE-2019-11253) where malicious JSON/YAML payloads caused excessive CPU and memory consumption [13]. Setting depth limits and size caps on incoming JSON is boring but necessary.
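One way to apply those limits is to scan for nesting depth and payload size before handing the text to the real parser. A rough sketch — the depth/size thresholds are arbitrary placeholders, not recommendations:

```python
import json

def max_bracket_depth(text: str) -> int:
    """Cheap pre-parse scan: deepest [] / {} nesting outside string literals."""
    depth = peak = 0
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            peak = max(peak, depth)
        elif ch in "]}":
            depth -= 1
    return peak

def safe_loads(text, max_depth=100, max_bytes=10_000_000):
    """Reject oversized or absurdly nested payloads before parsing."""
    if len(text) > max_bytes:
        raise ValueError("payload too large")
    if max_bracket_depth(text) > max_depth:
        raise ValueError("nesting too deep")
    return json.loads(text)

bomb = "[" * 50_000 + "]" * 50_000   # valid JSON, but a stack-depth attack
assert safe_loads('{"ok": [1, 2, 3]}') == {"ok": [1, 2, 3]}
try:
    safe_loads(bomb)
    raise AssertionError("bomb was not rejected")
except ValueError:
    pass  # rejected before json.loads() ever saw it
```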

Duplicate keys are another parsing landmine. RFC 8259 says object names "SHOULD be unique" but doesn't require it. Different parsers handle duplicates differently — some take the first value, some take the last, some throw an error. A 2022 Bishop Fox study tested 49 parsers across multiple languages and found that every language had at least one parser with potentially risky interoperability behavior around key handling [14].
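Python's standard parser, for instance, silently keeps the last value — but its object_pairs_hook parameter sees every pair, so a stricter policy is easy to bolt on:

```python
import json

# Default behavior: last value wins, with no warning.
assert json.loads('{"role": "user", "role": "admin"}') == {"role": "admin"}

# object_pairs_hook receives the pairs in document order, so duplicates
# can be detected and rejected instead of silently overwritten.
def reject_duplicates(pairs):
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

raised = False
try:
    json.loads('{"role": "user", "role": "admin"}',
               object_pairs_hook=reject_duplicates)
except ValueError:
    raised = True
assert raised
```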

JSON Schema: giving structure to the structureless

JSON itself has no schema language. You can put anything anywhere and it'll parse fine. JSON Schema is a separate specification — not part of any RFC — that lets you describe what a valid JSON document looks like for a given use case.

The current version is Draft 2020-12, which replaced Draft 2019-09 [15]. It defines a vocabulary for specifying types, required properties, value constraints, string patterns, array tuple validation, and conditional schemas. Here's a trivial example:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "author": { "type": "string" },
    "date": { "type": "string", "format": "date" }
  },
  "required": ["title"]
}
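To make the mechanics concrete, here is a toy validator covering only the "type", "properties", and "required" keywords from the schema above — real applications would use a full implementation such as the jsonschema library, which also handles "format", "pattern", and the rest of the vocabulary:

```python
import json

def check(instance, schema):
    """Toy JSON Schema subset: 'type', 'properties', 'required' only."""
    types = {"object": dict, "array": list, "string": str,
             "number": (int, float), "boolean": bool}
    t = schema.get("type")
    if t and not isinstance(instance, types[t]):
        return False
    if t == "object":
        for name in schema.get("required", []):   # required: must be present
            if name not in instance:
                return False
        for name, sub in schema.get("properties", {}).items():
            if name in instance and not check(instance[name], sub):
                return False  # present properties must match their subschema
    return True

schema = json.loads("""{
  "type": "object",
  "properties": {"title": {"type": "string"}},
  "required": ["title"]
}""")
assert check({"title": "Understanding Gradient Descent"}, schema)
assert not check({"author": "Jane Chen"}, schema)   # missing required title
assert not check({"title": 42}, schema)             # wrong type for title
```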

The practical uses go beyond validation. API documentation generators (like the OpenAPI Specification) use JSON Schema to describe request and response bodies. Code generators produce TypeScript interfaces, Python dataclasses, or Go structs from schema files. Form builders render UIs from schemas.

JSON Schema is also how OpenAI's Structured Outputs feature works. When you supply a JSON Schema to the API with strict: true, the model is constrained to only produce output that validates against your schema. On complex schema-following benchmarks, GPT-4o with Structured Outputs scores 100% — compared to under 40% for earlier models without the constraint [16]. It's an elegant use of a validation spec as a generation spec.

JSON in AI and ML pipelines

JSON has become the connective tissue of the AI ecosystem, not because anyone designed it for that purpose, but because it was already there when the ecosystem needed a wire format.

Model configuration files — hyperparameters, architecture definitions, training metadata — are overwhelmingly JSON or YAML (which is, technically, a superset of JSON as of YAML 1.2). Hugging Face model cards embed metadata as JSON. PyTorch and TensorFlow checkpoint manifests use JSON. The OpenAI, Anthropic, and Google AI APIs all accept and return JSON.

The structured outputs trend is worth paying attention to. When an LLM generates free-form text, downstream systems need regex, prompt engineering, or hope to extract structured data from it. When the same model generates validated JSON, the extraction step vanishes. Every major LLM provider now offers some form of JSON mode or structured output — OpenAI (August 2024), Anthropic, Google, and open-source frameworks like Outlines and Instructor [16].

For content extraction specifically, JSON plays a different role: it's the envelope. When you extract the main content from a web page, the raw text is only part of the value. You also want the title, author, publication date, site name, and source URL — metadata that the extraction process discovers alongside the content. Wrapping all of it in a JSON object gives downstream consumers a single parseable unit instead of loose files with implicit relationships.

How contextractor handles JSON output

Contextractor uses Trafilatura for content extraction and supports JSON as one of its output formats. The JSON output wraps the extracted content alongside structured metadata fields:

{
  "title": "Understanding Gradient Descent",
  "author": "Jane Chen",
  "date": "2026-03-15",
  "sitename": "ML Engineering Blog",
  "source": "https://example.com/gradient-descent",
  "text": "Gradient descent is an optimization algorithm..."
}

The fields Trafilatura extracts include title, author, date, sitename, source, categories, and tags — though not every page has all of them. Pages without a detectable author simply omit the field. The text field contains the extracted main content stripped of boilerplate, navigation, ads, and cookie banners.

There are three ways to get JSON output from contextractor:

CLI — pass --save json (or include json in a comma-separated list of formats with --save markdown,json). The output goes to a .json file alongside whatever other formats you selected.

Apify actor — set saveExtractedJsonToKeyValueStore to true in the actor input. The JSON file is stored in the key-value store, and a URL to it appears in each dataset item.

Web playground — select "JSON with metadata" as the output format. The result appears in the output panel with the metadata fields visible above the extracted text.

The JSON format costs roughly 40% more tokens than plain text extraction because of the metadata wrapper and JSON's "key": "value" syntax overhead. For embedding pipelines where only the semantic content matters, plain text is cheaper. For anything that needs to know who wrote it and when, the metadata envelope pays for itself. The format comparison guide breaks down the trade-offs in detail.

JSON vs. JSONL: one object or many?

Standard JSON wraps everything in a single structure — one object, one array, one root element. That's fine for a single API response. It falls apart when you're processing a million records, because you'd need to parse the entire array into memory before you can access the first item.

JSONL (JSON Lines, also called Newline Delimited JSON) solves this by putting one JSON object per line, separated by \n. No wrapping array, no commas between records. Each line is independently parseable [17].

{"url": "https://example.com/a", "title": "Page A", "text": "..."}
{"url": "https://example.com/b", "title": "Page B", "text": "..."}
{"url": "https://example.com/c", "title": "Page C", "text": "..."}

The trade-off is simple. JSON is better for single documents, API responses, and configuration files — anywhere you need a complete, self-contained structure. JSONL is better for batch processing, streaming, log files, and data pipelines — anywhere you're dealing with a potentially unbounded sequence of records.

Contextractor supports both. --save json gives you one JSON file per extracted page with metadata. --save jsonl gives you one .jsonl file with all pages as separate lines — useful for piping into tools like jq, loading into databases, or feeding into AI batch processing pipelines. For the relationship between structured extraction and downstream data use, see the structured data extraction guide.

OpenAI's Batch API accepts JSONL for bulk inference. BigQuery and Snowflake support JSONL ingestion natively. Docker and Logstash store logs as JSONL. The format has quietly become the standard for any system that needs to process records one at a time without loading everything into memory first.

What Crockford got right

The best design decision in JSON wasn't any particular feature — it was the decision to stop adding features. Crockford drew a line: no comments, no trailing commas, no extensions, no versioning. The format shipped in 2002 and the grammar hasn't changed since. That stubbornness is exactly why JSON works as a lingua franca. Every parser on every platform in every language agrees on what valid JSON looks like, because there's nothing to disagree about.

That doesn't mean JSON is perfect for every job. It's terrible for configuration (no comments), wasteful for binary data (Base64 overhead), and its lack of a date type has launched a thousand StackOverflow questions. But the scope was never "handle everything." The scope was "move data between programs." And for that, it's been the right tool for over two decades.

The grammar still fits on the back of a business card.

Citations

  1. Ecma International: ECMAScript Language Specification, 3rd Edition. December 1999

  2. Douglas Crockford: The Discovery of JSON. Retrieved April 14, 2026

  3. Two-Bit History: The Rise and Rise of JSON. Retrieved April 14, 2026

  4. Douglas Crockford: Introducing JSON. Retrieved April 14, 2026

  5. RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON). July 2006

  6. Ecma International: ECMA-404 The JSON Data Interchange Standard. October 2013

  7. RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format. March 2014

  8. RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format. December 2017

  9. Douglas Crockford, via Hacker News discussion: Why I removed comments from JSON. Retrieved April 14, 2026

  10. Jesse James Garrett: Ajax: A New Approach to Web Applications. February 18, 2005

  11. Mike Samuel: A web security story from 2008: silently securing JSON.parse. Retrieved April 14, 2026

  12. Ecma International: ECMAScript Language Specification, 5th Edition. December 2009

  13. Kubernetes: CVE-2019-11253: Kubernetes API Server JSON/YAML parsing vulnerable to resource exhaustion attack. October 2019

  14. Bishop Fox: An Exploration & Remediation of JSON Interoperability Vulnerabilities. Retrieved April 14, 2026

  15. JSON Schema: Draft 2020-12. Retrieved April 14, 2026

  16. OpenAI: Introducing Structured Outputs in the API. August 6, 2024

  17. JSON Lines: JSON Lines Format. Retrieved April 14, 2026

Updated: April 14, 2026