JSONL explained

Take a regular JSON array, strip the square brackets, drop the commas between objects, and put each object on its own line. That's it. That's the whole format.

{"url": "https://example.com/page-1", "title": "First page", "content": "..."}
{"url": "https://example.com/page-2", "title": "Second page", "content": "..."}
{"url": "https://example.com/page-3", "title": "Third page", "content": "..."}

JSONL (also called JSON Lines or NDJSON — Newline Delimited JSON) is a text format where every line is a self-contained, valid JSON value. No wrapping array. No commas separating records. Each line stands on its own, parseable independently of every other line in the file.

The format carries the .jsonl file extension. The NDJSON variant uses .ndjson. In practice, the two names describe the same thing with cosmetic differences in their specifications — NDJSON formally recommends the application/x-ndjson media type, while JSON Lines suggests application/jsonl [1][2]. Most tools accept both extensions without complaint.

The spec (all of it)

The JSON Lines specification lives at jsonlines.org, maintained by Ian Ward. His GitHub username is wardi, and the spec is hosted as a simple GitHub Pages site [3]. The entire specification fits in three rules:

  • Text encoding is UTF-8. A byte order mark must not be included.
  • Each line is a valid JSON value. Objects and arrays are most common, but technically any JSON value — strings, numbers, booleans, even null — is allowed.
  • The line separator is \n. The \r\n variant is tolerated because JSON parsers already ignore surrounding whitespace.

That's the whole thing. No versioning, no profiles, no optional extensions. The page includes a few recommendations (use gzip compression, prefer .jsonl as the extension, include a trailing newline after the last record) but the core format is three rules long.
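The rules are simple enough to translate almost directly into code. A minimal checker, sketched in Python (splitlines() also accepts the tolerated \r\n separator):

```python
import json

def validate_jsonl(text):
    """Check a JSONL document against the spec's core rule: every
    line must parse as a JSON value. Returns (line_number, error)
    pairs for lines that fail; an empty list means the text is valid.
    """
    errors = []
    # splitlines() handles both \n and the tolerated \r\n separator.
    for lineno, line in enumerate(text.splitlines(), start=1):
        try:
            json.loads(line)  # any JSON value is allowed, not just objects
        except json.JSONDecodeError as exc:
            errors.append((lineno, str(exc)))
    return errors

doc = '{"a": 1}\n"just a string"\n42\nnot json\n'
print(validate_jsonl(doc))  # flags only line 4
```

Note that lines 2 and 3 of the sample pass: a bare string and a bare number are valid JSON values, which is all rule two requires.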

There's a separate NDJSON specification on GitHub, authored by Thorsten Hoeger, Chris Dew, Finn Pauls, and Jim Wilson [2]. It adds one meaningful constraint: each JSON text must conform to RFC 8259 [4]. Since RFC 8259 is the current JSON standard anyway, that's more of a clarification than a restriction.

I find something refreshing about a format whose specification you can read in under two minutes.

Why not just use a JSON array?

The obvious question. If you already know JSON, you already know how to put things in an array:

[
  {"url": "https://example.com/page-1", "content": "..."},
  {"url": "https://example.com/page-2", "content": "..."},
  {"url": "https://example.com/page-3", "content": "..."}
]

This works fine for small datasets. For large ones — hundreds of thousands of records, or files that grow over time — JSON arrays have structural problems that JSONL sidesteps entirely.

Appending is destructive with JSON arrays. To add a new record to a JSON array, you need to open the file, find the closing ], insert a comma after the last existing object, write the new object, and re-add the ]. Or more commonly, you load the entire array into memory, append to it, and serialize the whole thing back to disk. With JSONL, appending means opening the file in append mode and writing one line. That's a single write() call with no reads.
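In Python, that single-call append looks like this (data.jsonl is a hypothetical output file):

```python
import json

def append_record(path, record):
    # Append mode: one serialized line per call. Existing bytes on
    # disk are never read or rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_record("data.jsonl", {"url": "https://example.com/page-4", "title": "New page"})
```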

Parsing requires the whole file with JSON arrays. A JSON array is a single syntactic unit — the [ at the beginning and the ] at the end are part of the same value. A standards-compliant parser can't give you the first element until it's verified the closing bracket exists. Some streaming JSON parsers can work around this (SAX-style event parsers exist), but they're working against the grain of the format. JSONL doesn't have this problem. Line one is parseable the moment you've read line one.

Error recovery is simpler with JSONL. If line 847 of a JSONL file contains malformed JSON, you can skip it and keep processing. If a JSON array has a syntax error at byte 1,200,000, the entire file is invalid. The parser can't reliably determine where one element ends and the next begins, because the comma-separated structure is part of the syntax.
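A sketch of that skip-and-continue pattern in Python:

```python
import json

def read_records(path):
    """Yield parsed records, skipping any line that fails to parse."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                print(f"skipping malformed line {lineno}")

# One bad line in the middle doesn't poison the rest of the file.
with open("mixed.jsonl", "w", encoding="utf-8") as f:
    f.write('{"id": 1}\noops not json\n{"id": 2}\n')
print([r["id"] for r in read_records("mixed.jsonl")])  # [1, 2]
```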

Concatenation is trivial with JSONL. To merge two JSONL files, you concatenate them: cat a.jsonl b.jsonl > combined.jsonl. Merging two JSON arrays requires parsing both, combining the arrays, and re-serializing. Or some creative sed work that nobody wants to maintain.

|                        | JSON array                            | JSONL                     |
|------------------------|---------------------------------------|---------------------------|
| Append a record        | Rewrite file or seek to end and edit  | Append one line           |
| Read first record      | Parse entire structure                | Read first line           |
| Malformed record       | Entire file invalid                   | Skip bad line             |
| Merge two files        | Parse, combine, rewrite               | cat a.jsonl b.jsonl       |
| File is valid JSON     | Yes                                   | No                        |
| Memory for 10 GB file  | ~10 GB+                               | One line at a time        |

The trade-off is that a JSONL file is not valid JSON. You can't just hand it to JSON.parse() or json.loads(). But that turns out to be a feature, not a bug — it forces you into a line-by-line processing pattern that's almost always what you actually want for large data.

Line-by-line processing

The killer feature of JSONL is that standard Unix text tools already understand lines. You don't need special software.

Count records:

wc -l data.jsonl

Get the first 10 records:

head -n 10 data.jsonl

Filter records with jq:

jq -c 'select(.status == "success")' data.jsonl

jq deserves a special mention here. It processes one JSON value at a time by default, which means it handles JSONL natively — no --slurp flag needed, no special mode [5]. Every line gets parsed independently. For a 50 GB log file, jq will use roughly the same memory as it would for a single line, because it never holds more than one record at a time.

In Python, reading JSONL is a loop:

import json

with open("data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # process record

No json.load() on the whole file. No building a list in memory. Just one line, one parse, one process, repeat. The jsonlines Python library wraps this pattern with a nicer API and adds validation, but the standard library is all you need [6].

In Node.js, the readline module does the same thing:

const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('data.jsonl')
});

rl.on('line', (line) => {
  const record = JSON.parse(line);
  // process record
});

There's nothing sophisticated happening in any of these examples. That's the point. JSONL doesn't require a special parser or library. It's JSON plus newlines, and every language already knows how to deal with both of those things.

JSONL in AI and machine learning

Here's where JSONL went from "convenient format" to "industry standard."

OpenAI requires JSONL for fine-tuning data. Not accepts — requires. When you upload training data for supervised fine-tuning of GPT models, it must be a .jsonl file where each line contains a messages array with role/content pairs [7]. The format looks like this:

{"messages": [{"role": "system", "content": "You are an assistant."}, {"role": "user", "content": "What's 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "system", "content": "You are an assistant."}, {"role": "user", "content": "Capital of France?"}, {"role": "assistant", "content": "Paris"}]}

Each line is one training example. Each line is independently valid. The file can be validated line by line before uploading, and if one example is malformed, you know exactly which line to fix.
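That line-by-line validation is easy to sketch. The checker below tests only the shape described above — a messages array of role/content pairs — and is an illustration, not OpenAI's official validator:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_finetune_file(path):
    """Return (line_number, problem) pairs for lines that don't match
    the messages/role/content shape. An empty list means every line
    looks structurally OK.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "not valid JSON"))
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append((lineno, "missing messages array"))
                continue
            if any(not isinstance(m, dict)
                   or m.get("role") not in VALID_ROLES
                   or "content" not in m
                   for m in messages):
                problems.append((lineno, "bad message entry"))
    return problems

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write('{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}\n')
    f.write('{"prompt": "wrong shape"}\n')
print(check_finetune_file("train.jsonl"))  # flags line 2
```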

Hugging Face's datasets library loads JSONL files with load_dataset('json', data_files='train.jsonl') — notably using the 'json' loader name, not 'jsonl', because the library auto-detects the line-delimited format [8]. Thousands of datasets on the Hugging Face Hub ship as JSONL files.

The pattern makes sense for training data. ML datasets are naturally collections of independent examples. Each training example doesn't reference or depend on any other. The order might matter for some training procedures, but the parsing doesn't — you never need to see example #500 to understand example #1. JSONL matches this structure exactly.

OpenAI's Evals framework uses JSONL. Anthropic's documentation recommends JSONL for batch API inputs. When you're processing millions of training examples, the ability to stream them without loading everything into memory isn't optional — it's table stakes.

JSONL in logging and analytics

Before JSONL became the AI training format, it was already the default for structured logging.

The Elasticsearch Bulk API accepts NDJSON — it's how you index documents at scale [9]. Each pair of lines consists of an action metadata object followed by the document source. The format requires that the final line end with a newline, and that no line contains pretty-printed JSON (because the newlines inside a formatted object would break the line-delimited structure). When you're pushing tens of thousands of documents per second into an Elasticsearch cluster, the ability to stream them without assembling a single massive request body matters.
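Assembling such a bulk body is just string joining. A minimal sketch — the index name and documents here are made up, and a real request would also need the appropriate Content-Type header:

```python
import json

def bulk_body(index, docs):
    """Build an NDJSON bulk body: an action line, then a source line,
    per document, with the required trailing newline at the end.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        # The source must stay on one line: no pretty-printing.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = bulk_body("pages", [{"title": "First"}, {"title": "Second"}])
print(body)
```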

Logstash has a dedicated json_lines codec for reading newline-delimited log entries from TCP streams and files [10]. Graylog's GELF format uses JSON Lines for log message streams. Kubernetes writes audit logs in JSON Lines format.

The pattern is the same everywhere: when you have a stream of structured events — log entries, metrics, audit records — and you need to write them as they arrive without buffering the entire batch, JSONL is the natural container. Open file, append line, flush. No state to manage, no partial-write corruption risk (assuming individual lines fit in a single write syscall, which they almost always do).
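That open-append-flush loop fits in a few lines of Python (events.jsonl and the field names are hypothetical):

```python
import json
import time

def log_event(path, **fields):
    # One event, one line, written and closed immediately: the file
    # on disk is valid JSONL after every call, even if the process
    # dies right afterwards.
    event = {"ts": time.time(), **fields}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

log_event("events.jsonl", level="info", msg="service started")
```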

JSONL for web scraping

Web scraping might be where JSONL's append-friendly nature matters most.

A crawler visits pages one at a time (or a few hundred at a time, with concurrency). Each page produces one result. The crawl might take hours and might be interrupted — network failures, rate limits, the machine running out of disk space. Whatever the reason, you want the results you've already collected to be safe.

With a JSON array output, you'd need to keep the file in a valid state throughout the crawl. That means either holding the entire array in memory and writing it at the end (losing everything if the crawl crashes), or doing careful file surgery to maintain the [ ... ] wrapper as you go. Scrapy, one of the most widely used Python scraping frameworks, chose JSON Lines as its default output format for exactly this reason — each scraped item writes as a new line, immediately, without touching existing data [10].

If a crawl gets interrupted after 10,000 pages, you have 10,000 valid JSONL lines on disk. Restart the crawl, skip the URLs you already have (a simple jq -r '.url' output.jsonl gives you the list), and append the new results to the same file. Try doing that with a JSON array.
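A resumable-crawl skeleton in Python, under the assumption that each line carries a url field (crawl.jsonl and the URL list are hypothetical, and the fetch is a stand-in):

```python
import json

def seen_urls(path):
    """Collect the URLs already written by a previous, possibly
    interrupted run."""
    urls = set()
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                urls.add(json.loads(line)["url"])
    except FileNotFoundError:
        pass  # first run: nothing crawled yet
    return urls

done = seen_urls("crawl.jsonl")
for url in ["https://example.com/a", "https://example.com/b"]:
    if url in done:
        continue  # already safe on disk from the last run
    result = {"url": url, "content": "..."}  # stand-in for a real fetch
    with open("crawl.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")
```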

How contextractor uses JSONL

Contextractor extracts content from web pages using Trafilatura, outputting clean Markdown-formatted text by default. When you want batch output from a multi-page crawl, JSONL is one of the available output formats.

The CLI flag is --save jsonl:

contextractor --save jsonl https://example.com/blog/

Each extracted page becomes one line in an output.jsonl file, structured as a JSON object with the URL and the extracted content:

{"url": "https://example.com/blog/post-1", "content": "# First Post\n\nThe extracted markdown content..."}
{"url": "https://example.com/blog/post-2", "content": "# Second Post\n\nMore extracted content here..."}

The content field contains Markdown-formatted text — not raw HTML and not plain text. Contextractor first extracts the main content from each page (stripping navigation, ads, cookie banners, and boilerplate), converts it to Markdown with heading structure, links, and lists preserved, and then wraps that Markdown in a JSON object alongside the source URL. The JSONL output is a collection of these {url, content} pairs, one per line, one per page.

This means you can pipe the output directly into an AI pipeline — the Markdown is already in the format that LLMs handle best, and the JSON wrapper gives you the source URL for attribution and deduplication.
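Reading that output back is the same loop as any other JSONL file. A sketch that deduplicates by URL, using a written-on-the-spot sample in place of a real crawl result (field names match the output shown above):

```python
import json

# Stand-in for a contextractor crawl result, including one duplicate URL.
with open("output.jsonl", "w", encoding="utf-8") as f:
    f.write('{"url": "https://example.com/blog/post-1", "content": "# First Post"}\n')
    f.write('{"url": "https://example.com/blog/post-2", "content": "# Second Post"}\n')
    f.write('{"url": "https://example.com/blog/post-1", "content": "# First Post"}\n')

# Deduplicate by URL: the first occurrence wins.
pages = {}
with open("output.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pages.setdefault(record["url"], record["content"])

print(len(pages))  # 2 unique pages
```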

A config file works the same way:

{
  "urls": ["https://example.com/blog/"],
  "save": ["jsonl"],
  "crawlDepth": 1,
  "outputDir": "./extracted"
}

One thing to note: JSONL output is CLI-only in Contextractor. The web-based playground and the Apify actor don't support JSONL — those interfaces return results individually, not as appended batches. JSONL makes sense when you're crawling dozens or hundreds of pages through the command line and want a single output file you can process later.

For a single-page extraction, JSONL doesn't buy you much over plain Markdown or JSON output. Where it shines is the multi-page case: crawl an entire documentation site, append each page's extracted content as a JSONL line, and you end up with a single file that's both streamable and greppable.

Processing JSONL in practice

Some common patterns for working with JSONL files from extraction or scraping runs.

Extract all URLs from a crawl result:

jq -r '.url' output.jsonl

Find pages matching a pattern:

jq -c 'select(.url | test("blog"))' output.jsonl

Count total extracted pages:

wc -l output.jsonl

Convert JSONL to a JSON array (when downstream tools need one):

jq -s '.' output.jsonl > output.json

That last one is worth noting — jq -s (slurp mode) reads all lines into a single array. It's the escape hatch when you need standard JSON, but it loads everything into memory, which defeats the purpose for very large files.
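When the file is too big to slurp, the conversion can be streamed instead. The sketch below writes the array brackets and commas around the existing lines, holding one line in memory at a time:

```python
import json

def jsonl_to_json_array(src, dst):
    """Stream-convert a JSONL file to a JSON array without loading
    the whole input, unlike jq -s."""
    with open(src, encoding="utf-8") as f_in, \
         open(dst, "w", encoding="utf-8") as f_out:
        f_out.write("[")
        first = True
        for line in f_in:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # fail fast on malformed input
            if not first:
                f_out.write(",")
            f_out.write(line)
            first = False
        f_out.write("]")

with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write('{"a": 1}\n{"a": 2}\n')
jsonl_to_json_array("sample.jsonl", "sample.json")
print(json.load(open("sample.json")))  # [{'a': 1}, {'a': 2}]
```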

Python filtering with early termination:

import json

with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "/blog/" in record["url"]:
            print(record["content"][:200])

Node.js streaming with backpressure:

const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('output.jsonl')
});

rl.on('line', (line) => {
  const { url, content } = JSON.parse(line);
  if (content.length > 1000) {
    console.log(`Long page: ${url} (${content.length} chars)`);
  }
});

None of these examples require installing anything beyond the standard library of your language. That's not an accident — it's the whole design philosophy.

Where JSONL doesn't make sense

JSONL isn't always the right choice. The format has clear weak spots.

Random access is terrible. Want the 500th record? You need to read and skip 499 lines. JSON arrays aren't great at this either (without an index), but databases and formats like Parquet are. If you need to query by field values or jump to specific records, JSONL is the wrong tool — load it into SQLite or a columnar format first.

Nested or relational data is awkward. Each line is independent. If your data has parent-child relationships — say, a blog post and its comments — you either denormalize (repeat the post data on every comment line) or use some convention for relating lines to each other. Neither is elegant. JSONL works best when your records are genuinely independent.

Schema enforcement is nonexistent. Nothing in the format says that every line must have the same fields. Line one might have {"url": "...", "content": "..."} and line two might have {"id": 42, "name": "Bob"}. That's valid JSONL. If you need schema guarantees, you need to enforce them yourself or use a format that has schemas built in, like Avro or Protocol Buffers.
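A minimal enforcement layer, assuming a hypothetical schema that requires url and content on every line:

```python
import json

REQUIRED = {"url", "content"}  # a hypothetical schema for a crawl file

def conforming(path):
    """Yield only records carrying every required field. JSONL itself
    enforces nothing, so this check has to live in your code."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if REQUIRED <= record.keys():
                yield record

# Both lines below are valid JSONL; only the first matches the schema.
with open("records.jsonl", "w", encoding="utf-8") as f:
    f.write('{"url": "u1", "content": "c1"}\n{"id": 42, "name": "Bob"}\n')
print(len(list(conforming("records.jsonl"))))  # 1
```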

Compression isn't built in. The JSON Lines spec recommends gzip (.jsonl.gz), and most tools handle it fine, but the format itself is plain text. For very large datasets, the overhead of repeating field names on every line — "url" appears on every single line — adds up. Parquet and other columnar formats handle this much better through dictionary encoding and column-level compression.
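Reading and writing the recommended .jsonl.gz needs nothing beyond the standard library — gzip.open in text mode behaves like an ordinary file of lines:

```python
import gzip
import json

# Write three records into a gzip-compressed JSONL file.
with gzip.open("data.jsonl.gz", "wt", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"url": f"https://example.com/page-{i}"}) + "\n")

# Read it back the usual way: one line, one record.
with gzip.open("data.jsonl.gz", "rt", encoding="utf-8") as f:
    count = sum(1 for _ in f)
print(count)  # 3
```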

The ecosystem

The list of tools that speak JSONL keeps growing. Apache Spark reads and writes it natively (their JSON data source assumes line-delimited by default) [11]. BigQuery requires newline-delimited JSON for data loading — not regular JSON [12]. ClickHouse has a JSONEachRow format that's functionally identical. Neo4j has JSONL import/export procedures. Kubernetes writes audit logs in it. Shopify's GraphQL Bulk Operations API returns results in it [10].

The convergence is striking. These tools were built by different teams, in different languages, for different purposes, and they all independently arrived at "one JSON object per line" as the right streaming format. Not because anyone mandated it, but because the alternatives — JSON arrays for streaming, CSV for structured data, XML for anything at all — all have worse trade-offs for the use cases these tools care about.

JSONL might be one of the least interesting formats in computing. It's JSON, but with newlines. The spec is three rules. There's no committee, no working group, no versioning drama. And that lack of complexity is precisely why it keeps winning. When you need to write structured data one record at a time, read it back one record at a time, and not worry about the state of the file between those operations, there just isn't a simpler option.

Citations

  1. Ian Ward: JSON Lines. Retrieved April 14, 2026.

  2. Thorsten Hoeger, Chris Dew, Finn Pauls, Jim Wilson: NDJSON Specification. Retrieved April 14, 2026.

  3. Ian Ward: jsonlines GitHub repository. Retrieved April 14, 2026.

  4. RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format. Retrieved April 14, 2026.

  5. jqlang: jq Manual. Retrieved April 14, 2026.

  6. Wolph: jsonlines Python library. Retrieved April 14, 2026.

  7. OpenAI: Supervised fine-tuning. Retrieved April 14, 2026.

  8. Hugging Face: Loading a Dataset. Retrieved April 14, 2026.

  9. Elastic: Bulk API. Retrieved April 14, 2026.

  10. Ian Ward: JSON Lines — On the Web. Retrieved April 14, 2026.

  11. Apache Spark: JSON Files. Retrieved April 14, 2026.

  12. Google Cloud: Loading JSON data from Cloud Storage. Retrieved April 14, 2026.

Updated: April 14, 2026