Library
HTML explained
HTML is what web pages are made of — a tree of nested elements that browsers render into the pages you see. Tim Berners-Lee created the first version with 18 tags at CERN in 1991. Today the WHATWG living standard defines over 100 elements. Contextractor's HTML output saves the raw page source before extraction, giving you the original markup to process however you need.
JSON explained
JSON grew from JavaScript's object-literal notation, formalized and popularized by Douglas Crockford in the early 2000s, into the most widely used data interchange format on the internet. Contextractor's JSON output wraps extracted text alongside metadata fields — title, author, date, site name, source URL — in a single structured object, ready for pipelines that need machine-parseable fields without regex.
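As a rough sketch, a single record with those fields might look like the output below; the key names are illustrative assumptions, not Contextractor's documented schema.

```python
import json

# Hypothetical record shape; key names like "sitename" and "source_url"
# are assumptions, not Contextractor's documented schema.
record = {
    "title": "Example Article",
    "author": "Jane Doe",
    "date": "2024-05-01",
    "sitename": "example.com",
    "source_url": "https://example.com/article",
    "text": "The extracted main content, free of navigation and ads.",
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```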
JSONL explained
JSONL (JSON Lines) puts one JSON object per line with no wrapper array and no commas between records. It's the format OpenAI uses for fine-tuning data, what most logging systems ingest, and what Contextractor outputs for batch crawls. Each line is independently parseable, so you can stream, append, and process files line by line without loading everything into memory.
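A minimal sketch of what that buys you in practice, assuming a batch-crawl file named crawl_output.jsonl (the filename and field names are placeholders):

```python
import json

# Stream the file one record at a time; nothing but the current line
# is ever held in memory.
with open("crawl_output.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue                  # tolerate trailing blank lines
        record = json.loads(line)     # each line is a complete JSON object
        print(record.get("title"))    # stand-in for real downstream work

# Appending a new record never touches existing lines.
with open("crawl_output.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"title": "New page", "text": "..."}) + "\n")
```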
Markdown explained
Markdown started as a Perl script by John Gruber in 2004 and became the default format for technical writing, documentation, and LLM pipelines. Its lightweight syntax preserves headings, lists, and links with minimal token overhead — roughly 10% more than plain text. Contextractor outputs Markdown by default because LLMs handle it natively, trained on millions of GitHub READMEs and Stack Overflow posts.
Plain text explained
Plain text is the simplest output format — just the extracted words with no markup, no formatting, no structural hints. It evolved from 7-bit ASCII through decades of competing code pages until UTF-8 unified everything. Contextractor's plain text output is ideal for embedding pipelines and classification tasks where every token should carry semantic meaning, not formatting syntax.
XML explained
XML emerged from SGML in the late 1990s as a universal data interchange format, dominated web services through the SOAP era, then lost the API wars to JSON. It never went away — DOCX, SVG, RSS, and Maven all run on XML. Contextractor's XML output uses a custom schema that preserves document structure with semantic element names, sitting between Markdown's simplicity and TEI's academic rigor.
XML-TEI explained
XML-TEI follows the Text Encoding Initiative standard, maintained since 1987, for encoding texts with scholarly metadata. It's the de facto format in digital humanities and corpus linguistics. Contextractor's TEI output includes a full teiHeader with bibliographic metadata and validates against the TEI schema — skipping an entire manual annotation step for researchers building web corpora.
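For a sense of the standard's shape, here is a sketch of the minimal skeleton TEI requires, built with Python's standard library. This is the generic minimal structure, not Contextractor's actual header, which carries fuller bibliographic metadata.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

def el(parent, tag, text=None):
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
    node.text = text
    return node

tei = ET.Element(f"{{{TEI_NS}}}TEI")
header = el(tei, "teiHeader")
file_desc = el(header, "fileDesc")          # the only mandatory header block
el(el(file_desc, "titleStmt"), "title", "Example Article")
el(el(file_desc, "publicationStmt"), "p", "Unpublished web extraction")
el(el(file_desc, "sourceDesc"), "p", "https://example.com/article")
body = el(el(tei, "text"), "body")
el(body, "p", "Extracted text goes here.")

ET.dump(tei)  # prints the serialized document
```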
Content Formats for LLMs — Choosing What to Feed Your AI Pipeline
Plain text, Markdown, HTML, JSON, and XML-TEI each carry different structural signals into your AI pipeline — and each costs a different number of tokens. Markdown adds roughly 10% overhead while preserving headings and lists, making it the default for most LLM work. Cleaned HTML can outperform plain text for table-heavy RAG tasks, and JSON is the natural fit when your pipeline needs structured metadata fields.
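The overhead is easy to measure on your own corpus; a quick sketch with the tiktoken tokenizer (the exact percentage varies by document):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

plain = ("Choosing a format\n"
         "Plain text drops all structure.\n"
         "Markdown keeps headings and lists.")
markdown = ("# Choosing a format\n\n"
            "- Plain text drops all structure.\n"
            "- Markdown keeps headings and lists.")

for label, text in (("plain", plain), ("markdown", markdown)):
    print(f"{label}: {len(enc.encode(text))} tokens")
```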
Cookie Consent Handling for Web Scrapers
Cookie consent banners inject dialog markup into the DOM, contaminate extracted text with "accept cookies" boilerplate, and can block entire pages behind consent walls. Handling them requires a two-layer approach: network-level blocking with filter lists like EasyList Cookie (via @ghostery/adblocker-playwright) to stop consent-management-platform (CMP) scripts from loading, and DOM-level interaction with tools like autoconsent for anything that slips through. Contextractor uses the Ghostery filter list approach in its Apify pipeline, which covers the majority of consent dialogs without per-site configuration.
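The @ghostery/adblocker-playwright package is Node-based; as a rough Python sketch of the same network-level idea using Playwright's request routing (the hostname list is a tiny illustrative stand-in for a real filter list like EasyList Cookie):

```python
from playwright.sync_api import sync_playwright

# Illustrative stand-in for a real filter list: a few hostnames of
# widely deployed consent-management platforms.
CMP_HOSTS = ("cookielaw.org", "onetrust.com", "cookiebot.com", "consensu.org")

def block_cmp(route):
    if any(host in route.request.url for host in CMP_HOSTS):
        route.abort()        # the consent script never loads
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", block_cmp)
    page.goto("https://example.com")
    html = page.content()    # DOM without the blocked CMP's dialog markup
    browser.close()
```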
Trafilatura vs. Readability vs. Newspaper4k
Trafilatura, readability-lxml, and Newspaper4k are Python's three main open-source content extraction libraries, but they don't do the same thing. Trafilatura leads on F1 accuracy (0.958) with seven output formats and a fallback extraction chain. Newspaper4k is built for news articles with built-in NLP. readability-lxml gives you cleaned HTML and nothing else.
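A side-by-side sketch of the three call patterns (the URL is a placeholder):

```python
import trafilatura
from newspaper import Article        # newspaper4k keeps the newspaper API
from readability import Document     # readability-lxml

url = "https://example.com/article"
html = trafilatura.fetch_url(url)

# Trafilatura: main text, with a choice of output formats.
text = trafilatura.extract(html, output_format="txt")

# readability-lxml: cleaned HTML of the main content, nothing else.
cleaned_html = Document(html).summary()

# Newspaper4k: news-oriented parsing with metadata attributes.
article = Article(url)
article.download()
article.parse()
print(article.title, article.publish_date)
```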
Heuristic vs. ML-Powered Extraction — Trafilatura vs. Jina ReaderLM
Trafilatura uses a multi-stage heuristic pipeline with fallback algorithms — no ML, no GPU, single-digit milliseconds per page. Jina's ReaderLM-v2 is a 1.54B-parameter transformer trained specifically on HTML-to-Markdown conversion; it offers better structural fidelity but requires a GPU and runs orders of magnitude slower. The SIGIR 2023 benchmark found heuristic extractors still outperform neural models on content extraction, though ReaderLM-v2 excels at preserving tables, nested lists, and document formatting that heuristics tend to flatten.
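To try the ML side yourself, here is a rough sketch of running ReaderLM-v2 through Hugging Face transformers; the prompt wording is an assumption, so check the model card for the exact template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

html = "<html><body><h1>Title</h1><table><tr><td>cell</td></tr></table></body></html>"
# Prompt wording is an assumption; the model card documents the template.
messages = [{"role": "user", "content": f"Convert the HTML to Markdown:\n{html}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```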
HTML to Markdown for AI — Comparing 8 Conversion Approaches
Converting HTML to Markdown for LLM consumption isn't one problem — it's four. Rule-based converters like Turndown faithfully transform markup but keep all the boilerplate. Content extractors like Trafilatura strip the noise first, cutting token counts by 90%+. ML models like Jina's ReaderLM-v2 produce the cleanest output but need a GPU. Full-service APIs handle JavaScript rendering and anti-bot measures on top.
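Turndown is a Node library; markdownify is a comparable rule-based converter on the Python side. A sketch contrasting the two philosophies on the same page:

```python
import trafilatura
from markdownify import markdownify as md

html = trafilatura.fetch_url("https://example.com/article")

# Rule-based conversion: the whole page becomes Markdown,
# navigation and footer included.
full_markdown = md(html)

# Extraction-first: isolate the main content, then emit Markdown.
article_markdown = trafilatura.extract(html, output_format="markdown")

print(len(full_markdown), len(article_markdown or ""))
```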
Structured Data Extraction from HTML
CSS selectors and XPath extract structured data from HTML for fractions of a penny per page, but break when sites redesign. LLM-powered extraction adapts to any layout but costs 100-1000x more at scale. A hybrid pipeline — content extraction first, then LLM structuring on clean text — gets the best of both approaches while cutting LLM costs by 99%.
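A sketch of that hybrid pipeline; the model name, prompt, and output keys are illustrative assumptions, not a fixed recipe.

```python
import json
import trafilatura
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

html = trafilatura.fetch_url("https://example.com/product")
clean_text = trafilatura.extract(html)  # boilerplate gone before the LLM sees a token

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{
        "role": "user",
        "content": "Return JSON with keys name, price, and currency "
                   f"for the product described below.\n\n{clean_text}",
    }],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))
```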
Skip the Headless Browser — When Content Extraction Beats Playwright
Most scraping projects default to Playwright or Selenium when a plain HTTP request would do. HTTP-based content extraction handles 50-200 pages per second on a single core — headless browsers manage 3-5. This article walks through when you actually need a browser and when you're burning RAM for nothing, with a decision tree and resource benchmarks to settle the question.
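The decision reduces to a few lines of code: try the cheap path first and escalate only on failure. A sketch, with a hypothetical length threshold standing in for a real content check:

```python
import requests
import trafilatura

def extract_fast_path(url: str, min_chars: int = 200):
    """Plain HTTP first; return None to signal that this URL
    needs a headless browser after all."""
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "Mozilla/5.0"})
    text = trafilatura.extract(resp.text)
    if text and len(text) >= min_chars:
        return text
    return None

text = extract_fast_path("https://example.com/article")
if text is None:
    ...  # escalate to Playwright; most pages never reach this branch
```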
Trafilatura: Web Content Extraction with Python
Trafilatura is an open-source Python library that extracts the main content from web pages — article text, headings, and metadata — while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor is powered by Trafilatura as its extraction engine, giving you a web interface and API on top of it.
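The core call pattern is two functions; a minimal sketch (the URL is a placeholder):

```python
import trafilatura

html = trafilatura.fetch_url("https://example.com/article")

# Main text only, boilerplate stripped.
text = trafilatura.extract(html)

# The same extraction with metadata, serialized as JSON.
record = trafilatura.extract(html, output_format="json", with_metadata=True)
print(record)
```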