Library
Cookie Consent Handling for Web Scrapers
Cookie consent banners inject dialog markup into the DOM, contaminate extracted text with "accept cookies" boilerplate, and can block entire pages behind consent walls. Handling them requires a two-layer approach: network-level blocking with filter lists like EasyList Cookie (via @ghostery/adblocker-playwright) to prevent CMP scripts from loading, and DOM-level interaction with tools like autoconsent for anything that slips through. Contextractor uses the Ghostery filter list approach in its Apify pipeline, which covers the majority of consent dialogs without per-site configuration.
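The network-layer idea reduces to a URL predicate over filter patterns. A minimal sketch follows, using a few illustrative CMP hostnames (real filter lists like EasyList Cookie carry thousands of rules, and the patterns below are examples, not the actual list); the trailing comment shows how such a predicate would plug into Playwright's standard `page.route` API.

```python
import re

# Illustrative patterns only -- a real pipeline loads thousands of
# rules from EasyList Cookie via an adblock engine.
CMP_PATTERNS = [
    r"cookielaw\.org",           # OneTrust CDN
    r"consent\.cookiebot\.com",  # Cookiebot
    r"cdn\.cookie-script\.com",  # CookieScript
]
CMP_RE = re.compile("|".join(CMP_PATTERNS))

def is_cmp_request(url: str) -> bool:
    """Return True if the URL matches a known consent-platform pattern."""
    return bool(CMP_RE.search(url))

# In Playwright, the predicate would gate a route handler, e.g.:
#   page.route("**/*", lambda route: route.abort()
#              if is_cmp_request(route.request.url) else route.continue_())
```

Blocking at the network layer means the consent dialog never renders, so there is nothing for the extractor to strip later.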
Output Format Showdown: Plain Text vs. Markdown vs. XML-TEI for AI Pipelines
Plain text, Markdown, HTML, JSON, and XML-TEI each preserve different levels of structure from extracted web content — and each costs a different number of tokens. Markdown adds just 10% overhead while keeping headings and lists intact, making it the default for most LLM work. Cleaned HTML can outperform plain text for table-heavy RAG tasks, and JSON is the natural fit when your pipeline needs structured metadata fields.
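The overhead claim can be sanity-checked with a crude token proxy (splitting on word and punctuation boundaries); real counts depend on the model's tokenizer, such as tiktoken for OpenAI models, and the exact percentage varies with how heading- and list-dense the document is.

```python
import re

def rough_tokens(text: str) -> int:
    """Crude token proxy: count word runs and individual punctuation
    marks. A stand-in for a real model tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))

plain = "Setup\nInstall the package, then run the tests."
markdown = "## Setup\n\nInstall the package, then run the tests.\n"

# Markdown's extra cost is just the structural characters (##, -, *).
overhead = rough_tokens(markdown) / rough_tokens(plain) - 1
```

On a toy snippet like this the markup characters dominate; over a full article the structural tokens amortize to the roughly 10% figure above.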
Trafilatura vs. Readability vs. Newspaper4k
Trafilatura, readability-lxml, and Newspaper4k are Python's three main content extraction libraries, but they don't do the same thing. Trafilatura leads on F1 accuracy (0.958) with seven output formats and a fallback extraction chain. Newspaper4k is built for news articles with built-in NLP. readability-lxml gives you cleaned HTML and nothing else.
Heuristic vs. ML-Powered Extraction — Trafilatura vs. Jina ReaderLM
Trafilatura uses a multi-stage heuristic pipeline with fallback algorithms — no ML, no GPU, single-digit milliseconds per page. Jina's ReaderLM-v2 is a 1.54B-parameter transformer trained specifically on HTML-to-Markdown conversion; it offers better structural fidelity but requires a GPU and runs orders of magnitude slower. The SIGIR 2023 benchmark found heuristic extractors still outperform neural models on content extraction, though ReaderLM-v2 excels at preserving tables, nested lists, and document formatting that heuristics tend to flatten.
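The core trick behind heuristic extraction is link density: long paragraphs with little link text are probably content, link-heavy blocks are probably navigation. A toy stdlib sketch (not Trafilatura's actual algorithm, which layers many more signals):

```python
from html.parser import HTMLParser

class DensityScorer(HTMLParser):
    """Toy link-density scorer: for each <p>, track total text length
    and how much of that text sits inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.blocks = []   # (text_len, link_text_len) per <p>
        self._in_p = False
        self._in_a = 0
        self._text = 0
        self._link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._text, self._link = True, 0, 0
        elif tag == "a" and self._in_p:
            self._in_a += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.blocks.append((self._text, self._link))
        elif tag == "a" and self._in_a:
            self._in_a -= 1

    def handle_data(self, data):
        if self._in_p:
            n = len(data.strip())
            self._text += n
            if self._in_a:
                self._link += n

def content_score(text_len, link_len):
    """Text length scaled down by link density."""
    if text_len == 0:
        return 0.0
    return text_len * (1 - link_len / text_len)

html = ('<p><a href="/">Home</a> <a href="/about">About</a></p>'
        '<p>Heuristic extractors score this long paragraph highly because '
        'it carries plenty of text and no links at all.</p>')
scorer = DensityScorer()
scorer.feed(html)
scores = [content_score(t, l) for t, l in scorer.blocks]
```

The nav block (all link text) scores zero; the prose paragraph scores high. Because the arithmetic is this cheap, heuristic extraction stays in single-digit milliseconds where a transformer pass takes seconds.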
HTML to Markdown for AI — Comparing 8 Conversion Approaches
Converting HTML to Markdown for LLM consumption isn't one problem — it's four. Rule-based converters like Turndown faithfully transform markup but keep all the boilerplate. Content extractors like Trafilatura strip the noise first, cutting token counts by 90%+. ML models like Jina's ReaderLM-v2 produce the cleanest output but need a GPU. Full-service APIs handle JavaScript rendering and anti-bot measures on top.
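To make the "rule-based" category concrete, here is a deliberately tiny converter in the spirit of Turndown (a sketch, nowhere near Turndown's actual rule set): each tag maps to a Markdown prefix, and nothing is filtered, so nav and footer markup would pass straight through.

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Minimal rule-based HTML-to-Markdown converter: a handful of
    tag-to-prefix rules and no boilerplate removal."""
    RULES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}

    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.RULES:
            self._prefix = self.RULES[tag]
        elif tag == "p":
            self._prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)
            self._prefix = ""

    def to_markdown(self):
        return "\n\n".join(self.out)

conv = MiniMarkdown()
conv.feed("<h1>Title</h1><p>Body text.</p><ul><li>One</li><li>Two</li></ul>")
md = conv.to_markdown()
```

Run on a full page, a converter like this reproduces the cookie banner and footer links verbatim — which is exactly why the extract-first approaches cut token counts so dramatically.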
Structured Data Extraction from HTML
CSS selectors and XPath extract structured data from HTML for fractions of a penny per page, but break when sites redesign. LLM-powered extraction adapts to any layout but costs 100-1000x more at scale. A hybrid pipeline — content extraction first, then LLM structuring on clean text — gets the best of both approaches while cutting LLM costs by 99%.
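The hybrid shape is easy to sketch: try the cheap selector path, and only fall back to the model when the layout has drifted. The version below stands in a string search for a real CSS selector engine (e.g. selectolax or lxml), and the LLM client is hypothetical — shown for shape only.

```python
from typing import Optional

def extract_with_selectors(html: str) -> Optional[dict]:
    """Fast path: brittle but nearly free. String search stands in
    for a real CSS selector engine here."""
    marker = '<span class="price">'
    start = html.find(marker)
    if start == -1:
        return None   # layout changed -> fall through to the LLM
    end = html.find("</span>", start)
    return {"price": html[start + len(marker):end]}

def extract_with_llm(clean_text: str) -> dict:
    """Slow path: in production this would send cleaned text (not raw
    HTML) to a model with a JSON-schema prompt. Hypothetical stub."""
    prompt = f"Return JSON {{'price': str}} extracted from:\n{clean_text}"
    # response = llm.complete(prompt)   # hypothetical client
    raise NotImplementedError(prompt)

def extract(html: str, clean_text: str) -> dict:
    return extract_with_selectors(html) or extract_with_llm(clean_text)
```

The cost asymmetry drives the design: the selector path runs on every page, and the model only sees the small fraction where selectors miss — and even then only the cleaned text, not the full markup.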
Skip the Headless Browser — When Content Extraction Beats Playwright
Most scraping projects default to Playwright or Selenium when a plain HTTP request would do. HTTP-based content extraction handles 50-200 pages per second on a single core — headless browsers manage 3-5. This article walks through when you actually need a browser and when you're burning RAM for nothing, with a decision tree and resource benchmarks to settle the question.
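One cheap signal from that decision tree (a single heuristic, not the whole tree): fetch the raw HTML once and measure how much of it is visible text. Server-rendered pages score high; JavaScript shells — an empty `<div id="root">` plus script tags — score near zero and genuinely need a browser. The 0.05 threshold is an illustrative assumption.

```python
import re

def text_ratio(html: str) -> float:
    """Share of the payload that is visible text once scripts, styles,
    and tags are stripped."""
    no_scripts = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", "", no_scripts)
    return len(text.strip()) / max(len(html), 1)

def probably_needs_browser(html: str, threshold: float = 0.05) -> bool:
    """Heuristic: almost no visible text suggests client-side rendering."""
    return text_ratio(html) < threshold

ssr = ("<html><body><article>" + "Real article text. " * 50
       + "</article></body></html>")
spa = ('<html><body><div id="root"></div>'
       '<script src="/app.js"></script></body></html>')
```

When the check passes, a plain HTTP fetch plus extraction does the job at 50-200 pages per second; when it fails, that is the signal to reach for Playwright.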
Trafilatura: Web Content Extraction with Python
Trafilatura is a Python library that extracts the main content from web pages — article text, headings, and metadata — while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor is powered by Trafilatura as its extraction engine, giving you a web interface and API on top of it.