Library
Firecrawl vs. Contextractor
Firecrawl bundles crawling, JS rendering, and extraction into a single API with credit-based pricing. Contextractor focuses purely on content extraction using Trafilatura, the top-ranked extractor in independent academic benchmarks. The right choice depends on whether you need a full-service platform or a focused tool you can compose into your own pipeline.
Building a RAG Pipeline with Clean Web Data
Most RAG tutorials skip the hardest part of the pipeline: getting clean text out of web pages. This walkthrough covers the full path from URL to vector database — crawling with Crawlee, extracting with Trafilatura, chunking, embedding, and storing in Qdrant. It also explains why skipping extraction pollutes your vector space with boilerplate and quietly degrades every query result.
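The chunking step in the middle of that pipeline can be sketched in a few lines. This is a minimal fixed-size chunker with overlap; the sizes are illustrative defaults, not recommendations from the article, and real pipelines often split on sentence or heading boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping character windows for embedding.

    Overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Stand-in for clean extractor output; in the real pipeline this would
# come from Trafilatura, not raw HTML.
article = "Clean extracted text goes here, one sentence at a time. " * 40
chunks = chunk_text(article)
```

Each chunk then gets embedded and written to the vector store; because the input is already boilerplate-free, no navigation or cookie-banner text ends up in the index.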
Web Scraping and the Law in 2026
The legal landscape around web scraping changed significantly between 2021 and 2025. US courts narrowed the CFAA to exclude public data scraping, robots.txt became an IETF standard with regulatory weight under the EU AI Act, and GDPR enforcement against scrapers accelerated. If you're building extraction pipelines, understanding these shifts is no longer optional — it's compliance infrastructure.
The Web Scraping Stack in 2026
Web scraping is a seven-layer pipeline — proxy, rendering, crawling, extraction, formatting, storage, and AI integration — and picking the wrong tool at any layer costs you money or data quality. This guide maps the best tools at each layer, compares all-in-one platforms like Firecrawl and Bright Data against composable stacks built on Crawlee and Contextractor, and breaks down real costs at 10K, 100K, and 1M pages per month.
Cookie Consent Handling for Web Scrapers
Cookie consent banners inject dialog markup into the DOM, contaminate extracted text with "accept cookies" boilerplate, and can block entire pages behind consent walls. Handling them requires a two-layer approach: network-level blocking with filter lists like EasyList Cookie (via @ghostery/adblocker-playwright) to prevent CMP scripts from loading, and DOM-level interaction with tools like autoconsent for anything that slips through. Contextractor uses the Ghostery filter list approach in its Apify pipeline, which covers the majority of consent dialogs without per-site configuration.
Output Format Showdown: Plain Text vs. Markdown vs. XML-TEI for AI Pipelines
Plain text, Markdown, HTML, JSON, and XML-TEI each preserve different levels of structure from extracted web content — and each costs a different number of tokens. Markdown adds just 10% overhead while keeping headings and lists intact, making it the default for most LLM work. Cleaned HTML can outperform plain text for table-heavy RAG tasks, and JSON is the natural fit when your pipeline needs structured metadata fields.
MCP and Web Extraction — Connecting Scrapers to AI Agents
The Model Context Protocol gives AI agents a standard way to discover and call web extraction tools without custom integration code. MCP servers from Firecrawl and Apify already expose scraping capabilities to Claude Desktop, Cursor, and VS Code. Instead of building scraping pipelines, agents pick the right extraction tool at runtime and get clean content back over JSON-RPC.
Trafilatura vs. Readability vs. Newspaper4k
Trafilatura, readability-lxml, and Newspaper4k are Python's three main content extraction libraries, but they don't do the same thing. Trafilatura leads on F1 accuracy (0.958) with seven output formats and a fallback extraction chain. Newspaper4k is built for news articles with built-in NLP. readability-lxml gives you cleaned HTML and nothing else.
Content Extraction Benchmark 2026 — F1 Scores Across 10 Tools
We benchmarked 10 content extraction tools — Trafilatura, Readability, newspaper4k, goose3, ReaderLM-v2, Firecrawl, Crawl4AI, jusText, readability-lxml, and Contextractor — across news, documentation, forum, and e-commerce pages. Trafilatura leads overall F1 on the ScrapingHub benchmark (0.958), but no single tool wins every category. The full content extraction benchmark breaks down where each tool excels and where it falls apart.
Beyond News Articles — Extracting from Docs, Forums, and E-Commerce
Content extraction tools are optimized for news articles and blog posts, but real-world pipelines need to handle developer documentation, forum threads, e-commerce product pages, and wikis. Each page type requires different Trafilatura configuration strategies, and some are better handled with targeted scraping than general-purpose extraction.
What LLMs Actually See: How HTML Preprocessing Impacts AI Response Quality
The median web page weighs 870KB of HTML — roughly 223K tokens — but the actual article text is typically under 2K tokens. How you preprocess that HTML before feeding it to an LLM determines whether the model spends its context window on content or on navigation menus, ad containers, and tracking scripts. Heuristic extractors like Trafilatura score 0.958 F1 on content extraction benchmarks, while recent research (HtmlRAG, WWW 2025) suggests that pruned HTML with structure preserved can outperform plain text for certain RAG tasks.
Heuristic vs. ML-Powered Extraction — Trafilatura vs. Jina ReaderLM
Trafilatura uses a multi-stage heuristic pipeline with fallback algorithms — no ML, no GPU, single-digit milliseconds per page. Jina's ReaderLM-v2 is a 1.54B-parameter transformer trained specifically on HTML-to-Markdown conversion; it offers better structural fidelity but requires a GPU and runs orders of magnitude slower. The SIGIR 2023 benchmark found heuristic extractors still outperform neural models on content extraction, though ReaderLM-v2 excels at preserving tables, nested lists, and document formatting that heuristics tend to flatten.
The Apify Actor Pattern for Content Extraction at Scale
Apify Actors are serverless programs packaged as Docker images that accept JSON input and produce structured dataset output — a clean contract for running content extraction at scale. The pattern handles proxy rotation, session management, request queuing, and retry logic so your extraction code can focus on parsing HTML. Contextractor is built as an Actor, using Trafilatura for extraction inside this infrastructure.
Crawlee + Contextractor — Building a Full-Stack Extraction Pipeline
Crawlee handles the hard parts of web crawling — request queues, proxy rotation, session management, and JavaScript rendering. Contextractor handles content extraction. Wire them together on Apify and you get a pipeline that goes from seed URLs to clean, structured text in a dataset, with no per-site selectors to maintain.
HTML to Markdown for AI — Comparing 8 Conversion Approaches
Converting HTML to Markdown for LLM consumption isn't one problem — it's four. Rule-based converters like Turndown faithfully transform markup but keep all the boilerplate. Content extractors like Trafilatura strip the noise first, cutting token counts by 90%+. ML models like Jina's ReaderLM-v2 produce the cleanest output but need a GPU. Full-service APIs handle JavaScript rendering and anti-bot measures on top.
Structured Data Extraction from HTML
CSS selectors and XPath extract structured data from HTML for fractions of a penny per page, but break when sites redesign. LLM-powered extraction adapts to any layout but costs 100-1000x more at scale. A hybrid pipeline — content extraction first, then LLM structuring on clean text — gets the best of both approaches while cutting LLM costs by 99%.
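The cheap, selector-based half of that hybrid looks like this. The product-page snippet and class names are invented for illustration; lxml does the work for a fraction of a penny, and the LLM is held in reserve for pages where the markup has drifted:

```python
from lxml import html

# Invented product-page snippet; the class names are illustrative.
page = html.fromstring("""
<html><body>
  <div class="product">
    <h1 class="product-title">Mechanical Keyboard MK-87</h1>
    <span class="price">$129.00</span>
    <div class="description">Hot-swappable switches, PBT keycaps.</div>
  </div>
</body></html>
""")

# Selector-based extraction: fast and nearly free, but tied to the markup.
record = {
    "title": page.xpath('string(//h1[@class="product-title"])'),
    "price": page.xpath('string(//span[@class="price"])'),
    "description": page.xpath('string(//div[@class="description"])'),
}
# In a hybrid pipeline, a content extractor first reduces the page to clean
# text, and an LLM structures only that text when selectors break, so the
# expensive model stays off the hot path.
```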
Anti-Bot Detection in 2026 — Five Layers Scrapers Must Navigate
Modern anti-bot systems stack five defense layers — from IP reputation checks to JavaScript proof-of-work puzzles. Headless browsers trip detection checks at all five layers, while lightweight HTTP extraction with content extractors like Trafilatura sidesteps most of them entirely. The trade-off matters when you're building extraction pipelines at scale.
Web Content Extraction for LLMs: The Complete Guide
Raw HTML wastes over 99% of an LLM's context window on scripts, navigation, and boilerplate — leaving almost no room for the actual content. Web content extraction strips that noise away, producing clean text that's ready for RAG pipelines, training datasets, and AI agents. The right extraction method depends on your use case, volume, and latency constraints.
Skip the Headless Browser — When Content Extraction Beats Playwright
Most scraping projects default to Playwright or Selenium when a plain HTTP request would do. HTTP-based content extraction handles 50-200 pages per second on a single core — headless browsers manage 3-5. This article walks through when you actually need a browser and when you're burning RAM for nothing, with a decision tree and resource benchmarks to settle the question.
How to Reduce LLM Token Costs by 70% with Smart HTML Cleaning
A typical web page tokenizes to 23,500 tokens, but the actual article content is only about 3,100 tokens — the rest is navigation, scripts, styles, and boilerplate. Content extraction removes this waste before you send anything to an LLM, saving 70-87% on input token costs across OpenAI, Anthropic, and Google models. At scale, that's thousands of dollars per month.
Why Content Extraction?
You fetch an HTML page and get 150KB of navigation, ads, and scripts — with the actual article buried somewhere in the middle. Content extraction isolates the readable text and throws away the noise. Heuristic tools like Trafilatura still beat neural approaches on real-world pages, which is why they power everything from RAG pipelines to browser reading modes.
Trafilatura: Web Content Extraction with Python
Trafilatura is a Python library that extracts the main content from web pages — article text, headings, and metadata — while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor is powered by Trafilatura as its extraction engine, giving you a web interface and API on top of it.