The web scraping stack in 2026 -- choosing tools for each layer

Most teams start a scraping project by picking a tool. Crawlee, Scrapy, Firecrawl, maybe just raw Playwright. They write the crawler, get some HTML back, and then realize they've solved maybe two of the seven problems between "I want data from these websites" and "my LLM can use this data."

Web scraping isn't a tool. It's a pipeline. And the pipeline has layers — each with its own failure modes, its own cost curve, and its own set of tools that range from perfectly adequate to absurdly overpriced.

[Figure: the seven layers of a web scraping pipeline]

The seven layers

Here's the stack, top to bottom:

1. Proxy — getting your requests through without being blocked.
2. Rendering — executing JavaScript so you actually see the content.
3. Crawling — discovering URLs and managing the queue.
4. Extraction — pulling the useful text out of the HTML noise.
5. Formatting — shaping the output for whatever consumes it.
6. Storage — putting it somewhere durable.
7. AI integration — feeding it into LLMs, RAG pipelines, or agents.

Not every project needs all seven. A static site with 500 pages and no bot protection? You can skip proxy and rendering entirely. A React SPA behind Cloudflare? You need every single layer, and each one will cost you.

Proxy layer

The bottom of the stack. Every HTTP request you make comes from an IP address, and websites keep lists of IPs that behave like scrapers.

Residential proxies route your traffic through real ISP-assigned IPs — actual home connections. They're hard to block because they look legitimate. Bright Data (the Israeli company formerly known as Luminati) runs the largest residential proxy network and charges roughly $4-12 per GB depending on volume and targeting [1]. Oxylabs starts at $4/GB for pay-as-you-go residential [2]. Apify's built-in proxy runs about $8/GB for residential [3].

Datacenter proxies are cheaper ($0.50-3/GB) but easier to detect. Fine for APIs and cooperative targets, useless against Cloudflare or Akamai.

The honest assessment: if your targets don't actively block scrapers, you don't need proxies at all. If they do, residential proxies are a recurring cost that scales linearly with volume and there's no clever engineering trick to avoid it. Some teams try rotating through free proxy lists — that's a path to unreliable data and wasted engineering time.
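
To get a feel for how that linear cost behaves, here's a back-of-the-envelope estimator. The page size and per-GB rate are illustrative assumptions, not quotes from any provider:

```python
def proxy_cost_usd(pages: int, avg_page_kb: float, usd_per_gb: float) -> float:
    """Estimate monthly residential-proxy spend: pages * page size * rate."""
    gb = pages * avg_page_kb / (1024 * 1024)  # KB -> GB
    return gb * usd_per_gb

# 100K pages/month at ~120 KB each through an $8/GB residential proxy:
cost = proxy_cost_usd(100_000, 120, 8.0)
print(f"${cost:.2f}/month")  # roughly $91.55/month
```

The point of the exercise: bandwidth-priced proxies scale with page weight, not page count, which is why fetching only what you need (and skipping images and fonts) matters.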

For anti-bot specifics, I wrote more about how detection systems work and dealing with cookie consent.

Rendering layer

The web used to be HTML files served by Apache. Now it's JavaScript applications that render in the browser.

Playwright is the default choice for headless browser automation in 2026 — Microsoft-backed, cross-browser (Chromium, Firefox, WebKit), and the browser engine behind most crawling frameworks' headless modes [4]. Puppeteer still works but only does Chromium. Selenium is legacy at this point; it's slower, more verbose, and the API feels like it was designed by committee (it was).

Browserless and similar services run headless browsers in the cloud, so you don't manage Chrome instances yourself. Useful at scale, where running 50 concurrent browser instances on your own hardware is a real operations burden.

The rendering question boils down to: does the target site need JavaScript to show its content? If it's a news article or a blog, probably not — a raw HTTP fetch gets you the HTML. If it's a single-page app or a site that loads content dynamically, you need a browser. The throughput difference is dramatic — a plain HTTP fetch completes in tens of milliseconds, while browser rendering takes seconds per page. I covered this tradeoff in detail in extraction vs. headless browsers.
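
One cheap way to answer the question per site is to fetch the raw HTML once and check whether it carries any meaningful visible text. This is a rough stdlib sketch — `needs_rendering` and its 200-character threshold are made up for illustration, and real sites will need tuning:

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Accumulates visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def needs_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Crude check: if the raw HTML carries almost no visible text,
    the content is probably injected by JavaScript."""
    p = _TextCollector()
    p.feed(html)
    return len(" ".join(p.chunks)) < min_text_chars

spa = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
article = "<html><body><article>" + "Readable paragraph text. " * 20 + "</article></body></html>"
print(needs_rendering(spa), needs_rendering(article))  # True False
```

An empty `<div id="root">` plus a script bundle is the classic SPA signature; a server-rendered article carries its text in the first response.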

Crawling layer

Crawling is URL discovery and queue management. You start with a list of seed URLs, fetch them, find new URLs in the HTML, add those to the queue, and repeat. Sounds simple, but at scale you're dealing with rate limiting, retries, deduplication, politeness delays, and session management.
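
The loop is simple enough to sketch. This toy version runs against an in-memory link graph instead of the network, and skips the retries, rate limiting, and politeness delays a real crawler (like Crawlee or Scrapy) adds on top:

```python
from collections import deque

# Toy link graph standing in for fetched pages (no network involved).
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

def crawl(seed: str, max_depth: int = 2) -> list[str]:
    """Breadth-first URL discovery with deduplication and a depth cap."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)                # "fetch" the page
        if depth >= max_depth:
            continue
        for link in LINKS.get(url, []):  # "parse" outgoing links
            if link not in seen:         # dedupe before enqueueing
                seen.add(link)
                queue.append((link, depth + 1))
    return order

print(crawl("https://example.com/"))
```

Deduplication before enqueueing (not after dequeueing) is what keeps the queue from ballooning on link-dense sites; persisting `seen` and `queue` is what lets a crawl resume after a crash — both things the frameworks below handle for you.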

Crawlee (by Apify, MIT license) is the best option for Node.js/TypeScript teams [5]. It handles browser fingerprinting, proxy rotation, adaptive concurrency, and queue persistence. The Apify actor pattern wraps Crawlee crawlers into deployable units that run on Apify's cloud infrastructure, which is genuinely useful if you don't want to manage your own scraping servers.

Scrapy dominates the Python ecosystem. It's been around since 2008, it's battle-tested, and the middleware system is flexible. But it doesn't do browser rendering natively — you need Scrapy-Playwright or Scrapy-Splash for that.

Crawl4AI is the newcomer that gathered 50,000+ GitHub stars faster than any scraping tool before it [6]. It's aimed squarely at AI use cases — crawl a site, get LLM-ready Markdown. The v0.8.x releases added anti-bot detection with proxy escalation and shadow DOM support. It's opinionated (it bundles extraction and formatting into the crawler), which is either a feature or a limitation depending on your architecture.

I put together a walkthrough of how Crawlee and Contextractor work together for teams that want to keep crawling and extraction as separate concerns.

Extraction layer

This is the layer most people underestimate. You've fetched the HTML — now what? A typical news article sits inside maybe 2KB of useful text buried in 150KB of navigation, ad containers, tracking scripts, and footer legalese. Just stripping HTML tags and keeping the text doesn't work; you end up with menu labels mashed into article paragraphs.

Content extraction isolates the main text from the surrounding noise. Trafilatura is the top-performing open-source extractor — a Python library that runs a heuristic pipeline with fallback to Readability and jusText, achieving an F1 of 0.958 on the ScrapingHub benchmark [7]. Mozilla's Readability (the algorithm behind Firefox Reader View) scores close behind with the highest median F1 in the SIGIR 2023 study [8].
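
To make the heuristic idea concrete, here's a toy extractor that keeps paragraphs with low link density — one of the signals these pipelines use. It is not Trafilatura's actual algorithm, just an illustration of why boilerplate (navigation, footers) is mechanically separable from body text:

```python
from html.parser import HTMLParser

class DensityExtractor(HTMLParser):
    """Toy main-content extractor: keep <p> blocks whose link density is low.
    Navigation is mostly links; article prose mostly isn't."""
    def __init__(self, max_link_density: float = 0.5):
        super().__init__()
        self.max_link_density = max_link_density
        self.in_p = False
        self.in_a = 0
        self.text = self.links = ""
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.text, self.links = True, "", ""
        elif tag == "a":
            self.in_a += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            self.in_a -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            if self.text and len(self.links) / len(self.text) <= self.max_link_density:
                self.paragraphs.append(self.text.strip())
    def handle_data(self, data):
        if self.in_p:
            self.text += data
            if self.in_a:
                self.links += data

def extract(html: str) -> str:
    p = DensityExtractor()
    p.feed(html)
    return "\n\n".join(p.paragraphs)

html = """
<nav><p><a href="/">Home</a> <a href="/about">About</a></p></nav>
<article><p>The actual article text, long enough to matter,
with <a href="/ref">one inline link</a> among real prose.</p></article>
"""
print(extract(html))  # only the article paragraph survives
```

The nav paragraph is nearly 100% link text, so it gets dropped; the article paragraph has one inline link in a run of prose, so it survives. Production extractors layer many more signals (text density, tag depth, metadata, language models) on top of this.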

Contextractor uses Trafilatura as its extraction engine and wraps it in an API — give it a URL or raw HTML, get back clean text with metadata. For AI pipelines specifically, the difference between feeding raw HTML and extracted content into an LLM is enormous. Raw HTML burns tokens on markup that carries zero semantic value. A page that's 223K tokens as HTML might be 3K tokens after extraction [9].

The benchmark comparison has the full numbers, but the short version: heuristic extractors still beat neural models on heterogeneous web pages, and Trafilatura has the best overall mean across independent evaluations.

Formatting layer

Extraction gives you clean text. Formatting shapes it for whatever comes next.

The dominant format for LLM consumption is Markdown — it preserves headings, lists, and basic structure while staying token-efficient. HTML-to-Markdown conversion is a whole subtopic with more edge cases than you'd expect (nested tables, code blocks inside lists, image alt text preservation).
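
A stripped-down converter makes the edge-case problem visible: the sketch below handles only headings, paragraphs, and flat lists, and everything it ignores (nested lists, tables, code blocks, inline formatting) is exactly where real converters earn their keep. All names are illustrative:

```python
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    """Minimal HTML-to-Markdown: headings, paragraphs, flat lists only.
    Real converters must also handle nesting, tables, and code blocks."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""
    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "
        elif tag == "p":
            self.prefix = ""
    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "li", "p"):
            self.prefix = ""  # block finished
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""  # only prefix the first text run of a block

def to_markdown(html: str) -> str:
    c = MarkdownConverter()
    c.feed(html)
    return "\n".join(c.out)

print(to_markdown("<h2>Layers</h2><p>Seven of them.</p><ul><li>Proxy</li><li>Rendering</li></ul>"))
```

Even this tiny version already has to decide what "one block" means — which is the root of most converter edge cases.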

For structured data pipelines, JSON or XML make more sense. Academic workflows often need TEI-XML. The choice of output format depends entirely on the downstream consumer — there's no universal best.

Contextractor supports Markdown, plain text, and structured JSON output. The Markdown output preserves headings and lists, which matters for RAG pipelines where chunk boundaries should align with document structure.

Storage layer

Where you put the data depends on what you're doing with it.

Vector databases (Pinecone, Weaviate, Qdrant, pgvector) for RAG — you chunk the text, generate embeddings, and store them for similarity search. Object storage (S3, GCS) for raw archives — cheap, durable, no schema to manage. Relational databases for structured fields — title, author, date, URL alongside the content. Apify Dataset if you're running actors on Apify's platform — it handles the storage and export automatically.

Most production systems use at least two of these. The extracted text goes into a vector database for retrieval while the raw HTML goes into object storage for reprocessing when your extraction improves.
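
Chunking for the vector store is where Markdown structure pays off. A minimal sketch that starts a new chunk at every heading — the `max_chars` cap and the splitting rules are assumptions for illustration, not a prescribed recipe:

```python
def chunk_markdown(md: str, max_chars: int = 800) -> list[str]:
    """Split extracted Markdown into chunks, starting a new chunk at each
    heading so chunk boundaries align with document structure."""
    chunks, current = [], []
    size = 0
    for line in md.splitlines():
        if line.startswith("#") and current:           # new section: flush
            chunks.append("\n".join(current))
            current, size = [], 0
        if size + len(line) > max_chars and current:   # oversized: flush
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Proxy layer\nResidential proxies...\n# Rendering layer\nPlaywright..."
print(chunk_markdown(doc))
```

Chunks that never straddle a section boundary retrieve better than fixed-size windows cut mid-sentence — which is the concrete reason to care that the formatting layer preserved headings at all.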

AI integration layer

This is where the stack changed the most between 2024 and 2026.

Model Context Protocol (MCP) is the new standard for connecting AI agents to external data sources. Anthropic released the spec in November 2024 [10], OpenAI and Google adopted it in early 2025, and by December 2025 it was donated to the Linux Foundation's Agentic AI Foundation with backing from Microsoft, AWS, and Cloudflare [11]. As of March 2026, the ecosystem counts 10,000+ active MCP servers and 97 million monthly SDK downloads [12].

What does that mean for scraping? An MCP server for web extraction lets any MCP-compatible AI agent — Claude, ChatGPT, Gemini, Copilot — request web content extraction as a tool call. The agent says "extract this URL," the MCP server handles the entire pipeline (fetch, render if needed, extract, format), and returns clean content. No custom integration code per LLM provider.
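
The message shape is plain JSON-RPC. The sketch below fakes a `tools/call` handler for a hypothetical `extract_url` tool — a real server would use the official MCP SDKs and actually run the pipeline, but the request/response structure looks roughly like this:

```python
import json

def handle_tools_call(request: dict) -> dict:
    """Sketch of an MCP-style 'tools/call' handler for a hypothetical
    'extract_url' tool. Real servers use the MCP SDKs and speak JSON-RPC
    over stdio or HTTP; this only illustrates the message shape."""
    assert request["method"] == "tools/call"
    params = request["params"]
    assert params["name"] == "extract_url"
    url = params["arguments"]["url"]
    # Here a real server would fetch, render if needed, extract, and format.
    content = f"(clean markdown extracted from {url})"  # placeholder
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": content}]},
    }

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "extract_url", "arguments": {"url": "https://example.com"}},
}
print(json.dumps(handle_tools_call(request), indent=2))
```

The agent never sees the proxy, rendering, or extraction layers — it sees one tool that takes a URL and returns text, which is the whole appeal.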

LangChain and LlamaIndex still work as orchestration layers for custom RAG pipelines. LangChain has document loaders for web content (WebBaseLoader, SeleniumURLLoader) but they do minimal extraction — you'd feed raw or lightly-processed HTML into your chunks, which is exactly the problem Contextractor solves [13].

All-in-one vs. composable

Here's the real architectural decision. Do you buy a platform that handles the whole pipeline, or assemble your own from specialized tools?

All-in-one platforms

Firecrawl is the poster child for the all-in-one approach. One API call: give it a URL, get back Markdown, structured data, or screenshots. Pricing: Free (500 lifetime credits), Hobby ($16/month, 3,000 credits), Standard ($83/month, 100,000 credits), Growth ($333/month, 500,000 credits), Scale ($599/month, 1M credits) [14]. The catch: each "credit" is one page for basic scraping, but extraction uses a 5x multiplier — so that 100K-credit Standard plan only extracts 20K pages with AI features enabled.
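
The multiplier is worth doing the arithmetic on before committing to a plan. A tiny calculator, using the plan numbers quoted above and the 5x AI-extraction multiplier:

```python
def effective_pages(credits: int, ai_extraction: bool, multiplier: int = 5) -> int:
    """Pages you can actually process on a credit plan when AI extraction
    consumes multiple credits per page (5x in the pricing described above)."""
    return credits // (multiplier if ai_extraction else 1)

def cost_per_page(monthly_usd: float, credits: int, ai_extraction: bool) -> float:
    return monthly_usd / effective_pages(credits, ai_extraction)

# Standard plan: $83/month, 100K credits.
print(effective_pages(100_000, ai_extraction=True))              # 20000
print(round(cost_per_page(83, 100_000, ai_extraction=True), 5))  # 0.00415
```

Roughly $4.15 per thousand AI-extracted pages on the Standard plan — the number to compare against your own compute plus proxy bandwidth.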

I did a detailed comparison in Firecrawl vs. Contextractor. The short version: Firecrawl is convenient but expensive at scale, and its extraction quality doesn't match dedicated tools.

Bright Data went from being a proxy provider to a full scraping platform. Their Web Unlocker handles rendering and anti-bot bypass; their Scraper API handles crawling; their datasets are pre-scraped. The pricing is consumption-based and starts around $500/month for meaningful volume, scaling into thousands for production workloads [1].

Apify sits somewhere in between — it's a platform with a marketplace of 3,000+ pre-built actors (scrapers), but it also gives you Crawlee as an open-source framework to build your own. You can self-host Crawlee and never pay Apify a cent, or deploy on their cloud and pay for compute and proxy bandwidth.

Composable stacks

The alternative: pick the best tool at each layer and wire them together.

A typical composable stack for AI content extraction: Apify Proxy or Oxylabs for the proxy layer, Playwright (via Crawlee's PlaywrightCrawler) for rendering, Crawlee for crawling, Contextractor (wrapping Trafilatura) for extraction, Markdown output for formatting, Pinecone or pgvector for storage, and MCP or LangChain for AI integration.
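
The wiring itself is mostly function composition. A sketch with stub components — each lambda stands in for a real tool (Playwright fetch, Contextractor extraction, and so on), and swapping one layer means replacing one callable:

```python
from typing import Callable

# Each layer is just a function; swapping a component means swapping one
# callable, not rewriting the pipeline. All implementations here are stubs.
Fetcher = Callable[[str], str]       # url -> raw HTML
Extractor = Callable[[str], str]     # html -> clean text
Formatter = Callable[[str], str]     # text -> output format
Store = Callable[[str, str], None]   # (url, document) -> persisted

def make_pipeline(fetch: Fetcher, extract: Extractor,
                  fmt: Formatter, store: Store) -> Callable[[str], str]:
    def run(url: str) -> str:
        doc = fmt(extract(fetch(url)))
        store(url, doc)
        return doc
    return run

# Stub components standing in for the real tools named above.
storage: dict[str, str] = {}
pipeline = make_pipeline(
    fetch=lambda url: f"<html><body><p>content of {url}</p></body></html>",
    extract=lambda html: html.split("<p>")[1].split("</p>")[0],
    fmt=lambda text: text.strip(),
    store=lambda url, doc: storage.__setitem__(url, doc),
)
print(pipeline("https://example.com/a"))
```

The glue code the all-in-one platforms save you is roughly this shape — small, but someone owns it, along with the error handling and retries a real version needs at every boundary.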

The tradeoff is real. You own more infrastructure, you write more glue code, and you debug integration issues between components. But you control each layer independently, you can swap components without rewriting the pipeline, and — critically — the cost curve is fundamentally different.

Cost at scale

[Figure: monthly cost comparison, all-in-one vs. composable, at 10K, 100K, and 1M pages/month]

These are estimated monthly costs. Real numbers depend on target difficulty, proxy needs, and how much rendering you actually need. But the ratios hold.

| Scale | Firecrawl | Bright Data | Composable |
|---|---|---|---|
| 10K pages/month | $83 (Standard) | ~$500 (minimum commitment) | ~$20 (Apify free tier + minimal compute) |
| 100K pages/month | $333 (Growth) | ~$800 (Web Scraper API + proxy) | ~$100 (Apify starter + proxy bandwidth) |
| 1M pages/month | ~$1,500 (Scale + overages) | ~$2,000+ (proxy + API) | ~$350 (self-hosted Crawlee + proxy) |

The composable stack's cost advantage grows with scale because extraction and crawling run on your own compute (or Apify's pay-per-use model), not on per-page credit systems. At 1M pages/month, the open-source stack costs 4-6x less than the all-in-one platforms. The tradeoff is engineering time — someone has to maintain the pipeline.

For small-scale work (under 10K pages), Firecrawl's Hobby plan at $16/month is genuinely hard to beat on convenience. You're paying for not having to think about infrastructure. That's a fair trade at low volumes.

Where Contextractor fits

Contextractor occupies the extraction and formatting layers. It doesn't try to be a crawler, doesn't manage proxies, doesn't do storage. It takes HTML (however you got it) and returns clean, structured content.

That's a deliberate design choice. The extraction layer is where quality matters most for downstream AI use — it's the difference between a RAG pipeline that retrieves relevant answers and one that retrieves navigation menu text. And it's the layer where specialized tools consistently outperform all-in-one platforms.

Trafilatura — the engine Contextractor wraps — achieved the best mean F1 across eight evaluation datasets in the SIGIR 2023 benchmark [8], and held that lead in the 2024 Sandia National Laboratories evaluation [15]. The comparison with Jina ReaderLM and the broader benchmark analysis have the specifics.

The composable stack philosophy says: use Crawlee for crawling (it's the best crawler), use Contextractor for extraction (it's the best extractor), use MCP for AI integration (it's the standard). Each tool does one thing well. You don't ask your screwdriver to also be a hammer.

What about legality?

A scraping stack guide would be irresponsible without mentioning that web scraping law is a mess. The short version for 2026: in the US, scraping publicly available data generally doesn't violate the Computer Fraud and Abuse Act (hiQ Labs v. LinkedIn, 2022), though contract and copyright claims are still very much in play; the EU treats it as a data protection issue under GDPR; and individual sites' Terms of Service create a patchwork of contractual obligations that nobody reads.

The technical architecture doesn't change based on legality, but your choice of targets does. The proxy layer exists partly because of this — residential proxies exist because sites try to block scraping, and sites try to block scraping partly because of legal gray areas around data collection at scale.

Picking your stack

If I had to recommend a default stack for a team building an AI product that needs web content:

For prototyping and small scale: Firecrawl or Crawl4AI. Get something working in an afternoon. Don't over-engineer.

For production at 100K+ pages: Crawlee + Contextractor + your choice of vector database + MCP server. You'll spend a day or two on integration, but the cost savings and quality improvement over all-in-one platforms pay for themselves within a month.

For enterprise at 1M+: self-hosted Crawlee (or Apify actors), Bright Data or Oxylabs for residential proxy, Contextractor for extraction, and a proper data pipeline with monitoring. At this scale, the per-page economics of all-in-one platforms become genuinely painful and the proxy layer becomes your largest cost.

The stack is seven layers deep. Nobody said it was simple. But picking the right tool at each layer — instead of a platform that's mediocre at all of them — is how you build something that scales without burning money.

Citations

1. Bright Data: Web Scraper API Pricing. Retrieved March 27, 2026.
2. Oxylabs: Residential Proxies Pricing. Retrieved March 27, 2026.
3. Apify: Pricing. Retrieved March 27, 2026.
4. Microsoft: Playwright Documentation. Retrieved March 27, 2026.
5. Apify: Crawlee — Web scraping and browser automation library. Retrieved March 27, 2026.
6. Crawl4AI: Open-source LLM Friendly Web Crawler. Retrieved March 27, 2026.
7. Adrien Barbaresi: Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. Proceedings of ACL-IJCNLP 2021: System Demonstrations, pp. 122-131.
8. Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein: An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of SIGIR 2023.
9. An Index-based Approach for Efficient and Effective Web Content Extraction. arXiv, December 2025.
10. Anthropic: Introducing the Model Context Protocol. November 2024.
11. Linux Foundation: Announces the Formation of the Agentic AI Foundation. December 9, 2025.
12. Anthropic: Donating the Model Context Protocol and Establishing the Agentic AI Foundation. December 2025.
13. LangChain: Document Loaders. Retrieved March 27, 2026.
14. Firecrawl: Pricing. Retrieved March 27, 2026.
15. Sandia National Laboratories: An Evaluation of Main Content Extraction Libraries. SAND2024-10208, August 2024.

Updated: March 27, 2026