Anti-bot detection in 2026 — five layers scrapers must navigate

Try scraping a major e-commerce site with a vanilla Python requests call and you'll get blocked within seconds. Try the same thing with Playwright and you might last a few minutes longer — but not much. The defenses have gotten genuinely sophisticated, and in 2026, the average enterprise site runs at least three detection layers simultaneously.

The anti-bot industry is worth over $1.5 billion and growing, driven partly by the explosion of AI-powered scraping for LLM training data[1]. Cloudflare alone reported in mid-2025 that crawling for AI model training accounted for nearly 80% of all AI bot activity on their network. That kind of volume has forced the defense side to evolve fast.

What most scrapers don't realize is that these defenses aren't a single wall — they're layered, each one progressively more expensive to evaluate and harder to circumvent.

[Figure: Anti-bot defense stack showing five layers, from IP reputation to JS proof-of-work, stacked by computational cost]

Layer one: IP reputation

The cheapest check and the first one applied. Before the server even looks at what you're requesting, it knows where you're coming from.

ASN classification is the big one. Every IP address belongs to an Autonomous System Number, and those ASNs are tagged. AWS, Google Cloud, DigitalOcean, Hetzner — these are datacenter ranges, and anti-bot systems treat them with suspicion by default[2]. A request from a residential Comcast IP starts with a fundamentally different trust score than one from an EC2 instance. That's before anything else happens.

The check is basically a database lookup. Fast, near-zero cost per request.

Beyond ASN, IP reputation systems track:

  • Rate patterns — how many requests has this IP sent in the last minute, hour, day?
  • Blocklist membership — is this IP on known spam or abuse lists?
  • Geographic anomalies — an IP geolocated in Brazil requesting a page with Accept-Language: ja-JP looks off
  • VPN and proxy detection — services like IPQS claim 99.95% accuracy detecting residential proxies[2]

Residential proxies exist specifically to get around ASN classification, and they work — up to a point. The anti-bot side has responded by building behavioral profiles per IP, tracking whether the traffic from a given residential IP looks like normal browsing or programmatic access.
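The layered scoring described above can be sketched in a few lines. Everything here (the ASN table, the blocklist, the weights, the thresholds) is invented for illustration; real systems rely on commercial reputation databases and far richer feature sets.

```python
# Minimal sketch of a layered IP reputation check. ASN table, blocklist,
# and weights are illustrative only -- real systems use commercial data.
from collections import deque
import time

DATACENTER_ASNS = {16509: "AMAZON-02", 15169: "GOOGLE", 14061: "DIGITALOCEAN"}
BLOCKLIST = {"203.0.113.7"}  # example IP from the documentation range

class IPReputation:
    def __init__(self, rate_limit=60, window=60.0):
        self.rate_limit = rate_limit   # max requests per window
        self.window = window           # seconds
        self.seen = {}                 # ip -> deque of request timestamps

    def score(self, ip, asn, geo_country, accept_language, now=None):
        """Return a suspicion score in [0, 1]; higher is more bot-like."""
        now = time.time() if now is None else now
        s = 0.0
        if asn in DATACENTER_ASNS:
            s += 0.4                   # datacenter origin
        if ip in BLOCKLIST:
            s += 0.5                   # known-abuse list membership
        # Geo/language mismatch, e.g. a Brazilian IP requesting ja-JP content
        if geo_country == "BR" and accept_language.startswith("ja"):
            s += 0.2
        # Sliding-window rate check
        q = self.seen.setdefault(ip, deque())
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) > self.rate_limit:
            s += 0.3                   # hammering the site
        return min(s, 1.0)
```

Each check is a dictionary lookup or a deque scan, which is the point: this layer has to run on every request, so it has to be nearly free.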

Layer two: browser fingerprinting

Once a request passes IP checks, the next question is whether it comes from a real browser. This is where things get interesting, because "real browser" is a surprisingly complex claim to verify.

TLS fingerprinting happens at the connection level, before any HTTP traffic is exchanged. The TLS ClientHello message — which ciphers the client supports, in what order, with which extensions — creates a fingerprint. JA3 (developed by Salesforce engineers John Althouse, Jeff Atkinson, and Josh Atkins in 2017) was the first widely adopted TLS fingerprinting method[3]. Its successor JA4, released in 2023, handles the fact that modern browsers randomize extension order to break naive fingerprinting[4].

Python's requests library has a TLS fingerprint that looks nothing like Chrome's. Immediate red flag.
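To make the mechanism concrete: JA3 is the MD5 of the ClientHello fields, each field's values dash-joined and the fields comma-separated. The field values below are illustrative, not a capture of any real browser.

```python
# Sketch of JA3 derivation from ClientHello fields. The numeric values
# are illustrative, not taken from a real Chrome or Firefox handshake.
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """JA3 = MD5 over comma-joined, dash-delimited ClientHello fields."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Two clients offering the same ciphers in a different order produce
# different fingerprints -- the ordering itself is part of the signal.
a = ja3_hash(771, [4865, 4866, 49195], [0, 11, 10], [29, 23], [0])
b = ja3_hash(771, [49195, 4865, 4866], [0, 11, 10], [29, 23], [0])
```

That order-sensitivity is exactly what JA4 had to rework once Chrome started shuffling extension order on every connection.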

HTTP header order is another passive signal. Browsers send headers in a consistent, browser-specific order. Chrome puts sec-ch-ua early; Firefox doesn't send it at all. Anti-bot systems like Akamai Bot Manager check for "incorrect header signatures and common bot-building frameworks" as part of what they call transparent detection[5].
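A toy version of that check, assuming simplified per-browser profiles (real signatures track many more headers and versions):

```python
# Toy header-order check: does the observed ordering of a few marker
# headers match a per-browser profile? Profiles are simplified guesses,
# not authoritative browser captures.
PROFILES = {
    "chrome": ["sec-ch-ua", "user-agent", "accept", "accept-language"],
    "firefox": ["user-agent", "accept", "accept-language"],
}

def matches_profile(observed_headers, browser):
    """True if the profile's headers appear, in order, in the request."""
    profile = PROFILES[browser]
    observed = [h.lower() for h in observed_headers]
    # Keep only the headers the profile cares about, preserving order
    relevant = [h for h in observed if h in profile]
    return relevant == profile

# python-requests sends no sec-ch-ua and a different ordering entirely
requests_like = ["User-Agent", "Accept-Encoding", "Accept", "Connection"]
```

The check is passive and cheap: no JavaScript, no challenge, just a comparison against known browser shapes.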

Then there's the JavaScript side — if the client executes JS at all:

  • navigator.webdriver is true for automated browsers. Every Selenium, Puppeteer, and Playwright instance sets it.
  • navigator.plugins.length === 0 in headless mode (a real Chrome reports installed plugins)
  • Canvas and WebGL rendering produce hardware-specific outputs. DataDome uses an approach called Picasso — originally from Google — that renders graphic elements via the canvas API and checks whether the GPU output matches what the claimed browser/OS combination should produce[6].
  • The Chrome DevTools Protocol (CDP) itself leaves detectable traces. A 2024 analysis showed that a specially crafted console.log with getter functions can detect whether CDP's Runtime.enable has been called[7].

I find it funny that navigator.webdriver still catches people. You'd think every scraping framework would patch it by now, but enough don't — or do it badly — that it remains a productive signal.

Layer three: behavioral analysis

This layer only matters for clients that actually render pages and interact with them — meaning headless browsers, not HTTP clients. And it's where the detection gets genuinely hard to evade.

DataDome processes over 5 trillion signals per day and claims sub-2ms response times[8]. Their behavioral models look at mouse movement entropy, scroll acceleration curves, click timing distributions, and navigation paths. Real humans produce noisy, inconsistent input. Bots produce patterns that are either too perfect (mathematical mouse curves) or too random (uniform distributions that no human hand generates).

Cloudflare takes this a step further with per-customer models — anomaly detection that learns what "normal" looks like for a specific website and flags deviations[9]. A scraper that navigates directly to product pages without ever visiting the homepage triggers a different signal than one that follows a plausible browsing path.

HUMAN Security (formerly PerimeterX) runs behavioral detection at specific transaction points — login, checkout, search — where bot patterns diverge most sharply from human ones[10]. Their sensor collects hundreds of features per session.

The arms race here is real. Scraper developers inject synthetic mouse movements with Bezier curves, add random delays, simulate scroll events. Detection vendors respond with more granular timing analysis. It's expensive on both sides.
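The evasion side of that arms race can be sketched as a jittered Bezier path. The control-point offsets, noise scale, and timing ranges here are arbitrary choices for illustration, not values from any real evasion tool.

```python
# Sketch of synthetic mouse movement: a cubic Bezier curve with position
# and timing jitter. All magic numbers are arbitrary illustrative choices.
import random

def bezier_path(start, end, steps=50, noise=1.5, rng=None):
    """Generate (x, y, dt) samples along a jittered cubic Bezier curve."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Random interior control points bow the curve like a hand movement
    x1, y1 = x0 + (x3 - x0) * 0.3, y0 + rng.uniform(-80, 80)
    x2, y2 = x0 + (x3 - x0) * 0.7, y3 + rng.uniform(-80, 80)
    path = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        # Jitter both position and inter-sample delay so neither the
        # curve nor the timing is mathematically uniform
        path.append((x + rng.gauss(0, noise),
                     y + rng.gauss(0, noise),
                     rng.uniform(0.005, 0.02)))
    return path
```

Which is also why this evasion eventually fails: a Gaussian jitter model is itself a distribution, and detection vendors with enough sessions can tell it apart from actual hand tremor.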

Layer four: active challenges

When passive detection is inconclusive, the server can throw a challenge. The most familiar form is CAPTCHA — clicking traffic lights, typing distorted text, matching images.

Google's reCAPTCHA v3 runs invisibly and assigns a score from 0.0 to 1.0 based on browsing behavior; the site owner decides the threshold[11]. hCaptcha works similarly but positions itself as privacy-focused. HUMAN Security has its own variant called HUMAN Challenge.

The problem with visual CAPTCHAs is that they're increasingly solvable by machines. Vision-language models can identify traffic lights and crosswalks with high accuracy. CAPTCHA-solving services — both human farms and AI-based — offer real-time solving APIs for cents per challenge. The traditional CAPTCHA is dying, and the industry knows it.

That's partly why the newer systems — Cloudflare Turnstile, hCaptcha's passive mode — have moved toward invisible challenges that don't depend on visual puzzles at all.

Layer five: JavaScript proof-of-work

This is the newest layer and, in my opinion, the most clever. Instead of asking a client to prove it's human, it asks the client to prove it can do computational work.

Cloudflare Turnstile, announced in September 2022, runs a suite of non-interactive JavaScript challenges in the background: proof-of-work puzzles, proof-of-space analysis, web API probes, and browser-quirk detection[12]. The difficulty adapts per visitor — a request from a clean residential IP with a normal browser fingerprint gets an easy challenge. A suspicious request gets a harder one.

Proof-of-work is the key insight. A legitimate browser can solve a small cryptographic puzzle in milliseconds without the user noticing. But a scraper running thousands of parallel requests suddenly has to spend actual CPU cycles per request. The cost scales linearly with request volume, which is exactly the economic pressure that makes scraping expensive at scale.
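A minimal hashcash-style sketch of the idea, assuming a leading-zero-bits difficulty scheme (Turnstile's actual puzzles are not public):

```python
# Hashcash-style proof-of-work: find a nonce so that
# sha256(challenge + nonce) falls below a difficulty target.
import hashlib

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce; expected cost doubles per difficulty bit."""
    target = 1 << (256 - difficulty_bits)   # hash must be below this
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Verification is a single hash -- cheap for the server."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

The asymmetry is the whole design: verification costs one hash, solving costs on the order of 2^difficulty_bits hashes, and the server can dial difficulty per visitor.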

For a simple HTTP client that doesn't execute JavaScript, this layer is a wall. You can't solve a proof-of-work challenge without a JS runtime.

Who runs what

The four major commercial anti-bot vendors each emphasize different layers:

| Vendor | Key detection approach | Scale |
| --- | --- | --- |
| Cloudflare | ML scoring + JS challenges (Turnstile) + per-customer anomaly models | Proxies ~20% of all websites[13] |
| Akamai | Transparent detection (header analysis) + behavioral scoring at transaction points | Serves 15-30% of global web traffic |
| DataDome | Canvas fingerprinting (Picasso) + real-time ML across 5T+ daily signals | 85,000+ customer-specific models |
| HUMAN Security | Sensor-based behavioral analysis + per-transaction point detection | Hundreds of features per session |

Smaller players exist — GeeTest, Kasada, Shape Security (now part of F5) — but these four handle the majority of protected traffic.

Why lightweight extraction triggers fewer defenses

Here's the thing that matters for content extraction: most of these layers target browser automation specifically. An HTTP client that fetches HTML and runs it through a content extractor like Trafilatura simply doesn't present most of these attack surfaces.

[Figure: Detection surface comparison — a headless browser exposes 7 of 7 detection signals; HTTP extraction exposes only 2]

A headless browser — Playwright, Puppeteer, Selenium — exposes all seven signal categories. It has a JS environment to fingerprint, behavioral patterns to analyze, canvas to render, CDP traces to detect. Every layer of the defense stack applies.

An HTTP client with content extraction exposes exactly two: IP reputation (unavoidable) and TLS fingerprint (configurable with libraries like curl_cffi or tls-client). It doesn't execute JavaScript, so layers three through five simply don't apply. No behavioral signals to analyze. No canvas to fingerprint. No WebDriver flag to check.

That's not a theoretical advantage. For static content — news articles, blog posts, documentation, the kind of pages where content extraction for LLMs actually matters — an HTTP request with a well-crafted TLS fingerprint and proper headers passes through most defenses without triggering anything.

The trade-off is obvious: if the content requires JavaScript rendering (React SPAs, infinite scroll, client-side hydration), you're back to needing a browser. But for the vast majority of article pages on the web, the HTML is right there in the initial response.
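One way to make that fallback decision automatically is to measure how much visible text the initial HTML actually carries relative to its script payload. The thresholds below are guesses, not tuned values, and a real pipeline would use a proper extractor's output instead of raw character counts.

```python
# Rough heuristic: a client-rendered SPA shell has almost no visible
# text but a large script payload. Thresholds are illustrative guesses.
from html.parser import HTMLParser

class TextVsScript(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def needs_browser(html: str, min_text=200) -> bool:
    """Heuristic: little visible text plus a big script payload => SPA shell."""
    p = TextVsScript()
    p.feed(html)
    return p.text_chars < min_text and p.script_chars > p.text_chars

spa_shell = ('<html><body><div id="root"></div><script>'
             + "x" * 5000 + "</script></body></html>")
article = "<html><body><p>" + "Real article text. " * 30 + "</p></body></html>"
```

A fetcher can run the cheap HTTP path first, apply this check, and escalate to a browser only when the initial response turns out to be an empty shell.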

The arms race direction

Where is all this heading?

Per-customer ML models are the trend. Both Cloudflare and DataDome have moved toward training detection models specific to each customer's traffic patterns, rather than relying on global heuristics[9][8]. A scraper that works on one site gets flagged on another, even if it uses the same techniques, because the baseline "normal" is different.

AI vs. AI is already happening. Cloudflare noted in 2025 that modern scraping tools use LLMs for semantic understanding of page content and computer vision to solve visual challenges[1]. The defense side is responding with models trained on adversarial bot behavior. This is an expensive equilibrium.

Proof-of-work is expanding. Turnstile's approach — making each request cost computational work — has a natural economic elegance that CAPTCHAs lack. Expect more vendors to adopt similar mechanisms, especially for API endpoints.

Fingerprinting is getting deeper. WebGPU fingerprinting is emerging as the next frontier beyond Canvas and WebGL. And TLS fingerprinting keeps evolving — JA4+ extends JA4 into a family of related fingerprints (covering HTTP headers, server responses, TCP, and certificates) for even more granular identification.

The ethical part

This is worth addressing directly, because the scraping community doesn't talk about it enough.

robots.txt became an official IETF standard in September 2022 as RFC 9309[14]. It's a voluntary protocol — nothing enforces it technically — but it signals a site owner's preferences. Ignoring Disallow directives isn't just impolite; France's CNIL now explicitly considers robots.txt compliance as a factor in Legitimate Interest assessments under GDPR[15]. That has real legal teeth.
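Honoring robots.txt takes only the standard library. The rules below are a made-up example, not from any real site.

```python
# Checking robots.txt before fetching, using only the standard library.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

rp.can_fetch("MyCrawler/1.0", "https://example.com/articles/post-1")  # True
rp.can_fetch("MyCrawler/1.0", "https://example.com/private/report")   # False
rp.crawl_delay("MyCrawler/1.0")                                       # 10
```

In a real crawler you'd call `rp.set_url(...)` and `rp.read()` against the live site's /robots.txt instead of parsing an inline string, and honor the crawl delay between requests.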

Rate limiting is basic courtesy. One request every 10-15 seconds is a conservative starting point for well-behaved crawlers. If you're getting 429 (Too Many Requests) responses, the site is explicitly telling you to slow down. Ignoring that is how IP ranges get permanently blocklisted.
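A polite pacer can be sketched as fixed spacing between requests plus exponential backoff on 429s. The interval and doubling policy are illustrative, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
# Sketch of a polite request pacer: minimum spacing between requests,
# exponential backoff on 429. Interval and policy are illustrative.
import time

class PoliteThrottle:
    def __init__(self, min_interval=10.0, clock=None, sleep=None):
        self.min_interval = min_interval        # seconds between requests
        self.clock = clock or time.monotonic    # injectable for testing
        self.sleep = sleep or time.sleep
        self.backoff = 0.0
        self.last = None

    def wait(self):
        """Block until it's polite to send the next request."""
        now = self.clock()
        delay = self.backoff
        if self.last is not None:
            delay = max(delay, self.min_interval - (now - self.last))
        if delay > 0:
            self.sleep(delay)
        self.last = self.clock()

    def record(self, status):
        """Double the penalty on 429, clear it on success."""
        if status == 429:
            self.backoff = max(self.min_interval, self.backoff * 2)
        else:
            self.backoff = 0.0
```

Call `wait()` before each fetch and `record(response.status_code)` after; repeated 429s then stretch the gap between requests instead of hammering the server harder.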

Don't overwhelm servers. A personal blog running on a $5/month VPS can't handle 10,000 requests in a minute. The fact that you technically can doesn't mean you should. This is the kind of thing that gets entire ASN ranges blocked and makes life harder for everyone.

For a deeper look at the legal landscape, see web scraping law. And if you're dealing with cookie consent during extraction, that's its own can of worms.

Contextractor uses HTTP-based extraction by default — which means it operates at the lightweight end of this detection spectrum. For JavaScript-rendered pages, it can fall back to browser-based fetching, but for most article content, a simple HTTP request plus Trafilatura gets the job done without triggering the deeper defense layers.

Citations

  1. Cloudflare: Building unique, per-customer defenses against advanced bot threats in the AI era. Retrieved March 27, 2026

  2. IPQualityScore: Proxy Detection API. Retrieved March 27, 2026

  3. John Althouse: Open Sourcing JA3. Salesforce Engineering, 2017

  4. FoxIO: JA4+ Network Fingerprinting. Retrieved March 27, 2026

  5. Akamai: Detection Methods. Retrieved March 27, 2026

  6. DataDome: The Art of Bot Detection: How DataDome Uses Picasso for Device Class Fingerprinting. Retrieved March 27, 2026

  7. Device and Browser Info: How to detect (modified, headless) Chrome instrumented with Puppeteer. Retrieved March 27, 2026

  8. DataDome: Multi-Layered AI: A New Requirement for Sophisticated Bot Protection. Retrieved March 27, 2026

  9. Cloudflare: Building unique, per-customer defenses against advanced bot threats in the AI era. Retrieved March 27, 2026

  10. HUMAN Security: Bot Defender Detection Overview. Retrieved March 27, 2026

  11. Google: reCAPTCHA v3. Retrieved March 27, 2026

  12. Cloudflare: Turnstile Overview. Retrieved March 27, 2026

  13. Cloudflare: Bot Management. Retrieved March 27, 2026

  14. IETF RFC 9309: Robots Exclusion Protocol. September 2022

  15. PromptCloud: Robots.txt Scraping: Rules, Ethics, and Policy Explained. Retrieved March 27, 2026

Updated: March 23, 2026