Web scraping and the law in 2026 — what developers need to know
This is not legal advice. I'm a developer writing about the legal landscape as I understand it. If you're making decisions that could get your company sued, talk to a lawyer who actually passed the bar.
Five years ago, most scraping discussions boiled down to "is it legal?" and the honest answer was "probably, if the data is public." That answer got a lot more complicated. Between 2021 and 2025, the US Supreme Court narrowed the Computer Fraud and Abuse Act, robots.txt became an actual internet standard, the EU passed an AI Act with teeth around training data, and major publishers started filing copyright suits against AI companies. If you're building anything that fetches web content programmatically — a content extraction pipeline, a search index, an anti-bot testing tool — you need to know where the lines are.
The CFAA question is (mostly) settled
The Computer Fraud and Abuse Act has been the US statute most commonly thrown at scrapers since the 1990s. Its key phrase — accessing a computer "without authorization" — was vague enough that companies tried to use it against anyone who ignored their terms of service.
Two cases changed that.
In June 2021, the Supreme Court decided Van Buren v. United States and adopted what Justice Barrett called a "gates-up-or-down" test [1]. The CFAA only applies when someone bypasses an access gate that's actually closed — like a login wall or password protection. Using a system in a way the owner doesn't like, but that you're technically allowed to access, isn't a CFAA violation. The Court explicitly rejected the broader reading that would have made the CFAA a catch-all for any computer misuse.
The Ninth Circuit applied that logic to scraping the same year in hiQ Labs v. LinkedIn. The court held that scraping publicly accessible data — profiles that anyone could view without logging in — didn't violate the CFAA [2]. LinkedIn couldn't just send a cease-and-desist letter and then claim any further scraping was "without authorization."
The case settled on December 7, 2022. hiQ agreed to a permanent injunction and a $500,000 judgment [3]. But here's the part that matters for precedent: the settlement didn't overturn the Ninth Circuit's holding. The CFAA "without authorization" language still doesn't reach publicly accessible data. hiQ got nailed on other grounds — breach of contract, using fake accounts to access password-protected pages, and common law trespass to chattels.
The practical takeaway: if data is behind a login, don't scrape it without permission. If it's genuinely public, the CFAA probably doesn't apply — but terms of service still can.
Robots.txt is no longer just a gentleman's agreement
For 28 years, robots.txt was a de facto standard with no formal specification. Martijn Koster posted the original convention in 1994, and everyone just sort of... followed it. In September 2022, the IETF published RFC 9309, co-authored by Koster and Google engineers, making the Robots Exclusion Protocol an official Proposed Standard [4].
Why does that matter legally? Because the EU AI Act now references it directly.
Article 53(1)(c) of the EU AI Act requires providers of general-purpose AI (GPAI) models to implement policies that comply with EU copyright law — specifically the text and data mining (TDM) opt-out mechanism from Article 4 of the 2019 Copyright Directive [5]. The GPAI Code of Practice, published on July 10, 2025, goes further: signatories must identify and comply with "machine-readable protocols used to express rights reservations," explicitly naming robots.txt and subsequent IETF versions [6].
That turns a polite convention into something with real consequences. A publisher who sets `Disallow: /` for your bot in robots.txt is now exercising a legal right under EU law, not just making a request. GPAI providers who ignore it face enforcement fines of up to 3% of global annual turnover or 15 million euros — whichever is higher — starting August 2, 2026 [7].
I find this fascinating, honestly. A text file that's been purely voluntary for three decades suddenly has regulatory weight because of AI training concerns. Whether the enforcement will actually bite remains to be seen — the EU AI Office is still staffing up — but the legal framework is in place.
For your cookie consent handling and crawling pipelines, this means robots.txt parsing isn't optional anymore. It's compliance infrastructure.
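Treating robots.txt as compliance infrastructure starts with parsing it correctly. Here's a minimal sketch using Python's standard-library parser; the bot name `examplebot` and the sample rules are placeholders, and in production you'd load the live file with `set_url()` and `read()` rather than a hardcoded string:

```python
from urllib import robotparser

# A rights reservation as a publisher might express it: block one bot
# entirely, restrict everyone else to public paths. Names are placeholders.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The blocked bot may fetch nothing; other agents only avoid /private/.
print(parser.can_fetch("examplebot", "https://example.com/articles/1"))  # False
print(parser.can_fetch("otherbot", "https://example.com/articles/1"))    # True
print(parser.can_fetch("otherbot", "https://example.com/private/x"))     # False
```

Gate every request through `can_fetch()` before it leaves your pipeline, and re-fetch the file periodically — a publisher's opt-out only protects you if you're checking the current version.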
The EU's layered approach to scraping
The EU didn't address scraping with a single law. It's a stack of overlapping regulations, and understanding which one applies depends on what you're scraping and why.
The Copyright Directive (2019/790) — Article 3 allows TDM for research organizations and cultural heritage institutions without restriction. Article 4 extends the exception to everyone else, but only if the rightsholder hasn't opted out "in an appropriate manner, such as machine-readable means" [8]. That's where robots.txt and the W3C TDM Reservation Protocol come in.
The AI Act (Regulation 2024/1689) — GPAI transparency obligations took effect August 2, 2025. Providers must publish a summary of their training data and respect copyright opt-outs. The European Commission launched a consultation in December 2025 on which machine-readable protocols count as valid opt-out mechanisms — robots.txt is the leading candidate, but metadata-based approaches like ISCC are also under consideration [9].
GDPR — This is where things get strict. Even publicly accessible personal data is still personal data under GDPR. The Italian DPA fined Clearview AI 20 million euros in February 2022 for scraping facial images from public websites [10]. France's CNIL imposed an identical fine in October 2022 [11]. In December 2024, the CNIL fined KASPR 240,000 euros for scraping LinkedIn contact details — including data from users who'd explicitly restricted their profile visibility [12].
The common misconception among developers: "If it's public, I can take it." Under GDPR, that's wrong. You still need a legal basis — typically legitimate interest — and you need to be able to demonstrate proportionality, data minimization, and that you've conducted a balancing test. The KASPR decision made clear that even B2B lead-generation scraping of professional data isn't safe just because the data is on LinkedIn.
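Data minimization can start at ingestion: mask obvious personal-data fields before anything reaches storage. A minimal sketch — the two regexes are illustrative only, not a complete PII detector, and under GDPR this would sit alongside a documented legal basis and balancing test, not replace them:

```python
import re

# Illustrative patterns only -- a real pipeline needs a proper PII
# detection step, not two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def minimize(text: str) -> str:
    """Mask emails and phone-like numbers before the text is stored."""
    text = EMAIL_RE.sub("[email redacted]", text)
    text = PHONE_RE.sub("[phone redacted]", text)
    return text

sample = "Contact Jane at jane.doe@example.com or +33 1 23 45 67 89."
print(minimize(sample))
# Contact Jane at [email redacted] or [phone redacted].
```

The point is architectural rather than the patterns themselves: the earlier in the pipeline personal data gets dropped or masked, the easier the proportionality argument becomes.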
Publishers fight back over training data
The copyright battles over AI training data are the wildcard.
In December 2023, The New York Times filed suit against OpenAI and Microsoft, alleging that millions of copyrighted articles were used to train GPT models without license or consent [13]. The Times asked not just for monetary damages but for the destruction of all models trained on its content. OpenAI's defense: fair use. As of early 2026, the case is still in discovery — OpenAI allegedly deleted potentially relevant evidence, which isn't a great look [14].
Meanwhile, Reddit went the licensing route. In February 2024, Reddit struck a $60 million per year deal with Google for access to its content via API, and a separate deal with OpenAI [15]. Reddit's total data licensing revenue hit $203 million by the time of its IPO filing [16]. The company has since pushed for "dynamic pricing" — compensation that scales with how essential Reddit content becomes to AI-generated answers.
These two approaches represent the fork in the road for content owners. Sue, or license. The outcomes of the NYT case and similar lawsuits (there are dozens now) will determine whether scraping web content for AI training falls under fair use. Until then, the legal risk is real, and "we'll argue fair use later" is not a compliance strategy.
What this means for your scraping pipeline
Here's a practical checklist — not exhaustive, but it covers the main risk areas:
Access control — Is the data behind authentication? If yes, don't touch it without explicit permission. Post-Van Buren, the CFAA's "without authorization" only kicks in when you bypass an actual gate. But logging in with a fake account (like hiQ did) absolutely counts.
Robots.txt — Parse it. Respect it. Under the EU AI Act and Copyright Directive, ignoring a machine-readable opt-out isn't just impolite — it's potentially illegal if you're training an AI model or operating in the EU market. RFC 9309 defines the format; use a proper parser, not regex [4].
Personal data — If your scraping target contains names, email addresses, phone numbers, IP addresses, or anything else that qualifies as personal data, GDPR applies (assuming EU data subjects). You need a legal basis. "Legitimate interest" is probably your only option, and it requires a documented balancing test. Scrape the minimum you need and anonymize early.
Terms of service — hiQ v. LinkedIn established that violating ToS isn't a CFAA violation for public data, but it can still be breach of contract. The risk is real, especially if you've agreed to the terms by creating an account.
Rate limiting and identification — Set a reasonable crawl rate. Identify your bot with a descriptive User-Agent string. This isn't strictly a legal requirement in most jurisdictions, but it's evidence of good faith — and good faith matters when a court is evaluating whether your scraping was reasonable.
Data provenance — Log what you scraped, when, and from where. If you're building training data, the EU AI Act requires a "sufficiently detailed summary" of training content [5]. Even if you're not subject to the AI Act directly, provenance records make it much easier to respond to takedown requests.
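The rate-limiting point from the checklist above can be as simple as a monotonic-clock throttle plus a descriptive User-Agent string. A minimal sketch — the bot name and contact URL are placeholders, and the 0.1-second interval is only for demonstration:

```python
import time

# Placeholder identity -- use your real bot name and a page explaining
# what your crawler does and how to reach you.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.1)  # in production: a second or more per host
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real crawler would issue the HTTP request here
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```

Per-host throttles (one `Throttle` instance per domain) are the usual refinement, and honoring any `Crawl-delay` the site publishes is the good-faith move even though RFC 9309 doesn't standardize it.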
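Provenance from the last checklist item can be captured as a small structured record per fetch — URL, UTC timestamp, content hash, and the identity you crawled with. A sketch with illustrative field names:

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(url: str, content: bytes, user_agent: str) -> dict:
    """Build a per-fetch provenance entry (field names are illustrative)."""
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "user_agent": user_agent,
    }

record = provenance_record(
    "https://example.com/articles/1",
    b"<html>...</html>",
    "examplebot/1.0 (+https://example.com/bot-info)",
)
print(json.dumps(record, indent=2))
```

Appended to a log at fetch time, records like this answer "what did we take, when, and from where" — which is exactly what a takedown request or a training-data summary will ask.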
How Contextractor approaches this
Contextractor is a content extraction tool — it takes raw HTML and pulls out the article text, metadata, and structure. It doesn't crawl the web, doesn't train models, and doesn't store your data. The extraction happens on HTML you've already obtained.
That said, the design choices are relevant to compliance. Contextractor uses Trafilatura as its extraction engine, which implements data minimization by design — it strips navigation, ads, tracking scripts, and sidebars, returning only the main content a human would read. If you're building a pipeline that needs to handle personal data carefully, extracting just the article text (and discarding the surrounding markup that often contains user data, comments, and tracking IDs) is a meaningful step toward GDPR proportionality.
The tool also preserves content metadata — title, author, publication date, categories — which helps with data provenance. You can trace what was extracted and from what source, which is exactly what the EU AI Act's transparency requirements are pushing toward.
For the crawling and cookie consent side of things, those decisions happen before Contextractor touches the HTML. But having a clean extraction layer that minimizes what gets stored downstream is one less thing to worry about.
I wrote this in March 2026. Laws change, cases settle, regulations get amended. The EU AI Act enforcement deadline is August 2, 2026 — check back after that to see if the fines actually start landing.
Citations
1. Supreme Court of the United States: Van Buren v. United States, 593 U.S. 374. Decided June 3, 2021.
2. United States Court of Appeals for the Ninth Circuit: hiQ Labs, Inc. v. LinkedIn Corp. Decided April 18, 2022.
3. Privacy World: LinkedIn's Data Scraping Battle with hiQ Labs Ends with Proposed Judgment. Retrieved March 27, 2026.
4. IETF: RFC 9309 - Robots Exclusion Protocol. Published September 2022.
5. Latham & Watkins: EU AI Act: GPAI Model Obligations in Force and Final GPAI Code of Practice in Place. Retrieved March 27, 2026.
6. European Commission: The General-Purpose AI Code of Practice. Retrieved March 27, 2026.
7. European Commission: AI Act - Regulatory Framework. Retrieved March 27, 2026.
8. European Parliament: Directive (EU) 2019/790 on Copyright in the Digital Single Market. Retrieved March 27, 2026.
9. European Commission: Consultation on Protocols for Reserving Rights from Text and Data Mining under the AI Act. Retrieved March 27, 2026.
10. European Data Protection Board: Facial Recognition: Italian SA Fines Clearview AI EUR 20 Million. Retrieved March 27, 2026.
11. CNIL: Facial Recognition: 20 Million Euros Penalty Against Clearview AI. Retrieved March 27, 2026.
12. CNIL: Data Scraping: KASPR Fined EUR 240,000. Retrieved March 27, 2026.
13. Harvard Law Review: NYT v. OpenAI: The Times's About-Face. Retrieved March 27, 2026.
14. TechCrunch: OpenAI Accidentally Deleted Potential Evidence in NY Times Copyright Lawsuit. Retrieved March 27, 2026.
15. CBS News: Google Strikes $60 Million Deal with Reddit for AI Training. Retrieved March 27, 2026.
16. TechCrunch: Reddit Says It's Made $203M So Far Licensing Its Data. Retrieved March 27, 2026.
Updated: March 27, 2026