Structured data extraction from HTML — traditional scraping meets LLM-powered approaches

You have a product page. You need the price, the title, the SKU, and whether it's in stock. This is structured data extraction — turning messy HTML into clean, typed fields you can put in a database or feed into an application.

There are two fundamentally different ways to do it, and the gap between them is enormous: one costs fractions of a penny per thousand pages and runs in milliseconds; the other costs dollars per page and takes seconds. Both produce JSON. Picking the wrong one for your use case either wastes money or wastes engineering time.

The traditional approach: selectors and patterns

CSS selectors, XPath expressions, and regular expressions have been the workhorses of web scraping since the early 2000s. The idea is simple — you study a page's DOM structure, write expressions that match the elements you want, and extract their text or attributes.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

product = {
    "title": soup.select_one("h1.product-title").text.strip(),
    "price": soup.select_one("span.price").text.strip(),
    "in_stock": "Out of Stock" not in soup.select_one(".availability").text,
}

That's it. Parse the HTML, grab nodes by selector, done. XPath gives you more power — you can traverse up the tree, match on text content, use positional predicates:

from lxml import etree

tree = etree.HTML(html)
price = tree.xpath('//div[@class="product-info"]//span[contains(@class, "price")]/text()')[0]

This approach is fast. Parsing an HTML document with lxml and running a handful of XPath queries takes 5-50 milliseconds depending on page size. At scale — say a million product pages — you're looking at a few dollars of compute, mostly network I/O.

The problem everyone knows about: selectors are brittle. A site redesign, a new CSS framework, even a single renamed class — and your scraper breaks silently, returning empty fields or wrong data. Every target site needs its own set of selectors, and maintaining dozens of them is a real job.
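One cheap mitigation, common practice rather than anything from a particular library's docs, is to make selector misses loud instead of silent. A sketch with BeautifulSoup (the `extract_field` helper and the sample HTML are illustrative):

```python
from bs4 import BeautifulSoup

def extract_field(soup, selector):
    """Return the stripped text of the first match, or None if the selector misses."""
    node = soup.select_one(selector)
    return node.text.strip() if node else None

html = '<h1 class="product-title"> Widget </h1>'
soup = BeautifulSoup(html, "html.parser")

product = {
    "title": extract_field(soup, "h1.product-title"),
    "price": extract_field(soup, "span.price"),  # selector misses -> None, not a crash
}

# Surface drift instead of quietly writing empty rows to the database
missing = [field for field, value in product.items() if value is None]
if missing:
    print(f"possible selector drift, missing fields: {missing}")
```

Routing that warning into monitoring turns "breaks silently" into "breaks with an alert", which is most of the maintenance battle.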

That said, for stable, well-structured sites (government databases, APIs-with-HTML-wrappers, your own internal tools), selectors are still the right answer. I'd argue they're underrated right now because the LLM hype makes them sound obsolete. They're not.

LLM-powered extraction

The newer approach skips selectors entirely. You define a schema — what fields you want and their types — send the page content to a language model, and get structured JSON back. The model figures out where the data lives based on semantic understanding, not DOM position.

[Figure: Comparison of traditional selector-based and LLM-powered extraction approaches]

Firecrawl's JSON mode

Firecrawl offers this as a managed service. Their /scrape endpoint accepts a jsonOptions parameter with a JSON Schema, and the response comes back as structured data matching your schema [1]. Under the hood, it's an LLM call — which is why JSON mode costs 5 credits per page (1 base + 4 for JSON format) instead of the 1 credit for plain scraping [2].

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="...")

result = app.scrape_url("https://example.com/product/123", params={
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "currency": {"type": "string"},
                "in_stock": {"type": "boolean"}
            },
            "required": ["title", "price"]
        }
    }
})

product = result["extract"]

At Firecrawl's Standard tier ($83/month for 100,000 credits, as of March 2026), that's roughly $0.004 per page with JSON extraction — 5x the cost of plain scraping [3]. Cheap enough for moderate volumes, but it adds up fast at scale.

LangChain structured output

If you're already running your own LLM calls, LangChain gives you with_structured_output() — a method on any chat model that supports tool calling. You pass it a Pydantic model or JSON Schema, and the model's response gets parsed and validated automatically [4].

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class Product(BaseModel):
    title: str
    price: float
    currency: str = Field(default="USD")
    in_stock: bool

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Product)

product = structured_llm.invoke(
    f"Extract product information from this page:\n\n{page_text}"
)

This uses OpenAI's structured output mode under the hood — the model is constrained to produce valid JSON matching your schema [5]. No parsing, no regex on the response, no "oops the model returned markdown instead of JSON."

The older create_extraction_chain still exists in LangChain but it's effectively deprecated in favor of with_structured_output() [6].

Where each wins

Neither approach is universally better. The decision depends on three things: how stable the target pages are, how many you're processing, and how much you're willing to spend on engineering versus API costs.

| Factor | Selectors (CSS/XPath) | LLM extraction |
|---|---|---|
| Speed per page | 5-50 ms | 2-10 seconds |
| Cost per 10K pages | ~$0.01 | $100-1,000 |
| Setup time per site | Hours (inspect DOM, write selectors) | Minutes (define schema) |
| Maintenance | High (breaks on redesigns) | Low (adapts to layout changes) |
| Accuracy on known structure | Near-perfect | ~95% (can hallucinate) |
| Handles unknown layouts | No | Yes |
| Deterministic output | Yes | No |

The Crawl4AI documentation puts it bluntly: "99% of developers who think they need LLM extraction are wrong" [7]. That's a bit strong, but the point stands for pages with consistent HTML structure — a CSS selector will be faster, cheaper, and more reliable every time.

Where LLMs genuinely shine is heterogeneous pages. If you're extracting data from thousands of different sites — each with its own layout, naming conventions, and markup patterns — writing selectors for each one isn't feasible. An LLM can handle a product page from Amazon, a listing from Craigslist, and a spec sheet from some manufacturer's WordPress site, all with the same schema definition.

The cost math at scale

Here's where opinions start to diverge from reality. People underestimate how fast LLM costs compound.

Traditional extraction — 10,000 pages through BeautifulSoup or lxml on a $5/month VPS: negligible cost. You're bottlenecked by network, not compute. Call it $0.01 in compute terms.

Firecrawl JSON mode — 10,000 pages at 5 credits each = 50,000 credits. That's half the Standard plan's monthly allocation ($83/month), so roughly $41.50 for those 10K pages [3][2].

Direct LLM calls — sending raw HTML to GPT-4o at $2.50 per million input tokens (as of March 2026) [8]. A typical product page is 50-150K tokens of raw HTML. That's $0.13-0.38 per page just in input tokens. Ten thousand pages: $1,300-3,800. And that's the cheap model.

The numbers get absurd quickly. A million pages through GPT-4o with raw HTML would cost $130,000+ in API fees alone. With selectors, the same job costs a few hundred dollars in compute and bandwidth.
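The back-of-envelope math above is worth making explicit. A tiny helper, using the prices and token counts quoted in this section rather than live API pricing:

```python
def llm_input_cost(pages, tokens_per_page, usd_per_million_tokens):
    """Input-token cost only; completion tokens add a little on top."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000

# 10K pages of raw HTML through GPT-4o at $2.50/M input tokens
low = llm_input_cost(10_000, 50_000, 2.50)    # 1250.0 (~$1,300 once per-page rounding is applied)
high = llm_input_cost(10_000, 150_000, 2.50)  # 3750.0
# The same 10K pages as ~500 tokens of clean text through GPT-4o-mini at $0.15/M
clean = llm_input_cost(10_000, 500, 0.15)     # well under a dollar
```

The clean-text number foreshadows the hybrid pipeline below: almost all of the cost lives in the tokens, not in the extraction.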

This is where a hybrid approach starts to make obvious sense.

The hybrid pipeline

The expensive part of LLM extraction isn't the "extraction" — it's the "sending 150KB of HTML" part. A raw product page might be 100K tokens. The actual product information in that page is maybe 200 words — 300 tokens. If you could send just the relevant text instead of the entire DOM, you'd cut LLM costs by 99%.

That's exactly what content extraction does.

[Figure: Two-phase hybrid pipeline — content extraction followed by LLM structuring]

The pipeline works in two phases:

Phase one — run the raw HTML through a content extractor like Contextractor (which uses Trafilatura under the hood). This strips navigation, ads, sidebars, and scripts, leaving just the readable text. It costs essentially nothing — a few milliseconds per page, no API calls. A 150KB HTML page becomes ~2KB of clean text.
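Trafilatura does this with real readability heuristics and metadata handling; the core idea, though, can be sketched with nothing but the standard library. This is a toy illustration of what boilerplate stripping means, not what Contextractor actually runs:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Toy phase-one extractor: drop nav/script/footer subtrees, keep readable text."""
    SKIP = {"nav", "script", "style", "footer", "aside", "header"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

text = strip_boilerplate(
    "<nav>Home | Shop</nav>"
    "<article><h1>Widget</h1><p>Price: $19.99. In stock.</p></article>"
    "<footer>© 2026 Example Corp</footer>"
)
# text == "Widget Price: $19.99. In stock."
```

A real extractor also scores text density, handles comments and tables, and keeps document structure — which is exactly why reaching for Trafilatura (or a service wrapping it) beats rolling your own.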

Phase two — send that clean text to an LLM with your target schema. GPT-4o-mini at $0.15 per million input tokens processes 500 tokens of clean text for about $0.0001. Even GPT-4o at $2.50/M only costs $0.0013 per page on clean text versus $0.13-0.38 on raw HTML.

For 10,000 pages, the hybrid approach costs roughly $1-3 in LLM fees. Compare that to $1,300-3,800 for raw HTML through the same model.

import httpx
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class Product(BaseModel):
    title: str
    price: float
    currency: str
    in_stock: bool
    description: str

# Phase 1: Content extraction via Contextractor API
response = httpx.post(
    "https://api.contextractor.com/extract",
    json={"url": "https://example.com/product/123", "format": "markdown"}
)
clean_text = response.json()["content"]

# Phase 2: LLM structuring on clean text
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Product)
product = structured_llm.invoke(f"Extract product details:\n\n{clean_text}")

There's also a reliability benefit. LLMs get confused by HTML noise — navigation labels that look like product categories, footer links that seem like related items, ad text that contains prices. Clean text removes those distractions. I've seen extraction accuracy go up measurably just by preprocessing with content extraction, even when the total token count wasn't a concern.

Schema definition: JSON Schema vs. Pydantic

Both the LLM and hybrid approaches need you to define what you want. The two dominant ways to do this are JSON Schema (the universal standard) and Pydantic models (the Python-native approach).

JSON Schema works everywhere — Firecrawl, OpenAI's API directly, any tool that speaks JSON:

{
  "type": "object",
  "properties": {
    "title": {"type": "string", "description": "Product name"},
    "price": {"type": "number", "description": "Price in local currency"},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "in_stock": {"type": "boolean"}
  },
  "required": ["title", "price"]
}

Pydantic models are the same thing but with Python's type system, validation logic, and IDE autocomplete:

from pydantic import BaseModel, Field
from enum import Enum

class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"
    GBP = "GBP"

class Product(BaseModel):
    title: str = Field(description="Product name")
    price: float = Field(gt=0, description="Price in local currency")
    currency: Currency = Currency.USD
    in_stock: bool = True

The Pydantic approach has a practical advantage: you can call .model_json_schema() to generate JSON Schema from your model, and the same class validates the LLM's output at runtime. One definition, two uses. LangChain's with_structured_output() accepts Pydantic models directly [4].
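A quick illustration of that dual use (this trimmed-down Product is a stand-in for the fuller schemas above):

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    title: str = Field(description="Product name")
    price: float = Field(gt=0, description="Price in local currency")

# Use 1: generate JSON Schema to hand to an API
schema = Product.model_json_schema()
# the gt=0 constraint survives as an exclusiveMinimum keyword in the schema

# Use 2: validate whatever the LLM actually returned, at runtime
ok = Product.model_validate({"title": "Widget", "price": 19.99})

try:
    Product.model_validate({"title": "Widget", "price": -1})
except ValidationError:
    print("rejected: price must be > 0")
```

The runtime validation half matters more than it looks: it's your safety net against the ~95%-accuracy problem, turning a hallucinated field into a caught exception instead of a bad database row.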

For teams already in the Python ecosystem, Pydantic is the obvious pick. For polyglot setups or when you're passing schemas to APIs (Firecrawl, OpenAI directly), JSON Schema is more portable.

When to pick what

If your sources are a known, finite set of pages with stable HTML — CSS selectors, every time. Don't overthink it.

If you're handling diverse, unpredictable page layouts and can afford the per-page cost — LLM extraction (Firecrawl or direct API calls) saves you from writing and maintaining selectors.

If you need LLM extraction at any real scale — hybrid. Extract clean text first, then send the small payload to an LLM. The cost difference between raw-HTML-to-LLM and clean-text-to-LLM is not a rounding error; it's two orders of magnitude.

And if you're reading this thinking "I'll just use Firecrawl for everything" — sure, at small volumes. But check how many credits your workflow actually burns. The 5-credit-per-page cost for JSON extraction means a 100K-credit Standard plan only covers 20,000 pages per month with structured extraction, not the 100,000 the headline number suggests [2].

The tools aren't competing. They're layers. Use each where it fits.

Citations

  1. Firecrawl: Extract endpoint documentation. Retrieved March 27, 2026.

  2. Firecrawl: Billing and credit costs. Retrieved March 27, 2026.

  3. Firecrawl: Pricing plans. Retrieved March 27, 2026.

  4. LangChain: Structured output. Retrieved March 27, 2026.

  5. OpenAI: Structured outputs. Retrieved March 27, 2026.

  6. LangChain: create_extraction_chain API reference. Retrieved March 27, 2026.

  7. Crawl4AI: LLM-Free strategies. Retrieved March 27, 2026.

  8. OpenAI: API pricing. Retrieved March 27, 2026.

Updated: March 24, 2026