PyPI
Install via pip:
pip install contextractor
Extract content from a URL:
contextractor https://example.com
All CLI options are shared with the npm package; see the CLI reference for the full list of commands and the config file format.
Python library
The contextractor_engine package exposes a Python API for extracting content directly from HTML strings.
ContentExtractor
The main extraction class. Wraps Trafilatura with configurable presets.
from contextractor_engine import ContentExtractor
extractor = ContentExtractor()
The constructor accepts an optional config parameter:
from contextractor_engine import ContentExtractor, TrafilaturaConfig
extractor = ContentExtractor(config=TrafilaturaConfig.precision())
If no config is provided, TrafilaturaConfig.balanced() is used.
extract()
Extract content in a single format:
result = extractor.extract(html, url="https://example.com", output_format="txt")
if result:
    print(result.content)        # extracted text
    print(result.output_format)  # "txt"
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| html | str | required | Raw HTML string |
| url | str \| None | None | Source URL (improves extraction quality) |
| output_format | str | "txt" | Output format: txt, markdown, json, xml, xmltei |
Returns ExtractionResult or None if extraction fails.
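Because extract() can return None, callers need a guard before reading result.content. The following is a minimal sketch of one way to wrap that check, assuming only the extract() signature documented above. `Result`, `StubExtractor`, and `extract_text_or_fallback` are hypothetical names used for illustration; the stub stands in for ContentExtractor so the snippet runs without the package installed.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Result:
    """Hypothetical stand-in mirroring ExtractionResult's two documented fields."""
    content: str
    output_format: str


class SupportsExtract(Protocol):
    """Anything with an extract() method shaped like the one documented above."""
    def extract(self, html: str, url: Optional[str] = None,
                output_format: str = "txt") -> Optional[Result]: ...


def extract_text_or_fallback(extractor: SupportsExtract, html: str,
                             url: Optional[str] = None,
                             fallback: str = "") -> str:
    """Return the extracted text, or `fallback` when extract() yields None."""
    result = extractor.extract(html, url=url, output_format="txt")
    return result.content if result is not None else fallback


class StubExtractor:
    """Hypothetical extractor: fails on blank input, otherwise echoes it."""
    def extract(self, html, url=None, output_format="txt"):
        if not html.strip():
            return None
        return Result(content=html.strip(), output_format=output_format)


print(extract_text_or_fallback(StubExtractor(), "   ", fallback="n/a"))
```

With the real package, a ContentExtractor instance could be passed in place of the stub, provided its extract() behaves as described above.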
extract_metadata()
Extract metadata (title, author, date, etc.) from HTML:
meta = extractor.extract_metadata(html, url="https://example.com")
print(meta.title) # page title
print(meta.author) # author name
print(meta.date) # publication date
print(meta.description) # meta description
print(meta.sitename) # site name
print(meta.language) # detected language
All fields are str | None.
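Since any field may be None, it can help to filter the metadata down to the values that were actually found. The sketch below does this with the standard `dataclasses.fields` helper. `Meta` is a hypothetical stand-in that copies the six documented field names, and `present_fields` is an illustrative helper, not part of the package.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Meta:
    """Hypothetical stand-in with the six documented MetadataResult fields."""
    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    description: Optional[str] = None
    sitename: Optional[str] = None
    language: Optional[str] = None


def present_fields(meta) -> dict:
    """Keep only the metadata fields whose value is not None."""
    return {f.name: getattr(meta, f.name)
            for f in fields(meta)
            if getattr(meta, f.name) is not None}


meta = Meta(title="Example Domain", language="en")
print(present_fields(meta))  # {'title': 'Example Domain', 'language': 'en'}
```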
extract_all_formats()
Extract content in multiple formats at once:
results = extractor.extract_all_formats(html, url="https://example.com")
for fmt, result in results.items():
    print(f"{fmt}: {len(result.content)} chars")
Default formats: ["txt", "markdown", "json", "xml"]. Override with the formats parameter:
results = extractor.extract_all_formats(html, formats=["markdown", "json"])
Returns dict[str, ExtractionResult]. Failed extractions are omitted from the dict.
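The "omitted on failure" behavior means a requested format is not guaranteed to appear as a key. The sketch below illustrates that contract with a plain loop. It is not the package's implementation; `extract_formats` and `fake_extract` are hypothetical names, and the fake extractor simply fails for one format so the effect is visible.

```python
from typing import Optional

DEFAULT_FORMATS = ["txt", "markdown", "json", "xml"]  # defaults listed above


def extract_formats(extract_one, html: str,
                    formats: Optional[list] = None) -> dict:
    """Call extract_one(html, fmt) per format and skip any that return None."""
    results = {}
    for fmt in formats or DEFAULT_FORMATS:
        result = extract_one(html, fmt)
        if result is not None:
            results[fmt] = result
    return results


def fake_extract(html: str, fmt: str) -> Optional[str]:
    """Hypothetical extractor that fails for XML only."""
    return None if fmt == "xml" else f"[{fmt}] {html}"


results = extract_formats(fake_extract, "<p>Hi</p>")
print(sorted(results))  # xml is absent because its extraction returned None
```

Given this contract, looking up a format with `results.get("xml")` or an `in` check is safer than indexing with `results["xml"]`.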
TrafilaturaConfig
Configuration dataclass controlling extraction behavior. Three presets are available:
from contextractor_engine import TrafilaturaConfig
config = TrafilaturaConfig.balanced() # default — balanced precision/recall
config = TrafilaturaConfig.precision() # favor_precision=True — less noise
config = TrafilaturaConfig.recall() # favor_recall=True — more content
Custom configuration:
config = TrafilaturaConfig(
    favor_precision=True,
    include_links=False,
    include_tables=True,
    deduplicate=True,
    target_language="en",
)
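A preset can also serve as the starting point for a custom configuration. Because TrafilaturaConfig is described as a dataclass, the standard library's `dataclasses.replace` should be able to copy a preset with selected fields overridden; that is an assumption about the class, not something this page confirms. The sketch uses a hypothetical `MiniConfig` stand-in with a few of the documented fields so it runs on its own.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class MiniConfig:
    """Hypothetical stand-in with a few of the documented fields."""
    favor_precision: bool = False
    favor_recall: bool = False
    include_links: bool = True
    target_language: Optional[str] = None

    @classmethod
    def precision(cls) -> "MiniConfig":
        # Mirrors the documented preset: favor_precision=True, rest default.
        return cls(favor_precision=True)


# Start from the preset, then override individual fields on a copy.
base = MiniConfig.precision()
custom = replace(base, include_links=False, target_language="en")
print(custom)
```

`replace` returns a new object, so the preset instance itself is left unchanged.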
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| fast | bool | False | Fast extraction mode (less thorough) |
| favor_precision | bool | False | High precision, less noise |
| favor_recall | bool | False | High recall, more content |
| include_comments | bool | True | Include comments |
| include_tables | bool | True | Include tables |
| include_images | bool | False | Include image descriptions |
| include_formatting | bool | True | Preserve formatting |
| include_links | bool | True | Include links |
| deduplicate | bool | False | Deduplicate content |
| target_language | str \| None | None | Filter by language |
| with_metadata | bool | True | Extract metadata |
| only_with_metadata | bool | False | Only keep content with metadata |
| tei_validation | bool | False | Validate TEI output |
| prune_xpath | str \| list[str] \| None | None | XPath patterns to remove |
| url_blacklist | set[str] \| None | None | URLs to exclude from extraction |
| author_blacklist | set[str] \| None | None | Authors to exclude |
| date_extraction_params | dict \| None | None | Custom date extraction parameters |
from_json_dict()
Create a config from a camelCase or snake_case dictionary:
config = TrafilaturaConfig.from_json_dict({
    "favorPrecision": True,
    "includeLinks": False,
})
Unknown keys are ignored. Empty or None input returns balanced() defaults.
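The rules above (accept camelCase or snake_case, drop unknown keys, fall back to defaults on empty input) can be sketched in a few lines. This is an illustration of the described behavior, not the package's actual code. `MiniConfig`, `camel_to_snake`, and `config_from_dict` are hypothetical names, and the simple regex here would not handle keys with consecutive capitals.

```python
import re
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MiniConfig:
    """Hypothetical stand-in with three of the documented fields."""
    favor_precision: bool = False
    include_links: bool = True
    target_language: Optional[str] = None


def camel_to_snake(key: str) -> str:
    """favorPrecision -> favor_precision; snake_case keys pass through unchanged."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()


def config_from_dict(data: Optional[dict]) -> MiniConfig:
    """Build a MiniConfig, ignoring unknown keys; empty or None input gives defaults."""
    if not data:
        return MiniConfig()
    known = {f.name for f in fields(MiniConfig)}
    kwargs = {}
    for key, value in data.items():
        name = camel_to_snake(key)
        if name in known:  # silently skip anything the config does not define
            kwargs[name] = value
    return MiniConfig(**kwargs)


print(config_from_dict({"favorPrecision": True, "includeLinks": False, "bogus": 1}))
```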
ExtractionResult
Dataclass returned by extract() and extract_all_formats():
| Field | Type | Description |
|---|---|---|
| content | str | Extracted content |
| output_format | str | Format: txt, json, markdown, xml, xmltei |
MetadataResult
Dataclass returned by extract_metadata():
| Field | Type | Description |
|---|---|---|
| title | str \| None | Page title |
| author | str \| None | Author name |
| date | str \| None | Publication date |
| description | str \| None | Meta description |
| sitename | str \| None | Site name |
| language | str \| None | Detected language |
Updated: April 16, 2026