PyPI

Install via pip:

pip install contextractor

Extract content from a URL:

contextractor https://example.com

All CLI options are shared with the npm package. See the CLI reference for the full list of commands and the config file format.

Python library

The contextractor_engine package exposes a Python API for extracting content directly from HTML strings.

ContentExtractor

The main extraction class. Wraps Trafilatura with configurable presets.

from contextractor_engine import ContentExtractor

extractor = ContentExtractor()

The constructor accepts an optional config parameter:

from contextractor_engine import ContentExtractor, TrafilaturaConfig

extractor = ContentExtractor(config=TrafilaturaConfig.precision())

If no config is provided, TrafilaturaConfig.balanced() is used.

extract()

Extract content in a single format:

result = extractor.extract(html, url="https://example.com", output_format="txt")

if result:
    print(result.content)        # extracted text
    print(result.output_format)  # "txt"

Parameters:

Parameter      Type        Default   Description
html           str         required  Raw HTML string
url            str | None  None      Source URL (improves extraction quality)
output_format  str         "txt"     Output format: txt, markdown, json, xml, xmltei

Returns an ExtractionResult, or None if extraction fails.

extract_metadata()

Extract metadata (title, author, date, etc.) from HTML:

meta = extractor.extract_metadata(html, url="https://example.com")

print(meta.title)        # page title
print(meta.author)       # author name
print(meta.date)         # publication date
print(meta.description)  # meta description
print(meta.sitename)     # site name
print(meta.language)     # detected language

All fields are str | None.
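
Since any metadata field may be None, code that formats metadata should guard each access. A minimal sketch, assuming only the field names listed above: MetaStandIn is a hypothetical stand-in for the object returned by extract_metadata() so the example runs without the package, and byline is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaStandIn:
    """Stand-in mirroring three of the documented metadata fields (assumption)."""

    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None


def byline(meta) -> str:
    """Build a one-line summary, skipping fields that are None or empty."""
    parts = [meta.title or "Untitled"]
    if meta.author:
        parts.append(f"by {meta.author}")
    if meta.date:
        parts.append(f"({meta.date})")
    return " ".join(parts)


print(byline(MetaStandIn(title="Example Domain", date="2024-01-01")))
print(byline(MetaStandIn()))
```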

extract_all_formats()

Extract content in multiple formats at once:

results = extractor.extract_all_formats(html, url="https://example.com")

for fmt, result in results.items():
    print(f"{fmt}: {len(result.content)} chars")

Default formats: ["txt", "markdown", "json", "xml"]. Override with the formats parameter:

results = extractor.extract_all_formats(html, formats=["markdown", "json"])

Returns dict[str, ExtractionResult]. Failed extractions are omitted from the dict.
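
Because failed formats are silently left out of the returned dict, it can help to compare the keys against what was requested. A minimal sketch under that assumption: missing_formats is a hypothetical helper, and the plain dict below stands in for the value returned by extract_all_formats().

```python
from typing import Any, Dict, Iterable, List


def missing_formats(requested: Iterable[str], results: Dict[str, Any]) -> List[str]:
    """Return the requested formats that are absent from the results dict."""
    return [fmt for fmt in requested if fmt not in results]


# Stand-in for an extract_all_formats() return value; with the real library
# the values would be ExtractionResult objects rather than strings.
results = {"markdown": "# Title"}
print(missing_formats(["markdown", "json"], results))
```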

TrafilaturaConfig

Configuration dataclass controlling extraction behavior. Three presets are available:

from contextractor_engine import TrafilaturaConfig

config = TrafilaturaConfig.balanced()   # default — balanced precision/recall
config = TrafilaturaConfig.precision()  # favor_precision=True — less noise
config = TrafilaturaConfig.recall()     # favor_recall=True — more content

Custom configuration:

config = TrafilaturaConfig(
    favor_precision=True,
    include_links=False,
    include_tables=True,
    deduplicate=True,
    target_language="en",
)

Fields

Field                   Type                    Default  Description
fast                    bool                    False    Fast extraction mode (less thorough)
favor_precision         bool                    False    High precision, less noise
favor_recall            bool                    False    High recall, more content
include_comments        bool                    True     Include comments
include_tables          bool                    True     Include tables
include_images          bool                    False    Include image descriptions
include_formatting      bool                    True     Preserve formatting
include_links           bool                    True     Include links
deduplicate             bool                    False    Deduplicate content
target_language         str | None              None     Filter by language
with_metadata           bool                    True     Extract metadata
only_with_metadata      bool                    False    Only keep content with metadata
tei_validation          bool                    False    Validate TEI output
prune_xpath             str | list[str] | None  None     XPath patterns to remove
url_blacklist           set[str] | None         None     URLs to exclude from extraction
author_blacklist        set[str] | None         None     Authors to exclude
date_extraction_params  dict | None             None     Custom date extraction parameters

from_json_dict()

Create a config from a camelCase or snake_case dictionary:

config = TrafilaturaConfig.from_json_dict({
    "favorPrecision": True,
    "includeLinks": False,
})

Unknown keys are ignored. Empty or None input returns balanced() defaults.
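
To make the accepted input shapes concrete, here is one way the key handling described above could work. This is an illustrative sketch only, not the library's actual implementation: camel_to_snake and normalize_keys are hypothetical names, and the set of known field names is passed in by the caller.

```python
import re
from typing import Any, Dict, Optional, Set


def camel_to_snake(key: str) -> str:
    """Convert 'favorPrecision' to 'favor_precision'; snake_case passes through."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def normalize_keys(raw: Optional[Dict[str, Any]], known: Set[str]) -> Dict[str, Any]:
    """Snake-case every key, then drop keys that are not in `known`."""
    if not raw:
        return {}  # empty or None input: caller falls back to defaults
    normalized = {camel_to_snake(k): v for k, v in raw.items()}
    return {k: v for k, v in normalized.items() if k in known}


known_fields = {"favor_precision", "include_links"}
print(normalize_keys({"favorPrecision": True, "include_links": False, "bogus": 1}, known_fields))
```

The resulting dict uses the snake_case field names from the Fields table, so mixed camelCase and snake_case input ends up in one consistent form.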

ExtractionResult

Dataclass returned by extract() and extract_all_formats():

Field          Type  Description
content        str   Extracted content
output_format  str   Format: txt, markdown, json, xml, xmltei

MetadataResult

Dataclass returned by extract_metadata():

Field        Type        Description
title        str | None  Page title
author       str | None  Author name
date         str | None  Publication date
description  str | None  Meta description
sitename     str | None  Site name
language     str | None  Detected language

Updated: April 16, 2026