distillcore

extract()

Extract text from a file without running the full pipeline.

Signature

def extract(
    source: str | Path,
    format: str | None = None,
    config: DistillConfig | None = None,
) -> ExtractionResult:

Parameters

Parameter   Type                   Default    Description
source      str | Path             required   File path to extract from
format      str | None             None       Override format detection
config      DistillConfig | None   None       Configuration (for allowed_dirs validation)

Returns

class ExtractionResult:
    full_text: str               # complete extracted text
    pages_text: list[PageText]   # text broken down by page
    page_count: int              # number of pages
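To make the shape of the return value concrete, the sketch below works on a local stand-in for ExtractionResult rather than on distillcore itself. The two dataclasses are mock-ups that mirror the three documented attributes. The PageText fields (page_number, text) and the summarize helper are assumptions made for illustration, since this page does not document them.

```python
from dataclasses import dataclass, field


@dataclass
class PageText:
    # Hypothetical fields: this page does not document PageText.
    page_number: int
    text: str


@dataclass
class ExtractionResult:
    # Local stand-in mirroring the three documented attributes.
    full_text: str
    pages_text: list[PageText] = field(default_factory=list)
    page_count: int = 0


def summarize(result: ExtractionResult, preview_chars: int = 40) -> list[str]:
    """Return one short preview line per page (illustrative helper)."""
    lines = []
    for page in result.pages_text:
        snippet = page.text[:preview_chars].replace("\n", " ")
        lines.append(f"page {page.page_number}/{result.page_count}: {snippet}")
    return lines


# Build a small fake result by hand instead of calling extract().
pages = [PageText(1, "Quarterly report"), PageText(2, "Revenue grew.")]
demo = ExtractionResult(
    full_text="\n".join(p.text for p in pages),
    pages_text=pages,
    page_count=len(pages),
)
summary = summarize(demo)
print("\n".join(summary))
```

With a real result from extract(), the same loop over pages_text should apply, provided the actual PageText attributes are substituted for the assumed ones.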

Examples

from distillcore import extract

result = extract("report.pdf")
print(result.full_text[:200])
print(f"{result.page_count} pages")

Extractor Registry

from distillcore import register_extractor

# MyCustomExtractor is your own extractor class (see Extractors)
register_extractor(MyCustomExtractor())

See Extractors for details on writing custom extractors.
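As a rough mental model of how a registry and a format override can interact, here is a toy sketch. It is not distillcore's implementation: ExtractorRegistry, PlainTextExtractor, the formats attribute, and the resolve method are all hypothetical names chosen for this example, and the real extractor interface is the one described on the Extractors page.

```python
from pathlib import Path


class ExtractorRegistry:
    """Toy registry: maps a lowercase format name to an extractor object."""

    def __init__(self):
        self._extractors = {}

    def register(self, extractor):
        # Hypothetical contract: each extractor exposes `formats` and `extract`.
        for fmt in extractor.formats:
            self._extractors[fmt.lower()] = extractor

    def resolve(self, source, format=None):
        # An explicit format wins; otherwise fall back to the file suffix.
        fmt = (format or Path(source).suffix.lstrip(".")).lower()
        try:
            return self._extractors[fmt]
        except KeyError:
            raise ValueError(f"no extractor registered for format {fmt!r}") from None


class PlainTextExtractor:
    """Hypothetical extractor that reads UTF-8 text files."""

    formats = ("txt", "md")

    def extract(self, source):
        return Path(source).read_text(encoding="utf-8")


registry = ExtractorRegistry()
registry.register(PlainTextExtractor())

chosen = registry.resolve("notes.TXT")               # detected from the suffix
forced = registry.resolve("notes.dat", format="md")  # explicit override
```

The design choice illustrated here is that detection is only a fallback: passing a format skips suffix inspection entirely, which is one plausible reading of "Override format detection" in the parameter table.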