distillcore

process_document()

Process a file through the full 7-stage pipeline.

Signature

def process_document(
    source: str | Path,
    *,
    config: DistillConfig | None = None,
    format: str | None = None,
    embed: bool = True,
) -> ProcessingResult:

Parameters

Parameter  Type                  Default   Description
source     str | Path            required  File path to process
config     DistillConfig | None  None      Pipeline configuration (uses defaults if None)
format     str | None            None      Override format detection (e.g., "pdf", "docx")
embed      bool                  True      Whether to generate embeddings
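
The bare `*` in the signature makes every option after `source` keyword-only. The sketch below illustrates that calling convention with a stand-in function: `process_document_stub` is hypothetical, is not part of distillcore, and only mirrors the documented parameter layout so the behavior can be tried without the library installed.

```python
from __future__ import annotations

from pathlib import Path


def process_document_stub(
    source: str | Path,
    *,
    config: object | None = None,
    format: str | None = None,
    embed: bool = True,
) -> dict:
    """Hypothetical stand-in with the same parameter layout as the documented signature."""
    # Echo the arguments back so the call can be inspected; no real processing happens.
    return {"source": Path(source), "format": format, "embed": embed}


# Options are passed by keyword; both str and Path sources are accepted.
ok = process_document_stub(Path("report.pdf"), format="pdf", embed=False)

# Passing an option positionally raises TypeError because of the bare `*`.
try:
    process_document_stub("report.pdf", None)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Keyword-only options keep call sites readable: `embed=False` states its intent where a bare positional `False` would not.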

Returns

ProcessingResult with:

class ProcessingResult:
    document: Document            # structured document with sections
    chunks: list[DocumentChunk]   # chunked and optionally embedded/enriched
    validation: ValidationReport  # coverage metrics at each stage
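
As a rough illustration of consuming the result, the helper below touches only the three documented attribute names (`document`, `chunks`, `validation`). The `summarize` function and the `SimpleNamespace` stand-in are assumptions made for this sketch. The internal fields of `Document`, `DocumentChunk`, and `ValidationReport` are not described here, so the sketch does not rely on them.

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Return coarse facts about a result using only its three documented fields."""
    return {
        "has_document": result.document is not None,
        "chunk_count": len(result.chunks),  # chunks is documented as a list
        "has_validation": result.validation is not None,
    }


# Stand-in object with the documented attribute names; in real use,
# the ProcessingResult returned by process_document() would be passed instead.
fake_result = SimpleNamespace(
    document=object(),
    chunks=["chunk-1", "chunk-2"],
    validation=object(),
)
summary = summarize(fake_result)
```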

Examples

Basic usage

from distillcore import process_document

result = process_document("report.pdf")

With configuration

from distillcore import process_document, DistillConfig, ChunkConfig

result = process_document(
    "report.pdf",
    config=DistillConfig(
        chunk=ChunkConfig(target_tokens=300),
        enrich_chunks=False,
    ),
)

With custom embedding function

from distillcore import process_document, DistillConfig, EmbeddingConfig
from distillcore.embedding import ollama_embedder

config = DistillConfig(
    embedding=EmbeddingConfig(embed_fn=ollama_embedder()),
)
result = process_document("report.pdf", config=config)

Without embeddings

from distillcore import process_document

result = process_document("report.pdf", embed=False)

Async Version

from distillcore import process_document_async

# Must run inside a coroutine (or an async-aware REPL/notebook).
result = await process_document_async("report.pdf")

process_document_async() takes the same parameters as process_document() and returns the same ProcessingResult. Extraction is offloaded to a thread to avoid blocking the event loop.
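
The general pattern of moving blocking work onto a thread can be sketched with the standard library alone. This is not distillcore's implementation: `blocking_extract` and `extract_async` are hypothetical stand-ins, and `asyncio.to_thread` (Python 3.9+) is one common way to get this behavior.

```python
import asyncio
import time


def blocking_extract(path: str) -> str:
    """Hypothetical stand-in for a slow, synchronous extraction step."""
    time.sleep(0.05)  # simulate blocking I/O or parsing
    return f"text extracted from {path}"


async def extract_async(path: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_extract, path)


async def main() -> list:
    # Several documents can be in flight at once; gather preserves input order.
    return await asyncio.gather(
        extract_async("a.pdf"),
        extract_async("b.pdf"),
    )


results = asyncio.run(main())
```

Because each blocking call runs in its own worker thread, the event loop remains free to schedule other coroutines while extraction is in progress.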