process_document()
Process a file through the full 7-stage pipeline.
Signature
def process_document(
    source: str | Path,
    *,
    config: DistillConfig | None = None,
    format: str | None = None,
    embed: bool = True,
) -> ProcessingResult:
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `source` | `str \| Path` | required | File path to process |
| `config` | `DistillConfig \| None` | `None` | Pipeline configuration (uses defaults if `None`) |
| `format` | `str \| None` | `None` | Override format detection (e.g., `"pdf"`, `"docx"`) |
| `embed` | `bool` | `True` | Whether to generate embeddings |
Returns
ProcessingResult with:
class ProcessingResult:
    document: Document            # structured document with sections
    chunks: list[DocumentChunk]   # chunked and optionally embedded/enriched
    validation: ValidationReport  # coverage metrics at each stage
Examples
Basic usage
from distillcore import process_document
result = process_document("report.pdf")
With configuration
from distillcore import process_document, DistillConfig, ChunkConfig
result = process_document(
    "report.pdf",
    config=DistillConfig(
        chunk=ChunkConfig(target_tokens=300),
        enrich_chunks=False,
    ),
)
With custom embedding function
from distillcore import process_document, DistillConfig, EmbeddingConfig
from distillcore.embedding import ollama_embedder
config = DistillConfig(
    embedding=EmbeddingConfig(embed_fn=ollama_embedder()),
)
result = process_document("report.pdf", config=config)
Without embeddings
result = process_document("report.pdf", embed=False)
Async Version
from distillcore import process_document_async
result = await process_document_async("report.pdf")
process_document_async() takes the same parameters and returns the same type as process_document(). Extraction is offloaded to a thread to avoid blocking the event loop.
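The thread offloading mentioned above can be illustrated with the standard library alone. The sketch below does not call distillcore: `blocking_extract` is a hypothetical stand-in for a slow, synchronous extraction step, and `asyncio.to_thread` is one common way to offload such work. The source does not say which mechanism `process_document_async` actually uses, so treat this as an illustration of the general pattern rather than the library's implementation.

```python
import asyncio
import time


def blocking_extract(path: str) -> str:
    """Hypothetical stand-in for a slow, synchronous extraction step."""
    time.sleep(0.05)  # simulate blocking I/O or CPU-heavy parsing
    return f"text from {path}"


async def extract_async(path: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    # to service other coroutines while extraction is in progress.
    return await asyncio.to_thread(blocking_extract, path)


async def extract_many(paths: list[str]) -> list[str]:
    # gather() runs the extractions concurrently and returns results
    # in the same order as the input paths.
    return await asyncio.gather(*(extract_async(p) for p in paths))


def run_demo() -> list[str]:
    # Entry point for synchronous callers (requires Python 3.9+ for to_thread).
    return asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"]))
```

With the real library, the analogous pattern would presumably be to pass several `process_document_async(...)` calls to `asyncio.gather`; whether concurrent calls are safe or beneficial is not stated in the source and should be checked before relying on it.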