Configuration

distillcore uses four configuration classes to control pipeline behavior.

DistillConfig

The top-level configuration object. All fields have sensible defaults.

from distillcore import ChunkConfig, DistillConfig, DomainConfig, EmbeddingConfig

config = DistillConfig(
    # LLM settings (requires distillcore[openai])
    openai_api_key="sk-...",   # or set OPENAI_API_KEY env var
    openai_model="gpt-4o",     # model for classification/structuring/enrichment
    max_tokens=16384,          # max tokens per LLM call

    # Pipeline stages
    chunk=ChunkConfig(...),
    embedding=EmbeddingConfig(...),
    domain=DomainConfig(...),

    # Feature flags
    enrich_chunks=True,        # enable LLM enrichment
    enable_ocr=True,           # OCR fallback for PDFs

    # Large document handling
    large_doc_char_threshold=80_000,
    llm_page_window_size=15,
    llm_page_window_overlap=2,

    # Validation thresholds
    structuring_coverage_threshold=0.95,
    chunking_coverage_threshold=0.98,
    end_to_end_coverage_threshold=0.93,

    # Security
    allowed_dirs=None,         # restrict file access paths

    # Progress callback
    on_progress=None,          # Callable[[str, dict], None]
)
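The on_progress hook is useful for lightweight logging. A minimal sketch, assuming the pipeline invokes the callback with a stage name and a details dict per the Callable[[str, dict], None] signature above (the exact stage names and dict contents are pipeline-defined):

from distillcore import DistillConfig

def log_progress(stage: str, info: dict) -> None:
    # Called by the pipeline as it advances through stages.
    print(f"[{stage}] {info}")

config = DistillConfig(on_progress=log_progress)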

API key resolution

DistillConfig.resolve_api_key() checks in order:

  1. openai_api_key field
  2. OPENAI_API_KEY environment variable
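As a sketch of the precedence, assuming resolve_api_key() returns the resolved key string (and None when neither source is set):

import os
from distillcore import DistillConfig

os.environ["OPENAI_API_KEY"] = "sk-from-env"

# The explicit field wins over the environment variable.
config = DistillConfig(openai_api_key="sk-from-field")
assert config.resolve_api_key() == "sk-from-field"

# With no field set, the environment variable is used.
assert DistillConfig().resolve_api_key() == "sk-from-env"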

Validation

warnings = config.validate() # Returns list of warning strings (e.g., missing API key)
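A typical pattern is to surface warnings at startup rather than failing hard; a small sketch:

from distillcore import DistillConfig

config = DistillConfig()  # e.g., no API key set anywhere
for warning in config.validate():
    print(f"distillcore config warning: {warning}")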

ChunkConfig

Controls how documents are split into chunks.

from distillcore import ChunkConfig

chunk_config = ChunkConfig(
    target_tokens=500,   # target chunk size in tokens
    overlap_chars=200,   # character overlap between chunks
    max_tokens=1000,     # hard maximum chunk size
    min_tokens=0,        # merge chunks below this (0 = disabled)
    strategy="auto",     # "auto", "paragraph", "sentence", "fixed", "llm"
    tokenizer=None,      # custom Callable[[str], int] for token counting
)
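The tokenizer field lets you align chunk sizing with your model's actual tokenization. A sketch using tiktoken (an assumption; any Callable[[str], int] works):

import tiktoken
from distillcore import ChunkConfig

enc = tiktoken.get_encoding("cl100k_base")

chunk_config = ChunkConfig(
    target_tokens=500,
    tokenizer=lambda text: len(enc.encode(text)),  # count tokens with tiktoken
)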

Strategy options

Strategy       Description
"auto"         (Default) Section-aware: transcripts → sections → fallback paragraph splitting
"paragraph"    Split on paragraph boundaries with cascading subsplit for oversized blocks
"sentence"     Split on sentence boundaries, greedily fill to target size
"fixed"        Sliding window at word boundaries with overlap
"llm"          LLM-driven semantic grouping (requires distillcore[openai])

See Chunking for details on each strategy.

EmbeddingConfig

Controls embedding generation.

from distillcore import EmbeddingConfig

embedding_config = EmbeddingConfig(
    model="text-embedding-3-small",  # OpenAI model name
    embed_fn=None,                   # custom embedding function
)

The embed_fn field accepts any callable with signature (list[str]) -> list[list[float]].
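This makes it possible to plug in a local model instead of OpenAI. A sketch assuming sentence-transformers is installed (the model name is illustrative):

from sentence_transformers import SentenceTransformer
from distillcore import EmbeddingConfig

_model = SentenceTransformer("all-MiniLM-L6-v2")

def local_embed(texts: list[str]) -> list[list[float]]:
    # encode() returns a numpy array; convert to plain lists of floats
    return _model.encode(texts).tolist()

embedding_config = EmbeddingConfig(embed_fn=local_embed)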

DomainConfig

Controls domain-specific LLM prompts for classification, structuring, and enrichment.

from distillcore import DomainConfig

domain_config = DomainConfig(
    name="generic",
    classification_prompt="...",
    structuring_prompt="...",
    transcript_prompt="...",
    enrichment_prompt="...",
    parse_classification=None,  # custom parser function
)

In practice, use presets rather than configuring DomainConfig directly.

Putting it Together

from distillcore import (
    DistillConfig,
    ChunkConfig,
    EmbeddingConfig,
    process_document,
)

config = DistillConfig(
    chunk=ChunkConfig(target_tokens=300, overlap_chars=100, strategy="paragraph"),
    embedding=EmbeddingConfig(model="text-embedding-3-large"),
    enrich_chunks=False,  # skip enrichment for speed
)

result = process_document("report.pdf", config=config)