# Configuration
distillcore uses four configuration classes to control pipeline behavior.
## DistillConfig
The top-level configuration object. All fields have sensible defaults.
```python
from distillcore import DistillConfig

config = DistillConfig(
    # LLM settings (requires distillcore[openai])
    openai_api_key="sk-...",    # or set OPENAI_API_KEY env var
    openai_model="gpt-4o",      # model for classification/structuring/enrichment
    max_tokens=16384,           # max tokens per LLM call

    # Pipeline stages
    chunk=ChunkConfig(...),
    embedding=EmbeddingConfig(...),
    domain=DomainConfig(...),

    # Feature flags
    enrich_chunks=True,    # enable LLM enrichment
    enable_ocr=True,       # OCR fallback for PDFs

    # Large document handling
    large_doc_char_threshold=80_000,
    llm_page_window_size=15,
    llm_page_window_overlap=2,

    # Validation thresholds
    structuring_coverage_threshold=0.95,
    chunking_coverage_threshold=0.98,
    end_to_end_coverage_threshold=0.93,

    # Security
    allowed_dirs=None,    # restrict file access paths

    # Progress callback
    on_progress=None,     # Callable[[str, dict], None]
)
```

### API key resolution
`DistillConfig.resolve_api_key()` checks, in order:

1. The `openai_api_key` field
2. The `OPENAI_API_KEY` environment variable
### Validation
```python
warnings = config.validate()
# Returns a list of warning strings (e.g., missing API key)
```

## ChunkConfig
Controls how documents are split into chunks.
```python
from distillcore import ChunkConfig

chunk_config = ChunkConfig(
    target_tokens=500,    # target chunk size in tokens
    overlap_chars=200,    # character overlap between chunks
    max_tokens=1000,      # hard maximum chunk size
    min_tokens=0,         # merge chunks below this (0 = disabled)
    strategy="auto",      # "auto", "paragraph", "sentence", "fixed", "llm"
    tokenizer=None,       # custom Callable[[str], int] for token counting
)
```

### Strategy options
| Strategy | Description |
|---|---|
| `"auto"` | (Default) Section-aware: transcripts → sections → fallback paragraph splitting |
| `"paragraph"` | Split on paragraph boundaries, with cascading subsplit for oversized blocks |
| `"sentence"` | Split on sentence boundaries, greedily filling to the target size |
| `"fixed"` | Sliding window at word boundaries, with overlap |
| `"llm"` | LLM-driven semantic grouping (requires `distillcore[openai]`) |
See Chunking for details on each strategy.
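If the default token counting doesn't match your model, the `tokenizer` field accepts any `Callable[[str], int]`. A minimal sketch using a deliberately naive whitespace count (a real setup would use the model's own tokenizer, e.g. via `tiktoken`):

```python
def count_tokens(text: str) -> int:
    """Naive token count: whitespace-separated words."""
    return len(text.split())

# Wire it into chunking:
# chunk_config = ChunkConfig(target_tokens=500, tokenizer=count_tokens)
```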
## EmbeddingConfig
Controls embedding generation.
```python
from distillcore import EmbeddingConfig

embedding_config = EmbeddingConfig(
    model="text-embedding-3-small",    # OpenAI model name
    embed_fn=None,                     # custom embedding function
)
```

The `embed_fn` field accepts any callable with signature `(list[str]) -> list[list[float]]`.
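For example, a deterministic toy embedder matching that signature (illustration only; a real `embed_fn` would call an embedding model or a local encoder):

```python
import hashlib

def toy_embed(texts: list[str]) -> list[list[float]]:
    """Map each text to a fixed 4-dim vector derived from its SHA-256 hash."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:4]])
    return vectors

# embedding_config = EmbeddingConfig(embed_fn=toy_embed)  # bypasses the OpenAI model
```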
## DomainConfig
Controls domain-specific LLM prompts for classification, structuring, and enrichment.
```python
from distillcore import DomainConfig

domain_config = DomainConfig(
    name="generic",
    classification_prompt="...",
    structuring_prompt="...",
    transcript_prompt="...",
    enrichment_prompt="...",
    parse_classification=None,    # custom parser function
)
```

In practice, use presets rather than configuring `DomainConfig` directly.
## Putting it Together
```python
from distillcore import (
    DistillConfig,
    ChunkConfig,
    EmbeddingConfig,
    process_document,
)

config = DistillConfig(
    chunk=ChunkConfig(target_tokens=300, overlap_chars=100, strategy="paragraph"),
    embedding=EmbeddingConfig(model="text-embedding-3-large"),
    enrich_chunks=False,    # skip enrichment for speed
)

result = process_document("report.pdf", config=config)
```
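To monitor a long run, the `on_progress` hook (`Callable[[str, dict], None]`) receives a stage name and a payload dict. A minimal sketch that collects events; the exact stage names and payload keys are not specified here, so treat those as assumptions:

```python
events: list[tuple[str, dict]] = []

def record_progress(stage: str, info: dict) -> None:
    """Collect (stage, payload) events for later inspection."""
    events.append((stage, info))

# config = DistillConfig(on_progress=record_progress)
```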