# Configuration
distillcore uses four configuration classes to control pipeline behavior.
## DistillConfig
The top-level configuration object. All fields have sensible defaults.
```python
from distillcore import DistillConfig

config = DistillConfig(
    # LLM settings (requires distillcore[openai])
    openai_api_key="sk-...",    # or set OPENAI_API_KEY env var
    openai_model="gpt-4o",      # model for classification/structuring/enrichment
    max_tokens=16384,           # max tokens per LLM call

    # Pipeline stages
    chunk=ChunkConfig(...),
    embedding=EmbeddingConfig(...),
    domain=DomainConfig(...),

    # Feature flags
    enrich_chunks=True,    # enable LLM enrichment
    enable_ocr=True,       # OCR fallback for PDFs

    # Large document handling
    large_doc_char_threshold=80_000,
    llm_page_window_size=15,
    llm_page_window_overlap=2,

    # Validation thresholds
    structuring_coverage_threshold=0.95,
    chunking_coverage_threshold=0.98,
    end_to_end_coverage_threshold=0.93,

    # Security
    allowed_dirs=None,    # restrict file access paths

    # Progress callback
    on_progress=None,     # Callable[[str, dict], None]
)
```

### API key resolution
`DistillConfig.resolve_api_key()` checks, in order:

1. The `openai_api_key` field
2. The `OPENAI_API_KEY` environment variable
### Validation
```python
warnings = config.validate()
# Returns a list of warning strings (e.g., missing API key)
```

## ChunkConfig
Controls how documents are split into chunks.
```python
from distillcore import ChunkConfig

chunk_config = ChunkConfig(
    target_tokens=500,    # target chunk size in tokens
    overlap_chars=200,    # character overlap between chunks
    max_tokens=1000,      # hard maximum chunk size
    min_tokens=0,         # merge chunks below this (0 = disabled)
    strategy="auto",      # "auto", "paragraph", "sentence", "fixed", "llm"
    tokenizer=None,       # custom Callable[[str], int] for token counting
)
```

### Strategy options
| Strategy | Description |
|---|---|
| `"auto"` | (Default) Section-aware: transcripts → sections → fallback paragraph splitting |
| `"paragraph"` | Split on paragraph boundaries, with cascading subsplit for oversized blocks |
| `"sentence"` | Split on sentence boundaries, greedily filling to the target size |
| `"fixed"` | Sliding window at word boundaries, with overlap |
| `"llm"` | LLM-driven semantic grouping (requires `distillcore[openai]`) |
See Chunking for details on each strategy.
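If the default token counting doesn't match your model, the `tokenizer` field accepts any `Callable[[str], int]`. A minimal sketch using a deliberately naive whitespace count (a real setup would use the model's own tokenizer, e.g. via `tiktoken`):

```python
def count_tokens(text: str) -> int:
    """Naive token count: whitespace-separated words."""
    return len(text.split())

# Wire it into chunking:
# chunk_config = ChunkConfig(target_tokens=500, tokenizer=count_tokens)
```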
## EmbeddingConfig
Controls embedding generation.
```python
from distillcore import EmbeddingConfig

embedding_config = EmbeddingConfig(
    model="text-embedding-3-small",    # OpenAI model name
    embed_fn=None,                     # custom embedding function
)
```

The `embed_fn` field accepts any callable with signature `(list[str]) -> list[list[float]]`.
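For example, a deterministic toy embedder matching that signature (illustration only; a real `embed_fn` would call an embedding model or a local encoder):

```python
import hashlib

def toy_embed(texts: list[str]) -> list[list[float]]:
    """Map each text to a fixed 4-dim vector derived from its SHA-256 hash."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:4]])
    return vectors

# embedding_config = EmbeddingConfig(embed_fn=toy_embed)  # bypasses the OpenAI model
```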
## DomainConfig
Controls domain-specific LLM prompts for classification, structuring, and enrichment.
```python
from distillcore import DomainConfig

domain_config = DomainConfig(
    name="generic",
    classification_prompt="...",
    structuring_prompt="...",
    transcript_prompt="...",
    enrichment_prompt="...",
    parse_classification=None,    # custom parser function
)
```

In practice, use presets rather than configuring `DomainConfig` directly.
## Putting it Together
```python
from distillcore import (
    DistillConfig,
    ChunkConfig,
    EmbeddingConfig,
    process_document,
)

config = DistillConfig(
    chunk=ChunkConfig(target_tokens=300, overlap_chars=100, strategy="paragraph"),
    embedding=EmbeddingConfig(model="text-embedding-3-large"),
    enrich_chunks=False,    # skip enrichment for speed
)

result = process_document("report.pdf", config=config)
```
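To monitor a long run, the `on_progress` hook (`Callable[[str, dict], None]`) receives a stage name and a payload dict. A minimal sketch that collects events; the exact stage names and payload keys are not specified here, so treat those as assumptions:

```python
events: list[tuple[str, dict]] = []

def record_progress(stage: str, info: dict) -> None:
    """Collect (stage, payload) events for later inspection."""
    events.append((stage, info))

# config = DistillConfig(on_progress=record_progress)
```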