Getting Started
Installation
# Core library (chunking, extraction, validation, storage — no API key needed)
pip install distillcore
# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
pip install distillcore[openai]
# With file format support
pip install distillcore[pdf] # PDF extraction
pip install distillcore[docx] # DOCX extraction
pip install distillcore[html] # HTML extraction
# Everything
pip install distillcore[all]
For alternative embedding providers:
pip install distillcore[local] # sentence-transformers
pip install distillcore[cohere] # Cohere embeddings
For the agent layer:
pip install distillcore-agents
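To confirm the install before moving on, a plain import is enough; this is only a sanity check and assumes nothing beyond the package name used throughout these examples:
import distillcore
print("distillcore imported successfully")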
Prerequisites
- Python 3.11+
- An OpenAI API key for LLM-powered stages (classification, structuring, enrichment) and OpenAI embeddings; not required for standalone chunking, text extraction, validation, or storage.
Set your API key:
export OPENAI_API_KEY="sk-..."
Or pass it in config:
from distillcore import process_document, DistillConfig
config = DistillConfig(openai_api_key="sk-...")
result = process_document("report.pdf", config=config)
Standalone Chunking (No API Key)
The fastest way to get started — chunk text without any LLM calls:
from distillcore import chunk, estimate_tokens
text = open("document.txt").read()
# Paragraph strategy (default) — splits on paragraph boundaries
chunks = chunk(text, strategy="paragraph", target_tokens=500)
# Sentence strategy — splits on sentence boundaries
chunks = chunk(text, strategy="sentence", target_tokens=300)
# Fixed strategy — sliding window with overlap
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)
# Check token estimates
for i, c in enumerate(chunks):
print(f"Chunk {i}: {estimate_tokens(c)} tokens")There’s also an LLM-driven strategy for semantic chunking:
# LLM strategy — GPT-4o groups sentences by topic (requires API key)
chunks = chunk(text, strategy="llm", api_key="sk-...")And an async version:
from distillcore import achunk
chunks = await achunk(text, strategy="paragraph", target_tokens=500)
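The await call above assumes you are already inside a coroutine. In a plain script, wrap it with asyncio.run; a minimal sketch:
import asyncio
from distillcore import achunk
async def main() -> None:
    text = open("document.txt").read()
    chunks = await achunk(text, strategy="paragraph", target_tokens=500)
    print(f"{len(chunks)} chunks")
asyncio.run(main())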
First Pipeline Run
Process a file
from distillcore import process_document
result = process_document("report.pdf")
print(result.document.metadata.document_type) # e.g., "report"
print(result.document.metadata.document_title) # e.g., "Q4 Financial Report"
print(f"Sections: {len(result.document.sections)}")
print(f"Chunks: {len(result.chunks)}")
print(f"Coverage: {result.validation.end_to_end_coverage:.1%}")Process raw text
Process raw text
from distillcore import process_text, DistillConfig
result = process_text(
"The court finds that the defendant...",
config=DistillConfig(openai_api_key="sk-..."),
)With embeddings
from distillcore import process_document
result = process_document(
"report.pdf",
embed=True, # uses OpenAI text-embedding-3-small by default
)
# Each chunk now has an embedding vector
print(len(result.chunks[0].embedding))  # e.g., 1536
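Because each embedding is just a vector of floats, chunks can be compared directly; a small sketch using only the standard library, assuming the document produced at least two chunks:
import math
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
# Similarity between the first two chunks of the processed document.
print(cosine(result.chunks[0].embedding, result.chunks[1].embedding))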
Save and search
from distillcore import Store
from distillcore.llm.client import embed_texts
store = Store()
doc_id = store.save(result)
# Generate a query embedding and search
query_emb = embed_texts(["financial reporting requirements"])[0]
results = store.search(query_emb, top_k=5)
for r in results:
print(r["score"], r["text"][:100])Using the Agent Layer
Using the Agent Layer
from distillcore_agents import Orchestrator
async with Orchestrator(openai_api_key="sk-...") as orch:
    result = await orch.process_one("contract.pdf")
    print(result.triage.preset)           # "legal"
    print(result.processing.chunk_count)  # 24
    print(result.qa.verified)             # True
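As with achunk, this block has to run inside a coroutine; in a script, wrap it with asyncio.run. A sketch that also loops process_one over several files, assuming one orchestrator session can handle repeated calls:
import asyncio
from distillcore_agents import Orchestrator
async def main() -> None:
    async with Orchestrator(openai_api_key="sk-...") as orch:
        for path in ["contract.pdf", "report.pdf"]:
            result = await orch.process_one(path)
            print(path, result.triage.preset, result.processing.chunk_count)
asyncio.run(main())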
Next Steps
- Configuration — Customize chunk sizes, models, and thresholds
- Chunking — Standalone chunking API with 4 strategies
- Presets — Domain-specific processing (generic, legal)
- Extractors — Supported file formats and custom extractors