Getting Started
Installation
# Core library (chunking, extraction, validation, storage — no API key needed)
pip install distillcore
# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
pip install distillcore[openai]
# With file format support
pip install distillcore[pdf] # PDF extraction
pip install distillcore[docx] # DOCX extraction
pip install distillcore[html] # HTML extraction
# Everything
pip install distillcore[all]
For alternative embedding providers:
pip install distillcore[local] # sentence-transformers
pip install distillcore[cohere] # Cohere embeddings
For the agent layer:
pip install distillcore-agents
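To confirm the install before moving on, a plain import is enough; this is only a sanity check and assumes nothing beyond the package name used throughout these examples:
import distillcore
print("distillcore imported successfully")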
Prerequisites
- Python 3.11+
- An OpenAI API key for LLM-powered stages (classification, structuring, enrichment) and OpenAI embeddings; not required for standalone chunking, text extraction, validation, or storage.
Set your API key:
export OPENAI_API_KEY="sk-..."
Or pass it in config:
from distillcore import process_document, DistillConfig
config = DistillConfig(openai_api_key="sk-...")
result = process_document("report.pdf", config=config)
Standalone Chunking (No API Key)
The fastest way to get started — chunk text without any LLM calls:
from distillcore import chunk, estimate_tokens
text = open("document.txt").read()
# Paragraph strategy (default) — splits on paragraph boundaries
chunks = chunk(text, strategy="paragraph", target_tokens=500)
# Sentence strategy — splits on sentence boundaries
chunks = chunk(text, strategy="sentence", target_tokens=300)
# Fixed strategy — sliding window with overlap
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)
# Check token estimates
for i, c in enumerate(chunks):
print(f"Chunk {i}: {estimate_tokens(c)} tokens")There’s also an LLM-driven strategy for semantic chunking:
# LLM strategy — GPT-4o groups sentences by topic (requires API key)
chunks = chunk(text, strategy="llm", api_key="sk-...")And an async version:
from distillcore import achunk
chunks = await achunk(text, strategy="paragraph", target_tokens=500)
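The await call above assumes you are already inside a coroutine. In a plain script, wrap it with asyncio.run; a minimal sketch:
import asyncio
from distillcore import achunk
async def main() -> None:
    text = open("document.txt").read()
    chunks = await achunk(text, strategy="paragraph", target_tokens=500)
    print(f"{len(chunks)} chunks")
asyncio.run(main())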
First Pipeline Run
Process a file
from distillcore import process_document
result = process_document("report.pdf")
print(result.document.metadata.document_type) # e.g., "report"
print(result.document.metadata.document_title) # e.g., "Q4 Financial Report"
print(f"Sections: {len(result.document.sections)}")
print(f"Chunks: {len(result.chunks)}")
print(f"Coverage: {result.validation.end_to_end_coverage:.1%}")Process raw text
Process raw text
from distillcore import process_text, DistillConfig
result = process_text(
"The court finds that the defendant...",
config=DistillConfig(openai_api_key="sk-..."),
)With embeddings
from distillcore import process_document
result = process_document(
"report.pdf",
embed=True, # uses OpenAI text-embedding-3-small by default
)
# Each chunk now has an embedding vector
print(len(result.chunks[0].embedding))  # e.g., 1536
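Because each embedding is just a vector of floats, chunks can be compared directly; a small sketch using only the standard library, assuming the document produced at least two chunks:
import math
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
# Similarity between the first two chunks of the processed document.
print(cosine(result.chunks[0].embedding, result.chunks[1].embedding))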
Save and search
from distillcore import Store
from distillcore.llm.client import embed_texts
store = Store()
doc_id = store.save(result)
# Generate a query embedding and search
query_emb = embed_texts(["financial reporting requirements"])[0]
results = store.search(query_emb, top_k=5)
for r in results:
print(r["score"], r["text"][:100])Using the Agent Layer
Using the Agent Layer
from distillcore_agents import Orchestrator
async with Orchestrator(openai_api_key="sk-...") as orch:
    result = await orch.process_one("contract.pdf")
    print(result.triage.preset)           # "legal"
    print(result.processing.chunk_count)  # 24
    print(result.qa.verified)             # True
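As with achunk, this block has to run inside a coroutine; in a script, wrap it with asyncio.run. A sketch that also loops process_one over several files, assuming one orchestrator session can handle repeated calls:
import asyncio
from distillcore_agents import Orchestrator
async def main() -> None:
    async with Orchestrator(openai_api_key="sk-...") as orch:
        for path in ["contract.pdf", "report.pdf"]:
            result = await orch.process_one(path)
            print(path, result.triage.preset, result.processing.chunk_count)
asyncio.run(main())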
Next Steps
- Configuration — Customize chunk sizes, models, and thresholds
- Chunking — Standalone chunking API with 4 strategies
- Presets — Domain-specific processing (generic, legal)
- Extractors — Supported file formats and custom extractors