Getting Started

Installation

# Core library (chunking, extraction, validation, storage — no API key needed)
pip install distillcore

# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
pip install distillcore[openai]

# With file format support
pip install distillcore[pdf]   # PDF extraction
pip install distillcore[docx]  # DOCX extraction
pip install distillcore[html]  # HTML extraction

# Everything
pip install distillcore[all]

For alternative embedding providers:

pip install distillcore[local]   # sentence-transformers
pip install distillcore[cohere]  # Cohere embeddings

For the agent layer:

pip install distillcore-agents

Prerequisites

  • Python 3.11+
  • An OpenAI API key is needed for LLM-powered stages (classification, structuring, enrichment) and OpenAI embeddings. Not required for standalone chunking, text extraction, validation, or storage.

Set your API key:

export OPENAI_API_KEY="sk-..."

Or pass it in config:

from distillcore import process_document, DistillConfig

config = DistillConfig(openai_api_key="sk-...")
result = process_document("report.pdf", config=config)

Standalone Chunking (No API Key)

The fastest way to get started — chunk text without any LLM calls:

from distillcore import chunk, estimate_tokens

with open("document.txt") as f:
    text = f.read()

# Paragraph strategy (default) — splits on paragraph boundaries
chunks = chunk(text, strategy="paragraph", target_tokens=500)

# Sentence strategy — splits on sentence boundaries
chunks = chunk(text, strategy="sentence", target_tokens=300)

# Fixed strategy — sliding window with overlap
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)

# Check token estimates
for i, c in enumerate(chunks):
    print(f"Chunk {i}: {estimate_tokens(c)} tokens")
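
If you want to keep chunks around for a later step, you can write them to disk yourself. This is a minimal sketch rather than part of the distillcore API: the chunks.jsonl filename is arbitrary, and str() is used because the examples above don't say whether chunk() returns plain strings or richer objects.

import json
from distillcore import chunk, estimate_tokens

with open("document.txt") as f:
    text = f.read()

chunks = chunk(text, strategy="paragraph", target_tokens=500)

# One JSON object per line; str() is a safeguard in case chunks are
# richer objects rather than plain strings
with open("chunks.jsonl", "w") as out:
    for i, c in enumerate(chunks):
        record = {"index": i, "tokens": estimate_tokens(c), "text": str(c)}
        out.write(json.dumps(record) + "\n")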

There’s also an LLM-driven strategy for semantic chunking:

# LLM strategy — GPT-4o groups sentences by topic (requires API key)
chunks = chunk(text, strategy="llm", api_key="sk-...")

And an async version:

from distillcore import achunk

chunks = await achunk(text, strategy="paragraph", target_tokens=500)
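
achunk must be awaited inside a running event loop. In a plain script you can drive it with asyncio.run; the wrapper below is ordinary Python, and the achunk call itself is unchanged from the example above.

import asyncio
from distillcore import achunk

async def main():
    with open("document.txt") as f:
        text = f.read()
    # Same call as above, just awaited inside an event loop
    return await achunk(text, strategy="paragraph", target_tokens=500)

chunks = asyncio.run(main())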

First Pipeline Run

Process a file

from distillcore import process_document

result = process_document("report.pdf")

print(result.document.metadata.document_type)   # e.g., "report"
print(result.document.metadata.document_title)  # e.g., "Q4 Financial Report"
print(f"Sections: {len(result.document.sections)}")
print(f"Chunks: {len(result.chunks)}")
print(f"Coverage: {result.validation.end_to_end_coverage:.1%}")
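
The coverage figure doubles as a quick quality gate. A small sketch using only the fields shown above; the 0.95 threshold is an arbitrary value chosen for illustration.

from distillcore import process_document

result = process_document("report.pdf")

# Fail fast if end-to-end coverage falls below an (arbitrary) threshold
if result.validation.end_to_end_coverage < 0.95:
    raise RuntimeError(
        f"Low end-to-end coverage: {result.validation.end_to_end_coverage:.1%}"
    )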

Process raw text

from distillcore import process_text, DistillConfig

result = process_text(
    "The court finds that the defendant...",
    config=DistillConfig(openai_api_key="sk-..."),
)

With embeddings

from distillcore import process_document

result = process_document(
    "report.pdf",
    embed=True,  # uses OpenAI text-embedding-3-small by default
)

# Each chunk now has an embedding vector
print(len(result.chunks[0].embedding))  # e.g., 1536

Processed results can be saved to the built-in store and searched with a query embedding:

from distillcore import Store
from distillcore.llm.client import embed_texts

store = Store()
doc_id = store.save(result)

# Generate a query embedding and search
query_emb = embed_texts(["financial reporting requirements"])[0]
results = store.search(query_emb, top_k=5)

for r in results:
    print(r["score"], r["text"][:100])

Using the Agent Layer

from distillcore_agents import Orchestrator

async with Orchestrator(openai_api_key="sk-...") as orch:
    result = await orch.process_one("contract.pdf")
    print(result.triage.preset)           # "legal"
    print(result.processing.chunk_count)  # 24
    print(result.qa.verified)             # True
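
To process several files in one session you can reuse the same orchestrator. This sketch assumes it is safe to run multiple process_one calls concurrently with asyncio.gather, which the example above doesn't state; if in doubt, await them one at a time.

import asyncio
from distillcore_agents import Orchestrator

async def main(paths):
    async with Orchestrator(openai_api_key="sk-...") as orch:
        # One process_one call per file, run concurrently
        results = await asyncio.gather(*(orch.process_one(p) for p in paths))
    for path, result in zip(paths, results):
        print(path, result.processing.chunk_count)

asyncio.run(main(["contract.pdf", "report.pdf"]))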

Next Steps

  • Configuration — Customize chunk sizes, models, and thresholds
  • Chunking — Standalone chunking API with 4 strategies
  • Presets — Domain-specific processing (generic, legal)
  • Extractors — Supported file formats and custom extractors