Introduction
distillcore is a Python library for intelligent document processing. It provides a 7-stage pipeline that takes raw files and produces structured, chunked, enriched, and embedded documents ready for retrieval and analysis.
distillcore-agents is a companion library that adds an autonomous 4-agent orchestration layer on top of the pipeline, handling triage, processing, quality assurance, and research.
The Pipeline
Every document flows through seven stages:
- Extract — Pull text from PDF, DOCX, HTML, TXT, or Markdown
- Classify — Identify document type, title, and metadata via LLM
- Structure — Parse into hierarchical sections
- Chunk — Split into semantic chunks with overlap
- Enrich — Add topics, key concepts, and relevance scores
- Embed — Generate vector embeddings
- Validate — Verify coverage at each stage boundary
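The Chunk and Validate stages can be illustrated with a small standalone sketch. This is not distillcore's implementation. The function names `chunk_paragraphs` and `coverage`, the whitespace-based token count, and the paragraph-level overlap are all assumptions chosen for illustration.

```python
def chunk_paragraphs(text: str, target_tokens: int = 500, overlap: int = 1) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly `target_tokens` tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    The last `overlap` paragraphs of a chunk are repeated at the start of the next.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk when adding this paragraph would exceed the target.
        if current and count + n > target_tokens:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return ["\n\n".join(c) for c in chunks]


def coverage(text: str, chunks: list[str]) -> float:
    """Fraction of source paragraphs that appear verbatim in at least one chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 1.0
    covered = sum(1 for p in paragraphs if any(p in c for c in chunks))
    return covered / len(paragraphs)


# Hypothetical input: four paragraphs of three words each.
text = "\n\n".join(f"p{i} word word" for i in range(4))
chunks = chunk_paragraphs(text, target_tokens=6, overlap=1)
print(len(chunks), coverage(text, chunks))  # prints: 3 1.0
```

A coverage value below 1.0 at a stage boundary would signal that some source text was dropped before reaching the next stage.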
The Agent Layer
Four specialized agents work together:
| Agent | Role |
|---|---|
| Triage | Assesses documents, selects preset and configuration |
| Processing | Executes the distillcore pipeline with triage config |
| QA | Validates coverage thresholds and chunk quality |
| Research | Searches stored documents, synthesizes answers with citations |
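The handoff between the four roles in the table can be sketched with plain functions. Everything here is hypothetical and is not the distillcore-agents API: the `Job` dataclass, the function names, the extension-based preset rule, and the placeholder coverage value are stand-ins that only show the order in which responsibilities pass from one agent to the next.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """State handed from one agent to the next (hypothetical structure)."""
    path: str
    preset: str = "default"
    chunks: list[str] = field(default_factory=list)
    coverage: float = 0.0
    approved: bool = False


def triage(job: Job) -> Job:
    # Triage: choose a preset; here a toy rule based on the file extension.
    job.preset = "pdf" if job.path.lower().endswith(".pdf") else "default"
    return job


def process(job: Job) -> Job:
    # Processing: stand-in for running the pipeline with the triage config.
    job.chunks = [f"[{job.preset}] chunk {i}" for i in range(3)]
    job.coverage = 0.97  # placeholder value, not a measured result
    return job


def qa(job: Job, min_coverage: float = 0.95) -> Job:
    # QA: approve only if chunks exist and coverage meets the threshold.
    job.approved = bool(job.chunks) and job.coverage >= min_coverage
    return job


def research(store: list[Job], query: str) -> list[tuple[str, str]]:
    # Research: naive keyword search over approved jobs, returning
    # (source path, chunk) pairs so every hit carries its citation.
    return [
        (job.path, chunk)
        for job in store
        if job.approved
        for chunk in job.chunks
        if query.lower() in chunk.lower()
    ]


store = [qa(process(triage(Job("report.pdf"))))]
hits = research(store, "chunk 1")
print(hits)
```

Keeping each role as a separate step means a document that fails QA never becomes visible to the research step.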
Quick Example
```python
from distillcore import process_document, chunk

# Full pipeline
result = process_document("report.pdf")
print(result.document.metadata.document_type)
print(f"{len(result.chunks)} chunks")

# Or just chunk text (no LLM or API key needed)
chunks = chunk("Your text here...", strategy="paragraph", target_tokens=500)
```
Installation
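To check which of the two distributions named in this section are already installed in the current environment, the standard library's `importlib.metadata` can be queried. This is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution names are taken from the pip commands in this section.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("distillcore", "distillcore-agents"):
    found = installed_version(name)
    print(f"{name}: {found or 'not installed'}")
```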
```bash
# Core library (chunking, extraction, validation, storage)
pip install distillcore

# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "distillcore[openai]"

# Everything
pip install "distillcore[all]"

# Agent layer
pip install distillcore-agents
```
Next Steps
- Getting Started — Install and run your first pipeline
- Configuration — Customize pipeline behavior
- API Reference — Full function signatures
- Agent Overview — Autonomous document processing