Introduction

distillcore is a Python library for document processing. Its seven-stage pipeline takes raw files and produces structured, chunked, enriched, and embedded documents ready for retrieval and analysis.

distillcore-agents is a companion library that adds an autonomous four-agent orchestration layer on top of the pipeline. The agents handle triage, processing, quality assurance, and research.

The Pipeline

Every document flows through seven stages:

Extract — Pull text from PDF, DOCX, HTML, TXT, or Markdown
Classify — Identify document type, title, and metadata via LLM
Structure — Parse into hierarchical sections
Chunk — Split into semantic chunks with overlap
Enrich — Add topics, key concepts, and relevance scores
Embed — Generate vector embeddings
Validate — Verify coverage at each stage boundary
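
To make the Chunk and Validate stages more concrete, here is a minimal sketch in plain Python. It is not distillcore's implementation: the helper names chunk_paragraphs and coverage, the whitespace-based token count, and the one-paragraph overlap are all assumptions made for illustration.

```python
def chunk_paragraphs(text: str, target_tokens: int = 500,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Pack paragraphs into chunks of roughly target_tokens tokens.

    Hypothetical helper: tokens are approximated by whitespace-separated
    words, and the last paragraph(s) of each chunk are repeated at the
    start of the next one as overlap.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # paragraphs in `current` that no emitted chunk contains yet

    for para in paragraphs:
        size = sum(len(p.split()) for p in current)
        if fresh and size + len(para.split()) > target_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            fresh = 0
        current.append(para)
        fresh += 1

    if fresh:
        chunks.append("\n\n".join(current))
    return chunks


def coverage(text: str, chunks: list[str]) -> float:
    """Fraction of source words that appear in at least one chunk."""
    source_words = text.split()
    if not source_words:
        return 1.0
    chunk_words = {word for c in chunks for word in c.split()}
    return sum(word in chunk_words for word in source_words) / len(source_words)


sample = "alpha beta gamma\n\ndelta epsilon\n\nzeta eta theta iota"
sample_chunks = chunk_paragraphs(sample, target_tokens=5)
print(len(sample_chunks), "chunks, coverage", coverage(sample, sample_chunks))
```

A coverage value below 1.0 in a check like this would indicate that some source text was lost between stages, which is the kind of problem a validation step is meant to surface.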

The Agent Layer

Four specialized agents work together:

Agent	Role
Triage	Assesses documents, selects preset and configuration
Processing	Executes the distillcore pipeline with triage config
QA	Validates coverage thresholds and chunk quality
Research	Searches stored documents, synthesizes answers with citations
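
The division of labor in the table can be sketched as four plain functions wired together. This is a toy illustration only: none of the names below (TriageDecision, triage, process, qa, research, run) come from distillcore-agents, and the rules inside them are stand-ins for what the real agents decide.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    preset: str
    target_tokens: int


def triage(filename: str) -> TriageDecision:
    # Hypothetical rule: choose a preset from the file extension.
    if filename.lower().endswith(".pdf"):
        return TriageDecision(preset="report", target_tokens=500)
    return TriageDecision(preset="generic", target_tokens=300)


def process(text: str, decision: TriageDecision) -> list[str]:
    # Stand-in for the real pipeline: split on blank lines only.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def qa(text: str, chunks: list[str], min_coverage: float = 0.95) -> bool:
    # Compare the word count kept in chunks with the source word count.
    source_words = text.split()
    kept = sum(len(c.split()) for c in chunks)
    return bool(source_words) and kept / len(source_words) >= min_coverage


def research(query: str, store: dict[str, list[str]]) -> list[tuple[str, str]]:
    # Naive keyword match; each hit is a (document, chunk) pair, which
    # plays the role of a citation in this sketch.
    terms = query.lower().split()
    return [
        (doc, c)
        for doc, chunks in store.items()
        for c in chunks
        if any(t in c.lower() for t in terms)
    ]


def run(filename: str, text: str, store: dict[str, list[str]]) -> TriageDecision:
    decision = triage(filename)        # Triage: pick configuration
    chunks = process(text, decision)   # Processing: run the pipeline
    if not qa(text, chunks):           # QA: enforce the coverage threshold
        raise ValueError(f"{filename}: coverage below threshold")
    store[filename] = chunks           # Stored chunks feed Research
    return decision


store: dict[str, list[str]] = {}
decision = run("report.pdf", "Revenue grew.\n\nCosts fell.", store)
print(decision.preset, research("costs", store))
```

The point of the separation is that QA can reject a document before it is stored, so Research only ever searches output that passed the threshold.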

Quick Example


from distillcore import process_document, chunk
 
# Full pipeline
result = process_document("report.pdf")
print(result.document.metadata.document_type)
print(f"{len(result.chunks)} chunks")
 
# Or just chunk text — no LLM, no API key needed
chunks = chunk("Your text here...", strategy="paragraph", target_tokens=500)

Installation


# Core library (chunking, extraction, validation, storage)
pip install distillcore

# With LLM features (classification, structuring, enrichment, OpenAI embeddings)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "distillcore[openai]"

# Everything
pip install "distillcore[all]"

# Agent layer
pip install distillcore-agents
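
After installing, one way to confirm which of the two packages are present is to query their installed versions from Python. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical, and the distribution names are the pip names shown above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("distillcore", "distillcore-agents"):
    found = installed_version(name)
    print(f"{name}: {found or 'not installed'}")
```

This reports on the installed distributions only; it does not check whether optional extras such as openai were included.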

Next Steps

Getting Started — Install and run your first pipeline
Configuration — Customize pipeline behavior
API Reference — Full function signatures
Agent Overview — Autonomous document processing