distillcore

Introduction

distillcore is a Python library for intelligent document processing. It provides a 7-stage pipeline that takes raw files and produces structured, chunked, enriched, and embedded documents ready for retrieval and analysis.

distillcore-agents is a companion library that adds an autonomous 4-agent orchestration layer on top of the pipeline, handling triage, processing, quality assurance, and research.

The Pipeline

Every document flows through seven stages:

  1. Extract — Pull text from PDF, DOCX, HTML, TXT, or Markdown
  2. Classify — Identify document type, title, and metadata via LLM
  3. Structure — Parse into hierarchical sections
  4. Chunk — Split into semantic chunks with overlap
  5. Enrich — Add topics, key concepts, and relevance scores
  6. Embed — Generate vector embeddings
  7. Validate — Verify coverage at each stage boundary
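Stages 4 and 7 can be illustrated without any LLM. The sketch below is not distillcore's implementation: `chunk_spans` and `coverage` are hypothetical helpers, and one whitespace-separated word is assumed to stand in for one token. It splits a text into overlapping windows and then checks what fraction of the source the chunks cover.

```python
def chunk_spans(n_words, size, overlap):
    """Return (start, end) word spans of at most `size` words.

    Consecutive spans share `overlap` words. Hypothetical helper,
    not part of distillcore.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    spans = []
    start = 0
    while start < n_words:
        end = min(start + size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # the last window already reaches the end of the text
        start += step
    return spans


def coverage(n_words, spans):
    """Fraction of source word positions that fall inside at least one span."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return len(covered) / n_words if n_words else 1.0


text = "one two three four five six seven eight nine ten"
words = text.split()
spans = chunk_spans(len(words), size=4, overlap=1)
chunks = [" ".join(words[s:e]) for s, e in spans]

for piece in chunks:
    print(piece)
print(f"coverage: {coverage(len(words), spans):.0%}")
```

A coverage value below 1.0 would indicate that some of the source text was dropped between stages, which is the kind of loss a validation step is meant to catch.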

The Agent Layer

Four specialized agents work together:

| Agent | Role |
| --- | --- |
| Triage | Assesses documents, selects preset and configuration |
| Processing | Executes the distillcore pipeline with triage config |
| QA | Validates coverage thresholds and chunk quality |
| Research | Searches stored documents, synthesizes answers with citations |
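The hand-off between the four roles can be sketched in plain Python. Nothing here is the distillcore-agents API: the function names, the `TriageDecision` fields, the preset names, and the 0.95 threshold are all assumptions chosen for illustration, and each function is a stand-in for what a real agent would do.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    preset: str          # hypothetical preset name
    target_tokens: int   # chunk size handed to processing


def triage(filename):
    """Pick a configuration from the file extension (stand-in for a real assessment)."""
    if filename.lower().endswith(".pdf"):
        return TriageDecision(preset="report", target_tokens=500)
    return TriageDecision(preset="generic", target_tokens=300)


def process(text, decision):
    """Stand-in for the pipeline: fixed-size word windows, one word per token."""
    words = text.split()
    n = decision.target_tokens
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


def qa(text, chunks, min_coverage=0.95):
    """Pass if the chunks keep at least min_coverage of the non-whitespace characters."""
    source = len("".join(text.split()))
    kept = sum(len("".join(c.split())) for c in chunks)
    ratio = kept / source if source else 1.0
    return ratio >= min_coverage, ratio


def research(query, store):
    """Return (document, chunk) pairs containing the query, as naive citations."""
    q = query.lower()
    return [(doc, c) for doc, chunks in store.items() for c in chunks if q in c.lower()]


def run(filename, text, store):
    """Triage, process, then store the chunks only if QA passes."""
    decision = triage(filename)
    chunks = process(text, decision)
    ok, ratio = qa(text, chunks)
    if ok:
        store[filename] = chunks
    return decision, ok, ratio


store = {}
decision, ok, ratio = run("report.pdf", "Revenue grew.\n\nCosts fell.", store)
print(decision.preset, ok, ratio)
print(research("costs", store))
```

The point of the sketch is the ordering: triage produces a configuration, processing consumes it, QA gates what gets stored, and research only ever sees documents that passed QA.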

Quick Example

    from distillcore import process_document, chunk

    # Full pipeline
    result = process_document("report.pdf")
    print(result.document.metadata.document_type)
    print(f"{len(result.chunks)} chunks")

    # Or just chunk text: no LLM, no API key needed
    chunks = chunk("Your text here...", strategy="paragraph", target_tokens=500)

Installation

    # Core library (chunking, extraction, validation, storage)
    pip install distillcore

    # With LLM features (classification, structuring, enrichment, OpenAI embeddings)
    pip install distillcore[openai]

    # Everything
    pip install distillcore[all]

    # Agent layer
    pip install distillcore-agents
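After installing, it can be useful to confirm which of the two packages is present in the current environment. The sketch below uses only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the distribution names are the ones used in the install commands.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("distillcore", "distillcore-agents"):
    found = installed_version(name)
    print(f"{name}: {found or 'not installed'}")
```

This only reports whether each distribution is installed. It does not show which optional extras (such as `openai`) were selected at install time.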

Next Steps