Document processing powered by AI agents
A 7-stage pipeline that extracts, classifies, structures, chunks, enriches, embeds, and validates documents. An autonomous 4-agent layer handles triage, processing, quality assurance, and research.
$ pip install distillcore          # core library
$ pip install distillcore-agents   # optional agent layer (early)
MIT licensed · open source · self-hosted
7-stage pipeline
Each stage validates its output before passing it to the next. Coverage thresholds flag any text lost between stages.
Extract
Pull text from PDF, DOCX, HTML, TXT, and Markdown files. OCR fallback for scanned documents.
Classify
Identify document type, title, author, and domain-specific metadata using LLM analysis.
Structure
Parse document structure into hierarchical sections with headings, body text, and tables.
Chunk
Split into semantic chunks with configurable target size and character overlap for retrieval.
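As a rough illustration of target size and character overlap, here is a minimal sketch in plain Python. It is not distillcore's implementation: the function name `chunk_text`, its parameters, and the whitespace-based break rule are all assumptions made for this example, and a semantic chunker would also use structure and meaning to pick boundaries.

```python
def chunk_text(text: str, target_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most target_size characters.

    Hypothetical helper: each chunk after the first starts `overlap`
    characters before the end of the previous chunk.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must satisfy 0 <= overlap < target_size")

    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + target_size, len(text))
        if end < len(text):
            # Prefer to cut at a space so the chunk does not end mid-word.
            # Searching past start + overlap guarantees forward progress.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so neighbouring chunks share context.
        start = end - overlap
    return chunks
```

Because every chunk after the first repeats the last `overlap` characters of its predecessor, the original text can be rebuilt by dropping that prefix from each later chunk, which makes the split easy to verify.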
Enrich
Add topic labels, key concepts, and relevance scores to each chunk via LLM enrichment.
Embed
Generate vector embeddings with OpenAI, Ollama, local sentence-transformers, or Cohere.
Validate
Verify text coverage, chunk completeness, and end-to-end quality at each stage boundary.
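One way to picture a coverage threshold at a stage boundary is a word-level comparison between the source text and the chunks produced from it. The sketch below is an assumption-laden illustration, not distillcore's validator: `text_coverage`, `check_stage`, and the default threshold of 0.95 are hypothetical names and values.

```python
from collections import Counter


def text_coverage(source: str, chunks: list[str]) -> float:
    """Return the fraction of source words that also appear in the chunks.

    Words are counted as a multiset, so a word that occurs three times in
    the source must occur at least three times across the chunks to count
    as fully covered.
    """
    source_words = Counter(source.split())
    total = sum(source_words.values())
    if total == 0:
        return 1.0  # nothing to cover

    chunk_words: Counter[str] = Counter()
    for chunk in chunks:
        chunk_words.update(chunk.split())

    covered = sum(min(count, chunk_words[word]) for word, count in source_words.items())
    return covered / total


def check_stage(source: str, chunks: list[str], threshold: float = 0.95) -> float:
    """Raise ValueError if coverage falls below the (assumed) threshold."""
    coverage = text_coverage(source, chunks)
    if coverage < threshold:
        raise ValueError(f"coverage {coverage:.2%} is below threshold {threshold:.2%}")
    return coverage
```

A word-count comparison ignores ordering, so it catches dropped text but not reshuffled text. A stricter check would compare character spans.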
Built for production
Everything you need to process documents at scale, from extraction to semantic search.
5 file formats
PDF, DOCX, HTML, TXT, and Markdown with a pluggable extractor protocol for custom formats.
4 embedding providers
OpenAI, Ollama, local sentence-transformers, and Cohere. Bring your own embedding function.
Domain presets
Generic and legal presets out of the box. Create custom presets with your own LLM prompts.
SQLite storage
Cosine similarity search, tenant isolation, and full document lifecycle in a single file.
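To show how cosine similarity search and tenant isolation can live together in one SQLite file, here is a small sketch using only the standard library. The schema, the function names (`open_store`, `add_chunk`, `search`), and the JSON-encoded embeddings are assumptions for illustration and do not describe distillcore's actual storage layout.

```python
import json
import math
import sqlite3


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with a hypothetical single-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, "
        "text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def add_chunk(conn: sqlite3.Connection, tenant_id: str, text: str, embedding: list[float]) -> None:
    conn.execute(
        "INSERT INTO chunks (tenant_id, text, embedding) VALUES (?, ?, ?)",
        (tenant_id, text, json.dumps(embedding)),
    )
    conn.commit()


def search(conn: sqlite3.Connection, tenant_id: str, query: list[float], top_k: int = 3):
    """Return up to top_k (score, text) pairs, best first, for one tenant only."""
    # The WHERE clause is the isolation boundary: rows belonging to other
    # tenants are never loaded, let alone scored.
    rows = conn.execute(
        "SELECT text, embedding FROM chunks WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()
    scored = [(cosine(query, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Scoring every row in Python is a brute-force scan. That is fine for a sketch, but a larger store would likely need an index or a vector extension.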
Async & batch
process_document_async for async workflows, plus process_batch with a configurable max_concurrent limit for throughput.
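A concurrency cap of this kind is commonly built on `asyncio.Semaphore`. The sketch below shows that general pattern rather than distillcore's code: `run_batch` and `fake_process` are hypothetical stand-ins, and only the `max_concurrent` parameter name is taken from the text above.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_batch(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int = 4,
) -> list[R]:
    """Run worker over items with at most max_concurrent calls in flight.

    Results come back in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_process(path: str) -> str:
    """Hypothetical stand-in for a real document-processing coroutine."""
    await asyncio.sleep(0.01)
    return f"processed:{path}"


if __name__ == "__main__":
    paths = ["a.pdf", "b.docx", "c.html"]
    print(asyncio.run(run_batch(paths, fake_process, max_concurrent=2)))
```

Raising `max_concurrent` increases throughput until the bottleneck moves elsewhere, for example to LLM or embedding API rate limits.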
4-agent layer
Triage, Processing, QA, and Research agents with autonomous orchestration and streaming.
Security hardened
Path traversal prevention, prompt injection hardening, tenant isolation, config validation.
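Path traversal prevention usually comes down to resolving a user-supplied path and confirming it still sits inside an allowed directory. The helper below is a generic sketch of that check using `pathlib`. The name `resolve_inside` is hypothetical, and the source does not say how distillcore implements its own safeguard.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it.

    Hypothetical helper: raises ValueError for inputs such as
    "../secrets.txt" or an absolute path outside base_dir.
    """
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so the containment
    # check below also rejects absolute paths outside the base directory.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Resolving before comparing matters: a plain string-prefix check can be fooled by `..` segments or symlinks that only disappear once the path is normalized.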
Tested & published
Comprehensive test suite across both packages with CI/CD pipelines. Published to PyPI.
Ready to get started?
Install distillcore and process your first document in under a minute.