Intelligent Document Processing

Document processing powered by AI agents

A 7-stage pipeline that extracts, classifies, structures, chunks, enriches, embeds, and validates documents. An optional, autonomous 4-agent layer handles triage, processing, quality assurance, and research.

$ pip install distillcore         # core library
$ pip install distillcore-agents  # optional agent layer (early)

MIT licensed · open source · self-hosted

extract → classify → structure → chunk → enrich → embed → validate

7-stage pipeline

Each stage validates its output before passing it to the next stage. Coverage thresholds catch text that goes missing between stages.
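To picture stage-level validation, here is a minimal sketch of a runner that checks each stage's output before handing it on. It is illustrative only: Stage, StageValidationError, run_pipeline, and the two toy stages are hypothetical names, not distillcore's API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]          # transforms the previous stage's output
    check: Callable[[Any, Any], bool]  # (input, output) -> is the output acceptable?


class StageValidationError(RuntimeError):
    """Raised when a stage's output fails its own check."""


def run_pipeline(stages: list[Stage], data: Any) -> Any:
    """Run stages in order, validating each output before passing it on."""
    for stage in stages:
        result = stage.run(data)
        if not stage.check(data, result):
            raise StageValidationError(f"stage '{stage.name}' failed validation")
        data = result
    return data


# Toy stages: normalise whitespace, then split into words.
stages = [
    Stage("extract", lambda raw: " ".join(raw.split()), lambda i, o: len(o) > 0),
    Stage("chunk", lambda text: text.split(" "), lambda i, o: " ".join(o) == i),
]

words = run_pipeline(stages, "  hello   pipeline  world ")
```

Failing fast at the stage boundary means a bad extraction never reaches the later, more expensive stages.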

1. Extract

Pull text from PDF, DOCX, HTML, TXT, and Markdown files. OCR fallback for scanned documents.

2. Classify

Identify document type, title, author, and domain-specific metadata using LLM analysis.

3. Structure

Parse document structure into hierarchical sections with headings, body text, and tables.
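As a rough illustration of hierarchical sections, the sketch below builds a section tree from Markdown headings. It assumes ATX-style headings only and ignores tables and code fences; Section and parse_sections are hypothetical names, not part of distillcore.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    heading: str
    level: int
    body: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def parse_sections(markdown: str) -> Section:
    """Build a section tree from ATX-style Markdown headings (#, ##, ...)."""
    root = Section(heading="", level=0)
    stack = [root]  # stack[-1] is the section currently being filled
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            level = len(match.group(1))
            # Close any open sections at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(heading=match.group(2).strip(), level=level)
            stack[-1].children.append(section)
            stack.append(section)
        elif line.strip():
            # Non-heading text belongs to the innermost open section.
            stack[-1].body.append(line.strip())
    return root


doc = "# Intro\nHello.\n## Scope\nDetails.\n# Terms\nMore."
tree = parse_sections(doc)
```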

4. Chunk

Split text into semantic chunks for retrieval, with a configurable target size and character overlap.
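To make target size and character overlap concrete, here is a minimal sketch. It uses a plain character window that prefers to break at whitespace, which is simpler than semantic chunking; chunk_text and its parameter names are assumptions for illustration, not distillcore's API.

```python
def chunk_text(text: str, target_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most target_size characters.

    Each chunk after the first starts with the last `overlap` characters
    of the previous chunk. Chunks end at a space where one is available.
    """
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be >= 0 and smaller than target_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + target_size, len(text))
        if end < len(text):
            # Prefer the last space in the window; searching only past the
            # overlap region guarantees the next start moves forward.
            space = text.rfind(" ", start + overlap + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks


chunks = chunk_text("word " * 100, target_size=60, overlap=10)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of some duplicated text in the index.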

5. Enrich

Add topic labels, key concepts, and relevance scores to each chunk via LLM enrichment.

6. Embed

Generate vector embeddings with OpenAI, Ollama, local sentence-transformers, or Cohere.

7. Validate

Verify text coverage, chunk completeness, and end-to-end quality at each stage boundary.
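A text coverage check could look roughly like the sketch below, which measures what fraction of the source words survive into the chunks. The word-level metric, the function names, and the 0.95 default threshold are all assumptions chosen for illustration, not distillcore's actual validation logic.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def text_coverage(source: str, chunks: list[str]) -> float:
    """Return the fraction of source words still present across all chunks."""
    source_counts = word_counts(source)
    total = sum(source_counts.values())
    if total == 0:
        return 1.0  # an empty source has nothing to lose
    chunk_counts = word_counts(" ".join(chunks))
    # Counter & Counter keeps the minimum count per word, so overlapping
    # chunks cannot push coverage above 1.0.
    kept = sum((source_counts & chunk_counts).values())
    return kept / total


def check_coverage(source: str, chunks: list[str], threshold: float = 0.95) -> None:
    """Raise ValueError when coverage falls below the threshold."""
    coverage = text_coverage(source, chunks)
    if coverage < threshold:
        raise ValueError(f"coverage {coverage:.2%} is below threshold {threshold:.2%}")
```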

Built for production

Everything you need to process documents at scale, from extraction to semantic search.

5 file formats

PDF, DOCX, HTML, TXT, and Markdown with a pluggable extractor protocol for custom formats.
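A pluggable extractor can be modelled with typing.Protocol, as in the sketch below. The Extractor protocol, the registry, and the CSV extractor are hypothetical and only show the general pattern; distillcore's real protocol may use different names and signatures.

```python
import csv
import tempfile
from pathlib import Path
from typing import Protocol


class Extractor(Protocol):
    """Any object with a matching extract() method satisfies this protocol."""

    def extract(self, path: Path) -> str: ...


class PlainTextExtractor:
    def extract(self, path: Path) -> str:
        return path.read_text(encoding="utf-8")


class CsvExtractor:
    """Example custom format: flatten CSV rows into lines of text."""

    def extract(self, path: Path) -> str:
        with path.open(newline="", encoding="utf-8") as handle:
            return "\n".join(", ".join(row) for row in csv.reader(handle))


# Registry keyed by lowercase file suffix.
EXTRACTORS: dict[str, Extractor] = {
    ".txt": PlainTextExtractor(),
    ".md": PlainTextExtractor(),
}


def register(suffix: str, extractor: Extractor) -> None:
    EXTRACTORS[suffix.lower()] = extractor


def extract_text(path: Path) -> str:
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"no extractor registered for '{path.suffix}'")
    return extractor.extract(path)


# Usage: register the custom extractor, then extract a temporary CSV file.
register(".csv", CsvExtractor())
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "people.csv"
    sample.write_text("name,role\nAda,engineer\n", encoding="utf-8")
    extracted = extract_text(sample)
```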

4 embedding providers

OpenAI, Ollama, local sentence-transformers, and Cohere. Bring your own embedding function.
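To show what a bring-your-own embedding function might look like, the sketch below assumes the contract is simply a callable from text to a list of floats. That contract, and the names hashed_embedding and embed_chunks, are assumptions. The hashing trick here is a toy stand-in for a real model.

```python
import hashlib
import math
import re


def hashed_embedding(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each word into one of `dim` buckets, then L2-normalise."""
    vector = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        # hashlib gives the same bucket on every run, unlike the built-in hash().
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def embed_chunks(chunks, embed_fn=hashed_embedding):
    """Apply any callable of the form str -> list[float] to every chunk."""
    return [embed_fn(chunk) for chunk in chunks]


vectors = embed_chunks(["alpha beta", "alpha beta", "gamma"])
```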

Domain presets

Generic and legal presets out of the box. Create custom presets with your own LLM prompts.

SQLite storage

Cosine similarity search, tenant isolation, and full document lifecycle in a single file.
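The idea of tenant-scoped cosine search over SQLite can be sketched with the standard library alone. The table layout, the JSON-encoded embeddings, and the add_chunk and search helpers below are assumptions for illustration, not distillcore's schema or API.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")  # a file path here would persist the store
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL,"
    " text TEXT NOT NULL, embedding TEXT NOT NULL)"
)


def add_chunk(tenant_id, text, embedding):
    conn.execute(
        "INSERT INTO chunks (tenant_id, text, embedding) VALUES (?, ?, ?)",
        (tenant_id, text, json.dumps(embedding)),
    )


def search(tenant_id, query_embedding, top_k=3):
    """Rank one tenant's chunks by cosine similarity to the query vector."""
    # The WHERE clause is the isolation boundary: other tenants' rows are never loaded.
    rows = conn.execute(
        "SELECT text, embedding FROM chunks WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()
    scored = [(cosine(query_embedding, json.loads(emb)), text) for text, emb in rows]
    return sorted(scored, reverse=True)[:top_k]


add_chunk("acme", "contract terms", [1.0, 0.0])
add_chunk("acme", "payment schedule", [0.0, 1.0])
add_chunk("globex", "secret memo", [1.0, 0.0])

results = search("acme", [0.9, 0.1])
```

Scoring in Python after a tenant-filtered SELECT keeps the sketch dependency-free; it scans every row for the tenant, so it suits small stores rather than large indexes.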

Async & batch

Async processing with process_document_async, and batch processing with process_batch and a configurable max_concurrent limit for throughput.
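A concurrency cap like max_concurrent is commonly built on asyncio.Semaphore. The sketch below shows that pattern with hypothetical process_one and run_batch functions; it is not distillcore's implementation of process_batch.

```python
import asyncio


async def process_one(path: str) -> str:
    """Stand-in for real document processing (hypothetical)."""
    await asyncio.sleep(0.01)
    return path.upper()


async def run_batch(paths: list[str], max_concurrent: int = 4) -> list[str]:
    """Process all paths with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path: str) -> str:
        async with semaphore:
            return await process_one(path)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in paths))


results = asyncio.run(run_batch(["a.pdf", "b.docx", "c.md"], max_concurrent=2))
```

A lower limit protects rate-limited LLM and embedding APIs; a higher one improves throughput when the work is mostly waiting on I/O.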

4-agent layer

Triage, Processing, QA, and Research agents with autonomous orchestration and streaming.

Security hardened

Path traversal prevention, prompt injection hardening, tenant isolation, config validation.
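Path traversal prevention usually comes down to resolving the requested path and confirming it stays inside an allowed base directory. The sketch below shows that check with pathlib; resolve_inside is a hypothetical helper, not a distillcore function.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # resolve() collapses ".." and follows symlinks, so the check sees the real target.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path}")
    return candidate
```

Both "../" sequences and absolute paths are rejected, because joining an absolute path onto the base replaces the base entirely and the containment check then fails.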

Tested & published

Comprehensive test suite across both packages with CI/CD pipelines. Published to PyPI.

Ready to get started?

Install distillcore and process your first document in under a minute.