Chunking

distillcore provides a standalone chunking API that works without any LLM calls or API keys (except the llm strategy). Import chunk and go.

Strategies

Paragraph (default)

Splits on paragraph boundaries (\n\n). Oversized paragraphs are subsplit using cascading strategies: line breaks → sentence boundaries → hard cut at word boundary. Supports overlap.

```python
from distillcore import chunk

chunks = chunk(text, strategy="paragraph", target_tokens=500, overlap_tokens=50)
```

Best for: general documents, articles, reports.
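The cascading subsplit can be sketched in plain Python. This is an illustrative approximation, not distillcore's internals: helper names are hypothetical and token counts use the len // 4 heuristic described later under Token Estimation.

```python
import re


def estimate(s: str) -> int:
    return len(s) // 4  # rough chars-per-token heuristic


def subsplit(piece: str, target: int) -> list[str]:
    """Break an oversized piece: line breaks -> sentences -> hard cut."""
    if estimate(piece) <= target:
        return [piece]
    for pattern in ("\n", r"(?<=[.!?])\s+(?=[A-Z])"):
        parts = [p for p in re.split(pattern, piece) if p.strip()]
        if len(parts) > 1:
            return [c for p in parts for c in subsplit(p, target)]
    # No finer boundary found: hard cut at word boundaries
    words, out, cur = piece.split(), [], []
    for w in words:
        cur.append(w)
        if estimate(" ".join(cur)) >= target:
            out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out


def paragraph_chunks(text: str, target: int = 500) -> list[str]:
    """Split on blank lines, subsplit oversized paragraphs, pack greedily."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, cur = [], ""
    for p in paras:
        for piece in subsplit(p, target):
            if cur and estimate(cur + "\n\n" + piece) > target:
                chunks.append(cur)
                cur = piece
            else:
                cur = cur + "\n\n" + piece if cur else piece
    if cur:
        chunks.append(cur)
    return chunks
```

The real implementation also handles overlap between consecutive chunks, which this sketch omits for brevity.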

Sentence

Splits on sentence boundaries (.!? followed by whitespace and a capital letter), then greedily fills chunks to the target size.

```python
chunks = chunk(text, strategy="sentence", target_tokens=300)
```

Best for: documents where sentence integrity matters (legal, academic).
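The boundary rule and greedy fill can be sketched as follows. This is a hypothetical standalone version (`sentence_chunks` is not a distillcore name), using the len // 4 token heuristic:

```python
import re

# Sentence boundary: .!? followed by whitespace and a capital letter
SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_chunks(text: str, target_tokens: int = 300) -> list[str]:
    est = lambda s: len(s) // 4
    sentences = [s.strip() for s in SENT_RE.split(text) if s.strip()]
    chunks, cur = [], ""
    for s in sentences:
        candidate = f"{cur} {s}".strip()
        if cur and est(candidate) > target_tokens:
            chunks.append(cur)  # chunk full; start a new one
            cur = s
        else:
            cur = candidate  # keep filling toward the target
    if cur:
        chunks.append(cur)
    return chunks
```

Because the split point requires a following capital letter, abbreviations like "e.g. the" are less likely to trigger a false boundary.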

Fixed

Pure sliding window at word boundaries. Produces uniform chunk sizes with configurable overlap.

```python
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)
```

Best for: embedding models with fixed context windows, where uniform chunk sizes matter.
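A sliding window at word boundaries can be approximated like this. The window and step sizes here are derived from the len // 4 heuristic; the function name and the words-per-token conversion are illustrative assumptions, not distillcore's code:

```python
def fixed_chunks(text: str, target_tokens: int = 500, overlap_tokens: int = 50) -> list[str]:
    words = text.split()
    if not words:
        return []
    # Approximate tokens per word from the document itself
    avg_tok = max(1, (len(text) // 4) // max(1, len(words)))
    step_words = max(1, (target_tokens - overlap_tokens) // avg_tok)
    size_words = max(1, target_tokens // avg_tok)
    chunks = []
    for i in range(0, len(words), step_words):
        chunks.append(" ".join(words[i:i + size_words]))
        if i + size_words >= len(words):
            break  # last window reached the end of the text
    return chunks
```

Each chunk starts `target_tokens - overlap_tokens` past the previous one, so consecutive chunks share roughly `overlap_tokens` worth of trailing words.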

LLM

Sends numbered sentences to an LLM (GPT-4o by default) which groups them into semantically coherent chunks. Large documents (>300 sentences) are processed in overlapping windows. Falls back to paragraph strategy on API error.

```python
chunks = chunk(text, strategy="llm", api_key="sk-...", target_tokens=500)
```

Best for: documents where topic boundaries matter and you want the highest-quality semantic chunks.

Requires distillcore[openai].

Full API

```python
from distillcore import chunk

chunks = chunk(
    text,                   # required
    strategy="paragraph",   # "paragraph", "sentence", "fixed", "llm"
    target_tokens=500,      # target chunk size in tokens
    max_tokens=1000,        # maximum chunk size (hard ceiling)
    overlap_tokens=50,      # overlap between consecutive chunks
    min_tokens=0,           # merge chunks below this threshold (0 = disabled)
    tokenizer=None,         # custom Callable[[str], int] for token counting
    api_key="",             # OpenAI API key (for strategy="llm")
    model="gpt-4o",         # LLM model (for strategy="llm")
)
```

Returns list[str] — a list of text chunks.

Async Version

```python
from distillcore import achunk

chunks = await achunk(text, strategy="paragraph", target_tokens=500)
```

Only the llm strategy is truly async (makes async API calls). Other strategies run synchronously and return immediately.

Token Estimation

```python
from distillcore import estimate_tokens

tokens = estimate_tokens("some text")  # len(text) // 4 by default
```

Pass a custom tokenizer for accurate counts:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = estimate_tokens("some text", tokenizer=lambda s: len(enc.encode(s)))
chunks = chunk(text, target_tokens=500, tokenizer=lambda s: len(enc.encode(s)))
```

Small Chunk Merging

Set min_tokens to merge chunks that are too small into their neighbors:

```python
chunks = chunk(text, target_tokens=500, min_tokens=100)
```

Chunks below min_tokens are merged into the previous chunk. A trailing small chunk is merged into the second-to-last.
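The merge rule can be sketched as a post-processing pass. This is an illustrative standalone function (not library code), again assuming len // 4 token estimation:

```python
def merge_small(chunks: list[str], min_tokens: int) -> list[str]:
    if min_tokens <= 0 or not chunks:
        return chunks
    est = lambda s: len(s) // 4
    merged: list[str] = []
    for c in chunks:
        if merged and est(c) < min_tokens:
            # Undersized chunk: fold into the previous chunk.
            # A trailing small chunk naturally merges into the second-to-last.
            merged[-1] = merged[-1] + "\n\n" + c
        else:
            merged.append(c)
    # A small leading chunk has no previous neighbor; fold it forward
    if len(merged) > 1 and est(merged[0]) < min_tokens:
        merged[1] = merged[0] + "\n\n" + merged[1]
        merged.pop(0)
    return merged
```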

Pipeline Integration

When using the full pipeline (process_document / process_text), chunking is controlled via ChunkConfig:

```python
from distillcore import process_document, DistillConfig, ChunkConfig

config = DistillConfig(
    chunk=ChunkConfig(
        strategy="auto",     # "auto" uses section/transcript/fallback logic
        target_tokens=500,
        overlap_chars=200,
        max_tokens=1000,
        min_tokens=50,
    ),
)
result = process_document("report.pdf", config=config)
```

The "auto" strategy (default) uses the pipeline’s section-aware chunking:

  • Transcripts with >50% turn coverage → group turns by target size
  • Documents with sections → one chunk per section, split large sections
  • No sections → split full text on paragraph boundaries

Named strategies ("paragraph", "sentence", "fixed", "llm") delegate directly to the standalone chunk() API.