Chunking
distillcore provides a standalone chunking API that works without LLM calls or API keys (the llm strategy is the one exception). Import chunk and go.
Strategies
Paragraph (default)
Splits on paragraph boundaries (\n\n). Oversized paragraphs are subsplit using cascading strategies: line breaks → sentence boundaries → hard cut at word boundary. Supports overlap.
from distillcore import chunk
chunks = chunk(text, strategy="paragraph", target_tokens=500, overlap_tokens=50)Best for: general documents, articles, reports.
Sentence
Splits on sentence boundaries (.!? followed by whitespace and a capital letter), then greedily fills chunks to the target size.
chunks = chunk(text, strategy="sentence", target_tokens=300)Best for: documents where sentence integrity matters (legal, academic).
Fixed
Pure sliding window at word boundaries. Produces uniform chunk sizes with configurable overlap.
chunks = chunk(text, strategy="fixed", target_tokens=500, overlap_tokens=50)Best for: uniform chunk sizes for embedding models with fixed context windows.
LLM
Sends numbered sentences to an LLM (GPT-4o by default), which groups them into semantically coherent chunks. Large documents (>300 sentences) are processed in overlapping windows. Falls back to the paragraph strategy on API error.
chunks = chunk(text, strategy="llm", api_key="sk-...", target_tokens=500)Best for: documents where topic boundaries matter and you want the highest-quality semantic chunks.
Requires distillcore[openai].
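Rather than hard-coding the key, you can pull it from the environment (the variable name below is a common convention, not something distillcore reads itself):

import os
from distillcore import chunk

chunks = chunk(
    text,
    strategy="llm",
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o",
    target_tokens=500,
)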
Full API
from distillcore import chunk
chunks = chunk(
text, # required
strategy="paragraph", # "paragraph", "sentence", "fixed", "llm"
target_tokens=500, # target chunk size in tokens
max_tokens=1000, # maximum chunk size (hard ceiling)
overlap_tokens=50, # overlap between consecutive chunks
min_tokens=0, # merge chunks below this threshold (0 = disabled)
tokenizer=None, # custom Callable[[str], int] for token counting
api_key="", # OpenAI API key (for strategy="llm")
model="gpt-4o", # LLM model (for strategy="llm")
)

Returns list[str]: a list of text chunks.
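A typical pattern is to sanity-check chunk sizes with the bundled estimate_tokens:

from distillcore import chunk, estimate_tokens

chunks = chunk(text, strategy="paragraph", target_tokens=500, max_tokens=1000)
for i, c in enumerate(chunks):
    print(f"chunk {i}: ~{estimate_tokens(c)} tokens")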
Async Version
from distillcore import achunk
chunks = await achunk(text, strategy="paragraph", target_tokens=500)

Only the llm strategy is truly async (makes async API calls). Other strategies run synchronously and return immediately.
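A minimal driver for the async variant (the input text here is a placeholder):

import asyncio
from distillcore import achunk

async def main() -> None:
    text = "..."  # your document text
    chunks = await achunk(text, strategy="llm", api_key="sk-...", target_tokens=500)
    print(len(chunks), "chunks")

asyncio.run(main())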
Token Estimation
from distillcore import estimate_tokens
tokens = estimate_tokens("some text")  # len(text) // 4 by default

Pass a custom tokenizer for accurate counts:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = estimate_tokens("some text", tokenizer=lambda s: len(enc.encode(s)))
chunks = chunk(text, target_tokens=500, tokenizer=lambda s: len(enc.encode(s)))

Small Chunk Merging
Set min_tokens to merge chunks that are too small into their neighbors:
chunks = chunk(text, target_tokens=500, min_tokens=100)

Chunks below min_tokens are merged into the previous chunk. A trailing small chunk is merged into the second-to-last.
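To see the effect, compare chunk counts with and without the threshold (the input file is a placeholder):

from distillcore import chunk

text = open("notes.txt").read()  # any document text
without_merge = chunk(text, target_tokens=500)
with_merge = chunk(text, target_tokens=500, min_tokens=100)
# sub-100-token chunks are folded into their neighbors,
# so with_merge has at most as many chunks as without_merge
print(len(without_merge), len(with_merge))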
Pipeline Integration
When using the full pipeline (process_document / process_text), chunking is controlled via ChunkConfig:
from distillcore import process_document, DistillConfig, ChunkConfig
config = DistillConfig(
chunk=ChunkConfig(
strategy="auto", # "auto" uses section/transcript/fallback logic
target_tokens=500,
overlap_chars=200,
max_tokens=1000,
min_tokens=50,
),
)
result = process_document("report.pdf", config=config)

The "auto" strategy (default) uses the pipeline’s section-aware chunking:
- Transcripts with >50% turn coverage → group turns by target size
- Documents with sections → one chunk per section, split large sections
- No sections → split full text on paragraph boundaries
Named strategies ("paragraph", "sentence", "fixed", "llm") delegate directly to the standalone chunk() API.
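For example, this config makes the pipeline chunk with the sentence strategy (long_text is a placeholder for your input):

from distillcore import process_text, DistillConfig, ChunkConfig

config = DistillConfig(
    chunk=ChunkConfig(strategy="sentence", target_tokens=300),
)
result = process_text(long_text, config=config)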