distillcore

Extractors

Extractors convert raw files into text. distillcore ships with four built-in extractors and supports custom extractors via a protocol.

Built-in Extractors

| Extractor | Formats | Dependency |
| --- | --- | --- |
| TextExtractor | txt, text, md, markdown | None (stdlib) |
| PdfExtractor | pdf | pdfplumber |
| DocxExtractor | docx | python-docx |
| HtmlExtractor | html, htm | beautifulsoup4 |

All built-in extractors are registered automatically when distillcore is imported.
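Three of the four built-in extractors depend on a third-party package, so it can help to check which of those packages are present in the current environment. The sketch below is not part of distillcore: `EXTRACTOR_MODULES` and `available_dependencies` are hypothetical names, and the sketch makes no claim about how distillcore itself behaves when a dependency is missing. It relies only on the fact that python-docx is imported as `docx` and beautifulsoup4 as `bs4`.

```python
from importlib.util import find_spec

# Importable module name behind each extractor's dependency.
# python-docx installs as `docx`; beautifulsoup4 installs as `bs4`.
EXTRACTOR_MODULES = {
    "TextExtractor": None,  # standard library only
    "PdfExtractor": "pdfplumber",
    "DocxExtractor": "docx",
    "HtmlExtractor": "bs4",
}


def available_dependencies() -> dict[str, bool]:
    """Report, per extractor, whether its third-party module can be found."""
    return {
        name: module is None or find_spec(module) is not None
        for name, module in EXTRACTOR_MODULES.items()
    }
```

Calling `available_dependencies()` returns a dictionary keyed by extractor name, with `True` where the backing module is installed.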

Using the Extract API

    from distillcore import extract

    result = extract("report.pdf")
    print(result.full_text[:200])
    print(f"{result.page_count} pages")

    # With explicit format
    result = extract("data.txt", format="markdown")

The extract() function auto-detects the format from the file extension. You can override the detected format with the format parameter.

Extractor Protocol

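Custom extractors implement two members: a formats attribute listing the format names they handle, and an extract(source, config=None) method that returns an ExtractionResult: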

    # Assumption: ExtractionResult is importable from the top-level package.
    from distillcore import ExtractionResult, register_extractor


    class CsvExtractor:
        # Format names this extractor handles
        formats = ["csv", "tsv"]

        def extract(self, source, config=None):
            # source is a Path object
            text = source.read_text()
            # Treat the whole file as a single page
            pages_text = [{"page_number": 1, "text": text}]
            return ExtractionResult(
                full_text=text,
                pages_text=pages_text,
                page_count=1,
            )


    register_extractor(CsvExtractor())

After registration, extract("data.csv") and process_document("data.csv") will use your extractor.

Format Detection

extract() detects formats by file extension. The mapping is:

  • .pdf → pdf
  • .docx → docx
  • .html, .htm → html
  • .txt, .text → txt
  • .md, .markdown → markdown

Unrecognized extensions raise ValueError unless you specify format= explicitly.
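The detection rule described above can be sketched as a simple lookup. This is an illustration rather than distillcore's actual code: `EXTENSION_FORMATS` and `detect_format` are hypothetical names, and lowercasing the extension before lookup is an assumption the text does not confirm.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format table copied from the list above.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".txt": "txt",
    ".text": "txt",
    ".md": "markdown",
    ".markdown": "markdown",
}


def detect_format(path: str, format: Optional[str] = None) -> str:
    """Return the explicit format if given, else look up the file extension."""
    if format is not None:
        return format
    suffix = Path(path).suffix.lower()  # assumption: matching is case-insensitive
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unrecognized extension: {suffix!r}") from None
```

With this sketch, `detect_format("report.pdf")` yields `"pdf"`, while `detect_format("data.csv")` raises `ValueError` unless a `format` argument is supplied.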