Version: 2.27-unstable

KreuzbergConverter

KreuzbergConverter converts files to Haystack Documents using Kreuzberg, a document intelligence framework with a Rust core that extracts text from 91+ file formats entirely locally with no external API calls.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory run variables: "sources": A list of file paths, directory paths, or ByteStream objects
Output variables: "documents": A list of documents
API reference: Kreuzberg
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/kreuzberg

Overview

The KreuzbergConverter takes a list of file paths, directory paths, or ByteStream objects and uses Kreuzberg to extract text and metadata. All processing is performed locally with no external API calls.

Supported format categories:

  • Documents: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODS, ODP, RTF, Pages, Keynote, Numbers, and more
  • Images (via OCR): PNG, JPEG, TIFF, GIF, BMP, WebP, JPEG 2000, SVG
  • Text/Markup: Markdown, HTML, XML, LaTeX, Typst, JSON, YAML, reStructuredText, Jupyter notebooks
  • Email: EML, MSG (with attachment extraction)
  • Archives: ZIP, TAR, GZIP, 7Z (extracts and processes contents recursively)
  • eBooks & Academic: EPUB, BibTeX, DocBook, JATS

The component returns one Haystack Document per source by default. When per-page extraction or chunking is enabled, it returns one Document per page or chunk instead. Documents include rich metadata such as quality scores, detected languages, extracted keywords, table data, and PDF annotations.

By default, batch processing is enabled, leveraging Rust's rayon thread pool for parallel extraction. Set batch=False for sequential processing.

You can customize extraction behavior with Kreuzberg's ExtractionConfig, either passed directly or loaded from a TOML, YAML, or JSON configuration file via config_path. See the Kreuzberg documentation for the full configuration reference.

Usage

Install the Kreuzberg integration:

shell
pip install kreuzberg-haystack

On its own

python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

converter = KreuzbergConverter()
result = converter.run(sources=["report.pdf", "notes.docx"])
documents = result["documents"]

In a pipeline

python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "presentation.pptx"]}})

Additional Features

Markdown Output with OCR

Use ExtractionConfig to customize the output format and OCR settings:

python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
from kreuzberg import ExtractionConfig, OcrConfig

converter = KreuzbergConverter(
    config=ExtractionConfig(
        output_format="markdown",
        ocr=OcrConfig(backend="tesseract", language="eng"),
    ),
)
result = converter.run(sources=["scanned_document.pdf"])
documents = result["documents"]

Per-Page Extraction

Create one Document per page using PageConfig:

python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
from kreuzberg import ExtractionConfig, PageConfig

converter = KreuzbergConverter(
    config=ExtractionConfig(
        page=PageConfig(extract_pages=True),
    ),
)
result = converter.run(sources=["multipage.pdf"])
# One Document per page, each with page_number in metadata

Token Reduction

Reduce output size for LLM consumption with TokenReductionConfig. Token reduction uses TF-IDF-based extractive summarization to identify and preserve the most important terms and phrases, progressively removing less critical content such as extra whitespace, filler words, and redundant phrases. Five modes are available:

  • "off": no reduction
  • "light": ~15% reduction
  • "moderate": ~30% reduction
  • "aggressive": ~50% reduction
  • "maximum": >50% reduction

python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
from kreuzberg import ExtractionConfig, TokenReductionConfig

converter = KreuzbergConverter(
    config=ExtractionConfig(
        token_reduction=TokenReductionConfig(mode="moderate"),
    ),
)

Config from File

Load extraction settings from a TOML, YAML, or JSON file:

python
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

converter = KreuzbergConverter(config_path="extraction_config.toml")

For the full configuration reference and format support matrix, see the Kreuzberg documentation.
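A hypothetical extraction_config.toml might mirror the Python config objects shown earlier. The key and table names below are assumptions for illustration; check the Kreuzberg configuration reference for the exact schema:

```toml
# Hypothetical example — verify key names against the Kreuzberg docs.
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[token_reduction]
mode = "moderate"
```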