Version: 2.25

Kreuzberg

haystack_integrations.components.converters.kreuzberg.converter

KreuzbergConverter

Converts files to Documents using Kreuzberg.

Kreuzberg is a document intelligence framework that extracts text from PDFs, Office documents, images, and 75+ other formats. All processing is performed locally with no external API calls.

Usage Example:

python
from haystack_integrations.components.converters.kreuzberg import (
    KreuzbergConverter,
)

converter = KreuzbergConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]

You can also pass kreuzberg's ExtractionConfig to customize extraction:

python
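from kreuzberg import ExtractionConfig, OcrConfig

from haystack_integrations.components.converters.kreuzberg import (
    KreuzbergConverter,
)

# Emit Markdown and run OCR with Tesseract using the English language model.
converter = KreuzbergConverter(
    config=ExtractionConfig(
        output_format="markdown",
        ocr=OcrConfig(backend="tesseract", language="eng"),
    ),
)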

Token reduction can be configured via ExtractionConfig(token_reduction=TokenReductionConfig(mode="moderate")) to reduce output size for LLM consumption. Five levels are available: "off", "light", "moderate", "aggressive", "maximum". The reduced text appears directly in Document.content.

Image preprocessing for OCR can be tuned via OcrConfig(tesseract_config=TesseractConfig(preprocessing=ImagePreprocessingConfig(...))) with options for target DPI, auto-rotate, deskew, denoise, contrast enhancement, and binarization method.

__init__

python
__init__(
    *,
    config: ExtractionConfig | None = None,
    config_path: str | Path | None = None,
    store_full_path: bool = False,
    batch: bool = True,
    easyocr_kwargs: dict[str, Any] | None = None
) -> None

Create a KreuzbergConverter component.

Parameters:

  • config (ExtractionConfig | None) – An optional kreuzberg.ExtractionConfig object to customize extraction behavior. Use this to set output format, OCR backend and language, force-OCR mode, per-page extraction, chunking, keyword extraction, and other kreuzberg options. If not provided, kreuzberg's defaults are used. See the kreuzberg API reference for the full list of configuration options.
  • config_path (str | Path | None) – Path to a kreuzberg configuration file (.toml, .yaml, or .json). Cannot be used together with config.
  • store_full_path (bool) – If True, the full file path is stored in the Document metadata. If False, only the file name is stored.
  • batch (bool) – If True, use kreuzberg's batch extraction APIs, which leverage Rust's rayon thread pool for parallel processing. If False, sources are extracted one at a time.
  • easyocr_kwargs (dict[str, Any] | None) – Optional keyword arguments to pass to EasyOCR when using the "easyocr" backend. Supports GPU, beam width, model storage, and other EasyOCR-specific options. See the EasyOCR documentation for the full list of supported arguments.
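
Two of the rules above can be sketched in plain Python: config and config_path are mutually exclusive, and store_full_path decides whether the full path or only the file name is recorded. This is an illustration only, not the component's actual code. The helper names check_config_args and file_path_meta, the "file_path" metadata key, and the use of ValueError are all assumptions made for the example.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def check_config_args(config: object | None, config_path: str | Path | None) -> None:
    """Hypothetical helper: reject calls that pass both configuration sources."""
    # Assumption: ValueError is a guess; the reference only says the two
    # arguments cannot be used together.
    if config is not None and config_path is not None:
        raise ValueError("Pass either config or config_path, not both.")


def file_path_meta(source: str | Path, store_full_path: bool) -> dict[str, Any]:
    """Hypothetical helper: build the path entry of a Document's metadata."""
    path = Path(source)
    # Assumption: the metadata key name "file_path" is illustrative.
    return {"file_path": str(path) if store_full_path else path.name}


if __name__ == "__main__":
    check_config_args(None, "kreuzberg.toml")  # a single source is accepted
    print(file_path_meta("reports/2024/summary.pdf", store_full_path=False))
    print(file_path_meta("reports/2024/summary.pdf", store_full_path=True))
```

With store_full_path=False, two files that share a name in different directories end up with identical path metadata, which is worth keeping in mind when Documents are later traced back to their sources.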

to_dict

python
to_dict() -> dict[str, Any]

Serialize this component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> KreuzbergConverter

Deserialize this component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • KreuzbergConverter – Deserialized component.

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Convert files to Documents using Kreuzberg.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths, directory paths, or ByteStream objects to convert. Directory paths are expanded to their direct file children (non-recursive, sorted alphabetically).
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
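
The directory expansion described for sources (direct file children only, non-recursive, sorted) can be approximated with pathlib. The function expand_sources below is a hypothetical stand-in rather than the component's implementation. It ignores ByteStream inputs, and the exact sort order used by the component is an assumption.

```python
from __future__ import annotations

from pathlib import Path


def expand_sources(sources: list[str | Path]) -> list[Path]:
    """Hypothetical helper: replace each directory with its direct file children."""
    expanded: list[Path] = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # Keep only direct children that are files; subdirectories are
            # skipped rather than descended into. Sorting Path objects is an
            # assumed reading of "sorted alphabetically".
            expanded.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            # Plain file paths pass through unchanged, in their original position.
            expanded.append(path)
    return expanded
```

Because expansion is non-recursive, files in nested folders need their own entry in sources (or the nested directory itself must be listed).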

Note: When directories are present in sources, meta must be a single dictionary (not a list), since the number of files in a directory is not known in advance.
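
The meta rules can be summarized as a small normalization step that yields one metadata dictionary per converted file. The sketch below is an assumption-laden illustration, not the component's code. The name normalize_meta, the has_directories flag, and the ValueError exceptions are hypothetical, and num_sources stands for the number of files after any directory expansion.

```python
from __future__ import annotations

from typing import Any


def normalize_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    num_sources: int,
    has_directories: bool = False,
) -> list[dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(num_sources)]
    if isinstance(meta, dict):
        # A single dictionary is copied onto every produced Document.
        return [dict(meta) for _ in range(num_sources)]
    if has_directories:
        # Assumption: the error type is a guess. The list length cannot be
        # validated when the file count of a directory is unknown up front.
        raise ValueError("meta must be a single dict when sources contain directories.")
    if len(meta) != num_sources:
        raise ValueError("The length of meta must match the number of sources.")
    # A list is paired with the sources position by position.
    return [dict(entry) for entry in meta]
```

Copying each dictionary keeps later per-Document additions (such as a ByteStream's own meta) from leaking into the caller's input.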

Returns:

  • dict[str, list[Document]] – A dictionary with a single key, documents, which holds the list of created Documents.