Version: 2.25

Markitdown

haystack_integrations.components.converters.markitdown.markitdown_converter

MarkItDownConverter

Converts files to Haystack Documents using MarkItDown.

MarkItDown is a Microsoft library that converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, images, audio, and more. All processing is performed locally.

Usage example

python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]

__init__

python
__init__(store_full_path: bool = False) -> None

Initializes the MarkItDownConverter.

Parameters:

  • store_full_path (bool) – If True, the full file path is stored in the Document metadata. If False, only the file name is stored. Defaults to False.
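The effect of store_full_path can be sketched in plain Python. This is an illustration of the behavior described above, not the converter's actual code: the helper name `path_for_metadata` is hypothetical, and the exact metadata key the converter writes to is not shown here.

```python
from __future__ import annotations

from pathlib import Path


def path_for_metadata(source: str | Path, store_full_path: bool = False) -> str:
    """Hypothetical helper: return the path string that would be kept in metadata."""
    path = Path(source)
    # With store_full_path=True the whole path is kept; otherwise only the file name.
    return str(path) if store_full_path else path.name


# Default behavior keeps only the file name.
print(path_for_metadata("reports/2024/report.docx"))  # report.docx

# Opting in keeps the full path as it was passed.
print(path_for_metadata("reports/2024/report.docx", store_full_path=True))
```

Storing only the file name avoids leaking local directory layouts into Document metadata; the full path is useful when Documents need to be traced back to a specific file on disk.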

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Converts files to Documents using MarkItDown.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. Can be a single dict applied to all Documents, or a list of dicts with one entry per source, in the same order as sources.
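The two accepted shapes of meta can be illustrated with a small standalone sketch. The function `align_meta` below is hypothetical and is not part of the integration; in particular, raising `ValueError` on a length mismatch is an assumption about how such a mismatch might be handled, not documented behavior.

```python
from __future__ import annotations

from typing import Any


def align_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: expand meta into one dict per source."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a list must have exactly one entry per source.
        raise ValueError("meta list must have the same length as sources")
    # A list is matched to sources by position.
    return [dict(entry) for entry in meta]


sources = ["document.pdf", "report.docx"]

# One dict shared by all sources.
shared = align_meta({"project": "demo"}, len(sources))

# One dict per source, in the same order as sources.
per_source = align_meta([{"kind": "pdf"}, {"kind": "docx"}], len(sources))
```

A single dict suits metadata shared by a whole batch, while a list lets each source carry its own values.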

Returns:

  • dict[str, list[Document]] – A dictionary with the key "documents", whose value is the list of converted Documents.