Markitdown
haystack_integrations.components.converters.markitdown.markitdown_converter
MarkItDownConverter
Converts files to Haystack Documents using MarkItDown.
MarkItDown is a Microsoft library that converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, images, audio, and more. All processing is performed locally.
Usage example
python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter
converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
__init__
Initializes the MarkItDownConverter.
Parameters:
- store_full_path (bool) – If True, the full file path is stored in the Document metadata. If False, only the file name is stored. Defaults to False.
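The effect of store_full_path can be illustrated with a small standalone sketch. It does not call the converter: file_path_meta is a hypothetical helper, and the "file_path" metadata key is an assumption made for illustration, not something stated in this reference.

```python
from pathlib import PurePosixPath


def file_path_meta(source: str, store_full_path: bool = False) -> dict[str, str]:
    """Hypothetical helper, not part of MarkItDownConverter.

    Mimics the documented behaviour: keep the full path when
    store_full_path is True, otherwise keep only the file name.
    """
    path = PurePosixPath(source)
    # Assumption: the value is stored under a "file_path" metadata key.
    return {"file_path": str(path) if store_full_path else path.name}


print(file_path_meta("reports/2024/report.docx"))
print(file_path_meta("reports/2024/report.docx", store_full_path=True))
```

Storing only the file name (the default) avoids leaking local directory layouts into Document metadata, while the full path can help when several sources share the same file name.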
run
python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
Converts files to Documents using MarkItDown.
Parameters:
- sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
- meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. Can be a single dict applied to all Documents, or a list of dicts aligned with sources.
Returns:
dict[str, list[Document]] – A dictionary with the key documents, containing the converted Documents.
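The two accepted shapes of meta can be sketched in plain Python. This is a minimal illustration of the documented behaviour, not the converter's actual implementation: align_meta is a hypothetical helper, and raising ValueError on a length mismatch is an assumption.

```python
from __future__ import annotations

from typing import Any


def align_meta(
    meta: dict[str, Any] | list[dict[str, Any]] | None,
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        # No metadata given: every Document gets an empty dict.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each Document gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Assumption: a list that is not aligned with sources is an error.
        raise ValueError("meta list length must match the number of sources")
    return [dict(m) for m in meta]


sources = ["document.pdf", "report.docx"]
print(align_meta({"project": "demo"}, len(sources)))
print(align_meta([{"lang": "en"}, {"lang": "de"}], len(sources)))
```

With a single dict, both sources receive the same metadata; with a list, the first dict goes to document.pdf and the second to report.docx.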