MarkItDownConverter
A component that converts files to Documents using Microsoft's MarkItDown library.
| | |
| --- | --- |
| Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
| Mandatory run variables | sources: File paths or ByteStream objects |
| Output variables | documents: A list of documents |
| API reference | MarkItDown |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown |
Overview
MarkItDownConverter converts files into Haystack Documents using Microsoft's MarkItDown library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.
The converter accepts file paths or ByteStream objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the meta input parameter.
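As a rough illustration of attaching metadata, the sketch below pairs each source with a metadata dictionary in plain Python. It is not the converter's code: pair_sources_with_meta is a hypothetical helper, and the rule it applies (a single dictionary is copied to every source, while a list must match the sources one to one) is an assumption based on how Haystack converters commonly treat meta. Check the API reference for the exact behavior.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

Meta = Dict[str, Any]


def pair_sources_with_meta(
    sources: List[str],
    meta: Optional[Union[Meta, List[Meta]]] = None,
) -> List[Tuple[str, Meta]]:
    """Hypothetical helper: pair every source with its own metadata dict."""
    if meta is None:
        # No metadata given: every source gets an empty dict.
        metas: List[Meta] = [{} for _ in sources]
    elif isinstance(meta, dict):
        # One dict given: copy it so each source owns a separate instance.
        metas = [dict(meta) for _ in sources]
    else:
        # A list given: it must line up with the sources one to one.
        if len(meta) != len(sources):
            raise ValueError("meta list must have one entry per source")
        metas = [dict(entry) for entry in meta]
    return list(zip(sources, metas))


# The same metadata for all files.
shared = pair_sources_with_meta(["document.pdf", "report.docx"], {"team": "docs"})

# Different metadata per file.
per_file = pair_sources_with_meta(
    ["document.pdf", "report.docx"],
    [{"lang": "en"}, {"lang": "de"}],
)
```

Copying the dictionary per source keeps a later change to one Document's metadata from leaking into the others.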
This component returns Markdown content. Avoid piping it through DocumentCleaner() with its default settings because remove_extra_whitespaces=True and remove_empty_lines=True can collapse line breaks and flatten headings, tables, lists, and image tags. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
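To see why whitespace cleanup is risky for Markdown, the sketch below applies two simplified stand-ins for those options to a small Markdown string. The helper names and the regular expression are assumptions chosen for illustration, not DocumentCleaner's actual implementation, but they show the kind of damage to expect.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Stand-in for whitespace cleanup: runs of two or more whitespace
    characters (including newlines) become a single space."""
    return re.sub(r"\s\s+", " ", text).strip()


def drop_empty_lines(text: str) -> str:
    """Stand-in for empty-line removal: keep only lines with visible text."""
    return "\n".join(line for line in text.split("\n") if line.strip())


markdown = "# Report\n\nSummary text.\n\n- first\n  - nested\n"

collapsed = collapse_whitespace(markdown)
print(collapsed)
# The heading, the paragraph, and both list items end up on one line.

without_blank_lines = drop_empty_lines(markdown)
print(without_blank_lines)
# The blank lines that separated the heading, paragraph, and list are gone.
```

In the collapsed output the heading marker is followed by the whole document on a single line, so a Markdown renderer or a structure-aware splitter no longer sees a separate paragraph or a nested list.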
Usage
Install the MarkItDown integration:
On its own
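from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
# Sources can be file paths or ByteStream objects.
result = converter.run(sources=["document.pdf", "report.docx"])
# The converted Documents are returned under the "documents" key.
documents = result["documents"]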
In a pipeline
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Convert files to Documents, split them into five-sentence chunks, then write the chunks to the store.
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})