MarkItDownConverter
A component that converts files to Documents using Microsoft's MarkItDown library.
| Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
| Mandatory run variables | sources: File paths or ByteStream objects |
| Output variables | documents: A list of documents |
| API reference | MarkItDown |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown |
Overview
MarkItDownConverter converts files into Haystack Documents using Microsoft's MarkItDown library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.
The converter accepts file paths or ByteStream objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the meta input parameter.
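As an illustration of the meta input, the sketch below builds one metadata dictionary per source and passes the list to the converter. The helper names `build_meta` and `convert_with_meta` and the keys `file_name` and `file_type` are hypothetical. The sketch also assumes that `meta` accepts a list of dictionaries aligned with `sources`, as other Haystack converters do, so check the API reference before relying on it.

```python
from pathlib import Path


def build_meta(paths, shared=None):
    """Return one metadata dict per path, in the same order as `paths`."""
    shared = shared or {}
    return [
        {
            **shared,
            "file_name": Path(p).name,
            "file_type": Path(p).suffix.lstrip(".").lower(),
        }
        for p in paths
    ]


def convert_with_meta(paths, shared=None):
    """Convert `paths` and attach the metadata built above (hypothetical helper)."""
    # Imported lazily so build_meta can be used without the integration installed.
    from haystack_integrations.components.converters.markitdown import (
        MarkItDownConverter,
    )

    converter = MarkItDownConverter()
    # Assumption: `meta` takes a list of dicts, one per entry in `sources`.
    result = converter.run(sources=list(paths), meta=build_meta(paths, shared))
    return result["documents"]
```

Calling `convert_with_meta(["document.pdf", "report.docx"], shared={"project": "demo"})` would then return Documents that carry both the shared and the per-file entries, provided the assumption about `meta` holds.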
Usage
Install the MarkItDown integration:
On its own
```python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
```
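To convert a whole folder rather than a hand-written list of paths, the sources can be collected first. This is a minimal sketch using only the standard library. `collect_sources`, `convert_folder`, and the `EXTENSIONS` set are hypothetical names, and the set lists only the formats named in the overview, so extend it as needed.

```python
from pathlib import Path

# Extensions named in the overview; MarkItDown handles more formats than these.
EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def collect_sources(root, extensions=EXTENSIONS):
    """Return sorted file paths under `root` whose suffix is in `extensions`."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def convert_folder(root):
    """Convert every matching file under `root` (hypothetical helper)."""
    # Imported lazily so collect_sources works without the integration installed.
    from haystack_integrations.components.converters.markitdown import (
        MarkItDownConverter,
    )

    converter = MarkItDownConverter()
    return converter.run(sources=collect_sources(root))["documents"]
```

Sorting the paths keeps the order of the resulting Documents stable between runs, which makes it easier to line them up with per-file metadata.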
In a pipeline
```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert -> clean -> split -> write
pipeline = Pipeline()
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Each component's documents output feeds the next component's documents input
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})
```