Version: 2.26

MarkItDownConverter

A component that converts files to Documents using Microsoft's MarkItDown library.

Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory run variables: sources (file paths or ByteStream objects)
Output variables: documents (a list of Documents)
API reference: MarkItDown
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown

Overview

MarkItDownConverter converts files into Haystack Documents using Microsoft's MarkItDown library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.

The converter accepts file paths or ByteStream objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the meta input parameter.
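
As a minimal sketch of how metadata might be attached, the helper below builds one metadata dict per file path and passes the list through the meta parameter. It assumes that meta accepts a list of dicts aligned with sources, as is common for Haystack converters. The names build_meta and convert_with_meta and the keys file_name, file_type, and collection are hypothetical and chosen only for illustration. The helper handles file paths only, not ByteStream objects.

```python
from pathlib import Path


def build_meta(sources, shared=None):
    """Return one metadata dict per file path: shared fields plus name and type."""
    shared = shared or {}
    meta = []
    for source in sources:
        path = Path(source)
        meta.append(
            {
                **shared,
                "file_name": path.name,  # hypothetical key
                "file_type": path.suffix.lstrip(".").lower(),  # hypothetical key
            }
        )
    return meta


def convert_with_meta(sources, shared=None):
    """Convert file paths and attach per-file metadata to the resulting Documents."""
    # Imported lazily so build_meta can be used without the integration installed.
    from haystack_integrations.components.converters.markitdown import (
        MarkItDownConverter,
    )

    converter = MarkItDownConverter()
    # Assumption: `meta` accepts a list of dicts, one per entry in `sources`.
    result = converter.run(sources=sources, meta=build_meta(sources, shared))
    return result["documents"]
```

Calling convert_with_meta(["document.pdf", "report.docx"], shared={"collection": "reports"}) would then tag every resulting Document with the shared field plus its own file name and type.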

Usage

Install the MarkItDown integration:

shell
pip install markitdown-haystack

On its own

python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
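
Instead of listing paths by hand, the sources list can be gathered from a directory. The sketch below uses only the standard library. The function name collect_sources is hypothetical, and the extension set covers only the formats named in the overview, so it is not a complete list of what MarkItDown supports.

```python
from pathlib import Path

# Only the formats named in the overview; MarkItDown handles more than these.
EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def collect_sources(root, extensions=EXTENSIONS):
    """Return sorted paths (as strings) of files under root with a matching extension."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")  # walk subdirectories as well
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The returned list can be passed directly as sources to converter.run.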

In a pipeline

python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})