Version: 2.31

MarkItDownConverter

A component that converts files to Documents using Microsoft's MarkItDown library.


Most common position in a pipeline	Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory run variables	`sources`: File paths or `ByteStream` objects
Output variables	`documents`: A list of documents
API reference	MarkItDown
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown
Package name	`markitdown-haystack`

Overview

MarkItDownConverter converts files into Haystack Documents using Microsoft's MarkItDown library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.

The converter accepts file paths or ByteStream objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the meta input parameter.
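As a minimal sketch of attaching metadata, the snippet below builds one metadata dictionary per source and passes the list through `meta`. The helper names `build_meta` and `convert_with_meta` are hypothetical, and the sketch assumes that `meta` accepts a list with one dictionary per source, as other Haystack converters do; check the API reference before relying on it.

```python
from pathlib import Path


def build_meta(sources):
    """Build one metadata dict per source path (hypothetical helper, not part of the integration)."""
    return [
        {"file_name": Path(s).name, "file_type": Path(s).suffix.lstrip(".")}
        for s in sources
    ]


def convert_with_meta(sources):
    """Run the converter and attach the per-source metadata built above."""
    # Imported here so build_meta can be used without the integration installed.
    from haystack_integrations.components.converters.markitdown import MarkItDownConverter

    converter = MarkItDownConverter()
    # Assumption: `meta` takes a list aligned with `sources`, one dict per entry.
    result = converter.run(sources=sources, meta=build_meta(sources))
    return result["documents"]


sources = ["reports/document.pdf", "reports/report.docx"]
meta = build_meta(sources)
print(meta)
```

Calling `convert_with_meta(sources)` requires `markitdown-haystack` to be installed and the listed files to exist.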

Note

This component returns Markdown content. Avoid piping it through DocumentCleaner() with its default settings because remove_extra_whitespaces=True and remove_empty_lines=True can collapse line breaks and flatten headings, tables, lists, and image tags. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
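To see why whitespace cleanup is risky for Markdown, here is a small illustration in plain Python. The `collapse_whitespace` function is a hypothetical stand-in for aggressive cleanup and is not how `DocumentCleaner` is implemented; it only shows what happens to Markdown structure once line breaks are lost.

```python
import re

# A small Markdown sample with a heading, a table, and a list.
markdown = (
    "# Title\n"
    "\n"
    "| a | b |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
    "\n"
    "- item one\n"
    "- item two\n"
)


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace, including newlines, with a single space."""
    return re.sub(r"\s+", " ", text).strip()


flattened = collapse_whitespace(markdown)
print(flattened)
# The heading, table rows, and list items now sit on one line,
# so a Markdown parser can no longer tell them apart.
```

Because Markdown relies on line breaks to separate block elements, any cleanup step that merges lines should be disabled or applied only to text that is not Markdown.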

Usage

Install the MarkItDown integration:

shell

pip install markitdown-haystack

On its own

python

from haystack_integrations.components.converters.markitdown import MarkItDownConverter

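# Create the converter with its default settings.
converter = MarkItDownConverter()
# Convert two local files; each path must point to an existing file.
result = converter.run(sources=["document.pdf", "report.docx"])
# The converted Documents are returned under the "documents" key.
documents = result["documents"]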

In a pipeline

python

from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

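# Store the indexed Documents in memory.
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
# Convert files to Markdown Documents, split them, then write them to the store.
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
# The converter feeds the splitter directly, with no cleaner in between.
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

# Run the indexing pipeline on two local files.
pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})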
