MultiFileConverter

Converts CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX files to documents.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory run variables: "sources": A list of file paths or ByteStream objects
Output variables: "documents": A list of converted documents; "unclassified": A list of uncategorized file paths or byte streams
API reference: Converters
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/multi_file_converter.py

Overview

MultiFileConverter converts input files of various file types into documents.

It is a SuperComponent that combines a FileTypeRouter, nine converters and a DocumentJoiner into a single component.
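The router, converter, and joiner arrangement described above can be illustrated with a small standard-library sketch. This is not Haystack's implementation: the `SUPPORTED` table and the `route_sources` helper are hypothetical names invented here, and the sketch routes by file extension for simplicity, whereas FileTypeRouter is configured with MIME types. It only shows the idea that each source is grouped for a matching converter, and anything unrecognized ends up in a separate "unclassified" list.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension table for illustration only; the real component
# uses a FileTypeRouter rather than this lookup.
SUPPORTED = {".csv", ".docx", ".html", ".json", ".md", ".pptx", ".pdf", ".txt", ".xlsx"}


def route_sources(sources):
    """Group sources by file type and collect the ones no converter handles."""
    routed = defaultdict(list)
    unclassified = []
    for source in sources:
        suffix = Path(source).suffix.lower()
        if suffix in SUPPORTED:
            routed[suffix].append(source)
        else:
            unclassified.append(source)
    return dict(routed), unclassified


routed, unclassified = route_sources(["a.txt", "b.PDF", "c.xyz", "d.txt"])
print(routed)        # {'.txt': ['a.txt', 'd.txt'], '.pdf': ['b.PDF']}
print(unclassified)  # ['c.xyz']
```

In the real component, each group would be passed to its converter and the resulting documents merged into one list by the DocumentJoiner.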

Parameters

MultiFileConverter has no mandatory initialization parameters. Optionally, you can provide the encoding and json_content_key parameters.

For JSON files, the json_content_key parameter specifies which key in the extracted data becomes the document's content. The parameter is passed on to the underlying JSONConverter component.
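To make the role of a content key concrete, here is a minimal standard-library sketch. The `extract_content` helper and the sample JSON are hypothetical and do not reproduce JSONConverter's behavior; they only show what "choosing which key becomes the content" means.

```python
import json


def extract_content(raw_json: str, content_key: str) -> str:
    """Return the value stored under content_key as the document text."""
    data = json.loads(raw_json)
    return str(data[content_key])


# Hypothetical input: with content_key="body", the "body" value becomes the content.
raw = '{"title": "Greeting", "body": "Hello, world"}'
print(extract_content(raw, "body"))  # Hello, world
```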

The encoding parameter lets you specify the default encoding of the TXT, CSV, and MD files. If you don't provide any value, the component uses utf-8 by default. Note that if the encoding is specified in the metadata of an input ByteStream, it will override this parameter's setting. The parameter is passed on to the underlying TextFileToDocument and CSVToDocument components.
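The precedence rule described above (an encoding in the ByteStream metadata wins over the component default) can be sketched in plain Python. The `resolve_encoding` and `decode` helpers are hypothetical names used for illustration, and the `"encoding"` metadata key is an assumption; this is not the library's code.

```python
def resolve_encoding(meta: dict, default: str = "utf-8") -> str:
    """Prefer an encoding found in the source metadata over the default."""
    return meta.get("encoding", default)


def decode(data: bytes, meta: dict, default: str = "utf-8") -> str:
    """Decode raw bytes using the resolved encoding."""
    return data.decode(resolve_encoding(meta, default))


latin_bytes = "café".encode("latin-1")
print(decode(latin_bytes, {"encoding": "latin-1"}))  # café (metadata overrides the default)
print(decode("café".encode("utf-8"), {}))            # café (falls back to utf-8)
```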

Usage

Install dependencies for all supported file types to use the MultiFileConverter:

pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas

On its own

from haystack.components.converters import MultiFileConverter

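converter = MultiFileConverter()
# The result is a dictionary; the converted documents are under the "documents" key
result = converter.run(sources=["test.txt", "test.pdf"], meta={})
print(result["documents"])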

In a pipeline

You can also use MultiFileConverter in your indexing pipeline.

from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})

print(result)
# {'writer': {'documents_written': 3}}