MultiFileConverter
Converts CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX files to documents.
Most common position in a pipeline | Before PreProcessors, or right at the beginning of an indexing pipeline |
Mandatory run variables | "sources": A list of file paths or ByteStream objects |
Output variables | "documents": A list of converted documents; "unclassified": A list of uncategorized file paths or byte streams |
API reference | Converters |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/multi_file_converter.py |
Overview
MultiFileConverter converts input files of various file types into documents. It is a SuperComponent that combines a FileTypeRouter, nine converters, and a DocumentJoiner into a single component.
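The route, convert, and join pattern described above can be sketched in plain Python. This is a simplified illustration, not Haystack's implementation: the suffix table, the `route_sources` and `convert_all` helpers, and the stand-in converters are all hypothetical, and the real router's matching rules may differ.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffix table for illustration only.
SUFFIX_TO_TYPE = {
    ".csv": "csv", ".docx": "docx", ".html": "html", ".json": "json", ".md": "md",
    ".pptx": "pptx", ".pdf": "pdf", ".txt": "txt", ".xlsx": "xlsx",
}


def route_sources(sources):
    """Group file paths by type; paths with unknown suffixes go to 'unclassified'."""
    routed = defaultdict(list)
    for source in sources:
        file_type = SUFFIX_TO_TYPE.get(Path(source).suffix.lower(), "unclassified")
        routed[file_type].append(source)
    return dict(routed)


def convert_all(sources, converters):
    """Run each group through its converter and join the results into one list."""
    routed = route_sources(sources)
    unclassified = routed.pop("unclassified", [])
    documents = []
    for file_type, paths in routed.items():
        documents.extend(converters[file_type](paths))
    return {"documents": documents, "unclassified": unclassified}


# Stand-in converters that only label each path; real converters would parse the files.
converters = {
    t: (lambda paths, t=t: [f"{t}:{p}" for p in paths]) for t in SUFFIX_TO_TYPE.values()
}
result = convert_all(["notes.txt", "report.pdf", "archive.zip"], converters)
print(result)
```

The sketch returns the same two output keys as the component: converted items under "documents" and anything it could not route under "unclassified".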
Parameters
To initialize MultiFileConverter, there are no mandatory parameters. Optionally, you can provide the encoding and json_content_key parameters.
The json_content_key parameter lets you specify, for JSON files, which key in the extracted data becomes the document's content. The parameter is passed on to the underlying JSONConverter component.
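To make the role of a content key concrete, here is a minimal sketch using only Python's standard json module. The `extract_content` helper and the sample record are hypothetical; the sketch shows the idea of selecting one key as the document text and does not reproduce JSONConverter's actual logic.

```python
import json


def extract_content(raw_json: str, content_key: str) -> str:
    """Return the value stored under content_key as the document's text."""
    record = json.loads(raw_json)
    return str(record[content_key])


# Sample record invented for this example.
raw = '{"title": "Release notes", "body": "Fixed two bugs."}'
print(extract_content(raw, content_key="body"))
```

With a record shaped like this one, setting json_content_key to "body" would be the natural choice, since that key holds the text you want indexed.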
The encoding parameter lets you specify the default encoding of the TXT, CSV, and MD files. If you don't provide any value, the component uses utf-8 by default. Note that if the encoding is specified in the metadata of an input ByteStream, it overrides this parameter's setting. The parameter is passed on to the underlying TextFileToDocument and CSVToDocument components.
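The precedence rule for encodings can be illustrated with a short sketch. The `resolve_encoding` and `decode` helpers are hypothetical, and the "encoding" metadata key name is an assumption made for this example; the sketch only shows the described behavior, where a per-stream encoding wins over the component-level default.

```python
def resolve_encoding(stream_meta: dict, default_encoding: str = "utf-8") -> str:
    """An encoding found in the stream's metadata wins over the default."""
    # Assumption: the metadata key is called "encoding".
    return stream_meta.get("encoding", default_encoding)


def decode(data: bytes, stream_meta: dict, default_encoding: str = "utf-8") -> str:
    """Decode raw bytes using the resolved encoding."""
    return data.decode(resolve_encoding(stream_meta, default_encoding))


latin_bytes = "café".encode("latin-1")
# The metadata entry overrides the utf-8 default, so the text decodes correctly.
print(decode(latin_bytes, {"encoding": "latin-1"}))
# With no metadata entry, the default encoding applies.
print(decode("café".encode("utf-8"), {}))
```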
Usage
Install dependencies for all supported file types to use the MultiFileConverter:
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
On its own
from haystack.components.converters import MultiFileConverter
converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
In a pipeline
You can also use MultiFileConverter in your indexing pipeline.
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")
result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})
print(result)
# {'writer': {'documents_written': 3}}