
UnstructuredFileConverter

Use this component to convert files and directories of files into Documents.

Name: UnstructuredFileConverter
Path: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured
Position in a Pipeline: Before PreProcessors, or right at the beginning of an indexing Pipeline
Mandatory Inputs: “paths”: a list of file or directory paths (strings or path-like objects)
Outputs: “documents”: a list of Documents

Overview

UnstructuredFileConverter converts files and directories into Documents using the Unstructured API.

Unstructured provides a set of tools for ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a wide range of file formats.

Usage

If you plan to use the hosted version of the Unstructured API, set your Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:

export UNSTRUCTURED_API_KEY=your_api_key
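
Before creating the converter, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch using only the standard library. `require_unstructured_api_key` is a hypothetical helper, not part of Haystack or the Unstructured integration; it assumes only the variable name given above.

```python
import os


def require_unstructured_api_key(env=os.environ):
    """Return the API key from the given environment mapping.

    Hypothetical helper for illustration only; it is not part of
    Haystack or the Unstructured integration.
    """
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        # Fail early with a clear message instead of a later API error
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before creating the converter."
        )
    return key
```

Calling `require_unstructured_api_key()` at the start of a script raises a `RuntimeError` when the variable is missing or empty, so the problem surfaces before any file is sent for conversion.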

On its own

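from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# With the hosted API, the key is read from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()

# Paths can point to individual files or to directories
result = converter.run(paths=["a/file/path.pdf", "a/directory/path"])
documents = result["documents"]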
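
Because the converter calls the Unstructured API, a typo in a path is cheaper to catch before the call. The following is a minimal sketch using only the standard library. `split_existing_paths` is a hypothetical helper, not part of the integration, and it makes no claim about how the converter itself handles missing paths.

```python
from pathlib import Path


def split_existing_paths(paths):
    """Split paths into those that exist on disk and those that do not.

    Hypothetical helper for illustration only. Both files and
    directories count as existing; input order is preserved.
    """
    existing, missing = [], []
    for p in paths:
        target = existing if Path(p).exists() else missing
        target.append(str(p))
    return existing, missing
```

One possible use is to pass only the `existing` list as `paths` to the converter and to log or raise on anything in `missing`.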

In a Pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Convert the files to Documents, then write them to the document store
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})

Related Links

Check out the API reference in the GitHub repo or in our docs.