
UnstructuredFileConverter

Use this component to convert files and directories into documents.

Name: UnstructuredFileConverter
Source: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured
Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory input variables: "paths": A list of paths to files or directories
Output variables: "documents": A list of documents

Overview

UnstructuredFileConverter converts files and directories into documents using the Unstructured API.

Unstructured provides a series of tools to do ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API that extracts text and other information from a vast range of file formats.

Usage

If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as an environment variable UNSTRUCTURED_API_KEY:

export UNSTRUCTURED_API_KEY=your_api_key
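
Before creating the converter, it can be useful to confirm that the variable is actually visible to your Python process. The following is a minimal sketch that uses only the standard library; the helper name require_api_key is hypothetical and not part of the integration.

```python
import os


def require_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing or blank.

    Hypothetical helper for illustration; it only reads the environment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before using the hosted Unstructured API."
        )
    return key
```

Calling require_api_key() at the start of a script makes a missing key fail early with a clear message instead of surfacing later as a failed API request.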

On its own

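from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# With the hosted API, the key is taken from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()
# Paths can point to individual files or to directories
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]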
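
Since the paths input mixes files and directories, you may want to filter out paths that do not exist before handing the list to the converter. This is a sketch under that assumption; existing_paths is a hypothetical helper, not part of the integration, and it relies only on pathlib.

```python
from pathlib import Path
from typing import Iterable, List


def existing_paths(candidates: Iterable[str]) -> List[str]:
    """Keep only the candidates that exist on disk (files or directories).

    Hypothetical helper for illustration; the original order is preserved.
    """
    return [candidate for candidate in candidates if Path(candidate).exists()]
```

The filtered list can then be passed as the paths argument, for example converter.run(paths=existing_paths(["a/file/path.pdf", "a/directory/path"])).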

In a pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})

Related Links

Check out the API reference in the GitHub repo or in our docs: