UnstructuredFileConverter
Use this component to convert text files and directories into documents.
Name | UnstructuredFileConverter |
Source | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured |
Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
Mandatory input variables | “paths”: A list of file or directory paths (strings or os.PathLike objects) |
Output variables | “documents”: A list of documents |
Overview
UnstructuredFileConverter converts files and directories into documents using the Unstructured API.
Unstructured provides a series of tools to do ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a vast range of file formats.
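Because the converter accepts both files and directories, it can help to preview which files a mixed list of paths would cover before sending anything to the API. The following is a minimal sketch using only the standard library. The helper name `expand_paths` is hypothetical, and the rule it applies (include files directly, and include only the files at the top level of each directory) is an assumption for illustration, not necessarily how the component resolves directories internally.

```python
from pathlib import Path
from typing import Iterable, List, Union
import os


def expand_paths(paths: Iterable[Union[str, os.PathLike]]) -> List[Path]:
    """Flatten a mixed list of file and directory paths into a list of files.

    Hypothetical helper for illustration only: directories contribute the
    files found directly inside them (no recursion); paths that do not
    exist are skipped.
    """
    files: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            files.append(path)
        elif path.is_dir():
            # Sort so the result is deterministic across platforms.
            files.extend(sorted(child for child in path.iterdir() if child.is_file()))
    return files
```

Calling `expand_paths(["a/file/path.pdf", "a/directory/path"])` returns the PDF (if it exists) followed by the files found directly inside the directory, which gives a rough idea of how many files a given "paths" input covers.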
Usage
If you plan to use the hosted version of the Unstructured API, set your Unstructured API key in the UNSTRUCTURED_API_KEY environment variable:
export UNSTRUCTURED_API_KEY=your_api_key
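If the variable is missing, the failure may only surface later as an authentication error from the API. A minimal sketch of an early check, assuming you want to fail fast before building the converter, could look like this. The helper `get_unstructured_api_key` is hypothetical and not part of the integration.

```python
import os
from typing import Mapping, Optional


def get_unstructured_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of UNSTRUCTURED_API_KEY or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and exists only so
    the check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before using "
            "the hosted Unstructured API."
        )
    return key
```

Calling `get_unstructured_api_key()` at the start of a script turns a missing key into an immediate, readable error.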
On its own
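from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
# With the hosted API, the key is taken from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()
# "paths" can mix individual files and directories
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]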
In a pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
document_store = InMemoryDocumentStore()
# Indexing pipeline: convert the files, then write the resulting documents to the store
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
# The converter's "documents" output feeds the writer
indexing.connect("converter", "writer")
indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})