UnstructuredFileConverter
Use this component to convert text files, and directories of files, into Documents.
Name | UnstructuredFileConverter |
Path | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured |
Position in a Pipeline | Before PreProcessors, or right at the beginning of an indexing Pipeline |
Mandatory Inputs | “paths”: a list of file or directory paths, given as strings or path objects |
Outputs | “documents”: a list of Documents |
Overview
UnstructuredFileConverter converts files and directories into Documents using the Unstructured API.
Unstructured provides a series of tools for ETL (extract, transform, load) in LLM applications. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a wide range of file formats.
Usage
If you plan to use the hosted version of the Unstructured API, set your Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:
export UNSTRUCTURED_API_KEY=your_api_key
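A missing key is easier to diagnose before the converter is created than after a request fails. The following is a minimal sketch of such a check using only the standard library. The helper name require_unstructured_api_key is hypothetical and is not part of Haystack or the Unstructured integration.

```python
import os


def require_unstructured_api_key(env=None):
    """Return the Unstructured API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of Haystack.
    `env` defaults to the process environment and can be replaced in tests.
    """
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value the same as an unset variable.
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before using the hosted API."
        )
    return key
```

Calling require_unstructured_api_key() at the start of a script makes it stop early with a readable message if the export step was skipped.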
On its own
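from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# The API key is read from the UNSTRUCTURED_API_KEY environment variable.
converter = UnstructuredFileConverter()
# Paths can point to individual files or to directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]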
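The paths input mixes individual files with directories, so it can help to preview which files a given list covers before sending anything to the API. The sketch below is an illustration only and is not the converter's own implementation. The function name expand_paths is hypothetical, and the choices to list only the direct children of a directory and to skip missing paths are assumptions made for this example.

```python
from pathlib import Path


def expand_paths(paths):
    """Flatten a mix of file and directory paths into a sorted list of files.

    Hypothetical helper for illustration; the converter may resolve
    directories differently.
    """
    files = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Assumption: take only direct children, without recursing.
            files.update(child for child in path.iterdir() if child.is_file())
        elif path.is_file():
            files.add(path)
        # Assumption: paths that do not exist are skipped silently.
    # A set removes duplicates; sorting gives a stable order.
    return sorted(files)
```

For example, expand_paths(["a/file/path.pdf", "a/directory/path"]) returns the PDF (if it exists) together with every file directly inside the directory.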
In a Pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Build an indexing Pipeline: convert the files, then write the Documents to the store.
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
# Send the converter's Documents to the writer.
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})