UnstructuredFileConverter
Use this component to convert files and directories into documents.
Name | UnstructuredFileConverter |
Source | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured |
Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
Mandatory input variables | “paths”: A list of file or directory paths (strings or path-like objects) |
Output variables | “documents”: A list of documents |
Overview
UnstructuredFileConverter converts files and directories into documents using the Unstructured API.
Unstructured provides a series of tools to do ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API that extracts text and other information from a vast range of file formats.
This converter supports different modes for creating documents from the elements returned by Unstructured:
"one-doc-per-file": One Haystack document per file. All elements are concatenated into one text field.
"one-doc-per-page": One Haystack document per page. All elements on a page are concatenated into one text field.
"one-doc-per-element": One Haystack document per element. Each element is converted to a Haystack document.
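To make the difference between the three modes concrete, here is a minimal sketch in plain Python. It does not call Unstructured or Haystack: the `elements` list, its `page_number` field, the `group_elements` helper, and the `separator` argument are all hypothetical stand-ins chosen for illustration, not the component's actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-ins for the elements Unstructured might return for one file.
elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]


def group_elements(elements, mode, separator="\n\n"):
    """Return the text of each document that the given mode would produce."""
    if mode == "one-doc-per-file":
        # Everything in the file ends up in a single text field.
        return [separator.join(e["text"] for e in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, then each bucket is concatenated.
        pages = defaultdict(list)
        for e in elements:
            pages[e["page_number"]].append(e["text"])
        return [separator.join(texts) for _, texts in sorted(pages.items())]
    if mode == "one-doc-per-element":
        # Each element becomes its own document.
        return [e["text"] for e in elements]
    raise ValueError(f"Unknown mode: {mode}")


for mode in ("one-doc-per-file", "one-doc-per-page", "one-doc-per-element"):
    print(mode, len(group_elements(elements, mode)))
```

For this three-element, two-page sample, the modes yield one, two, and three documents respectively. In the integration itself the mode is chosen when the component is initialized; the parameter is assumed here to be called `document_creation_mode`, so check the source linked above for the exact name and default.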
Usage
There are paid and free versions of the Unstructured API: the Unstructured Serverless API and the Free Unstructured API.
For the Free Unstructured API, the API URL is https://api.unstructured.io/general/v0/general. For the Unstructured Serverless API, find your unique API URL in your Unstructured account.
Note that the API keys for the free and paid versions are not interchangeable.
If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:
export UNSTRUCTURED_API_KEY=your_api_key
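Because the free and paid keys are not interchangeable, it can help to fail early when the key is missing and to be explicit about which URL is in use. The sketch below is one way to do that with the standard library only; `resolve_unstructured_settings` and its `serverless_url` argument are hypothetical names introduced for illustration and are not part of the integration.

```python
import os

# URL of the Free Unstructured API, as given above.
FREE_API_URL = "https://api.unstructured.io/general/v0/general"


def resolve_unstructured_settings(env=os.environ, serverless_url=None):
    """Pick the API URL and confirm the key is present, failing early otherwise."""
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("UNSTRUCTURED_API_KEY is not set")
    # Use the account-specific Serverless URL when one is supplied,
    # otherwise fall back to the free endpoint.
    return {"api_url": serverless_url or FREE_API_URL, "has_key": True}
```

The returned `api_url` could then be passed to the converter's `api_url` argument, the same argument used in the Docker example on this page.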
On its own
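from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# The API key is read from the UNSTRUCTURED_API_KEY environment variable.
converter = UnstructuredFileConverter()
# Paths can point to individual files or to directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]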
In a pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Convert the files, then write the resulting documents to the store.
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})
With Docker
To use UnstructuredFileConverter through Docker, first set up an Unstructured Docker container:
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
When initializing the component, specify the localhost URL:
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")
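Before running the converter against a local container, it can be useful to confirm that something is actually listening on the mapped port. The following is a minimal sketch using only the standard library; `api_reachable` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the Unstructured API itself is healthy.

```python
import socket
from urllib.parse import urlsplit


def api_reachable(api_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(api_url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


LOCAL_API_URL = "http://localhost:8000/general/v0/general"
if not api_reachable(LOCAL_API_URL):
    print("Nothing is listening on port 8000; is the container running?")
```

A plain socket check is used here so that the sketch makes no assumptions about which HTTP endpoints the container exposes.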