UnstructuredFileConverter
Use this component to convert text files and directories into documents.
Most common position in a pipeline | Before PreProcessors or right at the beginning of an indexing pipeline |
Mandatory run variables | “paths”: A list of file or directory paths |
Output variables | “documents”: A list of documents |
API reference | Unstructured |
GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured |
Overview
UnstructuredFileConverter converts files and directories into documents using the Unstructured API.
Unstructured provides a series of tools to do ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a vast range of file formats.
This Converter supports different modes for creating documents from the elements returned by Unstructured:
- "one-doc-per-file": One Haystack document per file. All elements are concatenated into one text field.
- "one-doc-per-page": One Haystack document per page. All elements on a page are concatenated into one text field.
- "one-doc-per-element": One Haystack document per element. Each element is converted to a Haystack document.
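To make the difference between the three modes concrete, here is a minimal Python sketch of the grouping logic. It is not the integration's actual implementation: the Element class, the group_elements function, and the "\n\n" separator are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for one element returned by Unstructured."""
    text: str
    page_number: Optional[int] = None


def group_elements(elements: List[Element], mode: str, separator: str = "\n\n") -> List[str]:
    """Return the text of each document the given mode would produce (illustrative only)."""
    if mode == "one-doc-per-file":
        # All elements of the file end up in a single text field.
        return [separator.join(e.text for e in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, keeping the order in which pages first appear.
        pages: Dict[Optional[int], List[str]] = defaultdict(list)
        for e in elements:
            pages[e.page_number].append(e.text)
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # Every element becomes its own document.
        return [e.text for e in elements]
    raise ValueError(f"Unknown mode: {mode}")


# Sample elements from an assumed two-page file.
elements = [
    Element("Title", page_number=1),
    Element("First paragraph.", page_number=1),
    Element("Second page text.", page_number=2),
]

per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")
print(len(per_file), len(per_page), len(per_element))
```

With the sample elements above, the three modes yield one, two, and three documents respectively.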
Usage
Install the Unstructured integration to use the UnstructuredFileConverter component:
pip install unstructured-fileconverter-haystack
There are free and paid versions of the Unstructured API: the Free Unstructured API and the Unstructured Serverless API.
- Free Unstructured API:
  - API URL: https://api.unstructured.io/general/v0/general
  - This version is free, but comes with certain limitations.
- Unstructured Serverless API:
  - You'll find your unique API URL in your Unstructured account after signing up for the paid version.
  - This is a full-tier paid version of Unstructured.
For more details about the two tiers, refer to the Unstructured FAQ.
The API keys for the free and paid versions are different and cannot be used interchangeably.
Regardless of the chosen tier, we recommend setting the Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:
export UNSTRUCTURED_API_KEY=your_api_key
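It can be convenient to check for the key up front, before building the converter. The sketch below reads UNSTRUCTURED_API_KEY from an environment mapping and falls back to the free API URL when no URL is given. The helper name resolve_unstructured_settings and the returned dictionary layout are assumptions for illustration, not part of the integration.

```python
import os
from typing import Dict, Mapping, Optional

# Free-tier endpoint as listed above; paid accounts get their own URL.
FREE_API_URL = "https://api.unstructured.io/general/v0/general"


def resolve_unstructured_settings(
    env: Mapping[str, str], api_url: Optional[str] = None
) -> Dict[str, str]:
    """Hypothetical helper: fail early if the API key is missing."""
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before using a hosted Unstructured API."
        )
    # Use the caller's URL (for example a Serverless API URL) or default to the free tier.
    return {"api_url": api_url or FREE_API_URL, "api_key": api_key}


# Demonstration with a fake environment; in real use, pass os.environ instead.
settings = resolve_unstructured_settings({"UNSTRUCTURED_API_KEY": "your_api_key"})
print(settings["api_url"])
```

Passing os.environ as the first argument applies the same check to the real environment. Because the free and paid keys are not interchangeable, the URL passed in should match the tier the key belongs to.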
On its own
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# The API key is read from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()
# paths can mix individual files and directories
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
In a pipeline
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert files to documents, then write them to the store
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})
With Docker
To use UnstructuredFileConverter through Docker, first set up an Unstructured Docker container:
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
When initializing the component, specify the localhost URL:
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")