UnstructuredFileConverter

Use this component to convert files and directories into documents.

Name: UnstructuredFileConverter
Source: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured
Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory input variables: “paths”: A union of lists of paths
Output variables: “documents”: A list of documents

Overview

UnstructuredFileConverter converts files and directories into documents using the Unstructured API.

Unstructured provides a set of tools for ETL (extract, transform, load) for LLMs. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a wide range of file formats.

This Converter supports different modes for creating documents from the elements returned by Unstructured:

  • "one-doc-per-file": One Haystack document per file. All elements are concatenated into one text field.
  • "one-doc-per-page": One Haystack document per page. All elements on a page are concatenated into one text field.
  • "one-doc-per-element": One Haystack document per element. Each element is converted to a Haystack document.
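
To make the difference between the three modes concrete, here is a minimal pure-Python sketch of how a list of extracted elements could be grouped into document texts. It is not the converter's actual implementation: the `group_elements` helper, the dictionary layout of an element (`text`, `page_number`), and the `separator` default are all assumptions made for illustration.

```python
from collections import defaultdict


def group_elements(elements, mode="one-doc-per-file", separator="\n\n"):
    """Group the elements of ONE file into document texts.

    `elements` is a hypothetical list of dicts with "text" and
    "page_number" keys; real Unstructured elements carry more metadata.
    """
    if mode == "one-doc-per-file":
        # Everything from the file ends up in a single text field.
        return [separator.join(el["text"] for el in elements)]
    if mode == "one-doc-per-page":
        # Collect element texts per page, then join each page separately.
        pages = defaultdict(list)
        for el in elements:
            pages[el.get("page_number", 1)].append(el["text"])
        return [separator.join(texts) for _, texts in sorted(pages.items())]
    if mode == "one-doc-per-element":
        # Each element becomes its own document.
        return [el["text"] for el in elements]
    raise ValueError(f"Unknown mode: {mode}")


elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]
per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")
print(len(per_file), len(per_page), len(per_element))  # prints: 1 2 3
```

For the same three elements, the file-level mode yields one document, the page-level mode two, and the element-level mode three. Finer-grained modes produce more and shorter documents.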

Usage

The Unstructured API comes in a paid and a free version: the Unstructured Serverless API and the Free Unstructured API.

For the Free Unstructured API, the API URL is https://api.unstructured.io/general/v0/general. For the Unstructured Serverless API, find your unique API URL in your Unstructured account.

Note that the API keys for free and paid versions are not interchangeable.

If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as an environment variable UNSTRUCTURED_API_KEY:

export UNSTRUCTURED_API_KEY=your_api_key
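
Before creating the converter, it can help to check that the key is actually visible to your Python process. The sketch below is one possible way to do that. The `unstructured_settings` helper is hypothetical and not part of the integration, and treating any URL that contains `localhost` or `127.0.0.1` as a self-hosted server that needs no key is an assumption.

```python
import os

# URL of the Free Unstructured API, as given above.
FREE_API_URL = "https://api.unstructured.io/general/v0/general"


def unstructured_settings(env=os.environ, api_url=None):
    """Return (api_url, api_key); fail early if a hosted API has no key.

    Hypothetical helper for illustration only.
    """
    url = api_url or FREE_API_URL
    key = env.get("UNSTRUCTURED_API_KEY")
    # Assumption: a local server (for example, in Docker) needs no key.
    hosted = "localhost" not in url and "127.0.0.1" not in url
    if hosted and not key:
        raise RuntimeError(
            "Set UNSTRUCTURED_API_KEY before using the hosted Unstructured API."
        )
    return url, key
```

Calling `unstructured_settings()` with no arguments reads the real environment and raises a `RuntimeError` if the variable was not exported in the current shell.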

On its own

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# Reads the API key from the UNSTRUCTURED_API_KEY environment variable
converter = UnstructuredFileConverter()

# "paths" can mix individual files and directories
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
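
Because `paths` may mix files and directories, you may want to preview which files a call would cover before sending anything to the API. The following sketch uses only `pathlib`. The `expand_paths` helper is hypothetical, and the choice to list only the direct children of a directory (no recursion) and to skip missing paths silently is an assumption about the desired behavior, not a description of what the converter does internally.

```python
from pathlib import Path


def expand_paths(paths):
    """Return a sorted list of files covered by `paths`.

    Files are kept as they are; directories contribute their direct
    child files only (assumption: no recursion into subdirectories).
    Paths that do not exist are skipped.
    """
    files = set()
    for raw in paths:
        p = Path(raw)
        if p.is_file():
            files.add(p)
        elif p.is_dir():
            files.update(child for child in p.iterdir() if child.is_file())
    return sorted(files)
```

For example, `expand_paths(["a/file/path.pdf", "a/directory/path"])` returns the PDF (if it exists) plus every file located directly inside the directory, with duplicates removed.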

In a pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})

With Docker

To use UnstructuredFileConverter with Docker, first set up an Unstructured Docker container:

docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

When initializing the component, specify the localhost URL:

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")

Related Links

Check out the API reference in the GitHub repo or in our docs: