
UnstructuredFileConverter

Use this component to convert files and directories of files into a list of documents.

Most common position in a pipeline: Before PreProcessors or right at the beginning of an indexing pipeline
Mandatory run variables: "paths": A list of file or directory paths (strings or path objects)
Output variables: "documents": A list of documents
API reference: Unstructured
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured

Overview

UnstructuredFileConverter converts files and directories into documents using the Unstructured API.

Unstructured provides a series of tools to do ETL for LLMs. The UnstructuredFileConverter calls the Unstructured API, which extracts text and other information from a wide range of file formats.

This converter supports different modes for creating documents from the elements returned by Unstructured:

  • "one-doc-per-file": One Haystack document per file. All elements are concatenated into one text field.
  • "one-doc-per-page": One Haystack document per page. All elements on a page are concatenated into one text field.
  • "one-doc-per-element": One Haystack document per element. Each element is converted to a Haystack document.
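To make the difference between the three modes concrete, here is a minimal, self-contained sketch of how a list of elements could be grouped under each mode. It is an illustration only, not the converter's actual implementation: the `group_elements` helper, the dictionary-shaped elements, and the `page_number` key are assumptions made for this example. In the integration itself, the mode is usually selected through the `document_creation_mode` init parameter; check the API reference for the exact signature.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical element records; the real API returns richer objects.
Element = Dict[str, object]


def group_elements(elements: List[Element], mode: str, separator: str = "\n\n") -> List[str]:
    """Return the text of each document that a given mode would yield for one file."""
    if mode == "one-doc-per-file":
        # Everything in the file ends up in a single text field.
        return [separator.join(str(el["text"]) for el in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, then concatenated per page.
        pages: Dict[int, List[str]] = defaultdict(list)
        for el in elements:
            pages[int(el.get("page_number", 1))].append(str(el["text"]))
        return [separator.join(texts) for _, texts in sorted(pages.items())]
    if mode == "one-doc-per-element":
        # Each element becomes its own document.
        return [str(el["text"]) for el in elements]
    raise ValueError(f"Unknown mode: {mode}")


elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]
per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")
print(len(per_file), len(per_page), len(per_element))  # 1 2 3
```

Finer-grained modes produce more, smaller documents, which can help retrieval but loses the surrounding context of each element.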

Usage

Install the Unstructured integration to use the UnstructuredFileConverter component:

pip install unstructured-fileconverter-haystack

There are free and paid versions of the Unstructured API: the Free Unstructured API and the Unstructured Serverless API.

  1. Free Unstructured API:
    • API URL: https://api.unstructured.io/general/v0/general
    • This version is free, but comes with certain limitations.
  2. Unstructured Serverless API:
    • You'll find your unique API URL in your Unstructured account after signing up for the paid version.
    • This is a full-tier paid version of Unstructured.

For more details about the two tiers, refer to the Unstructured FAQ.

❗️ The API keys for the free and paid versions are different and cannot be used interchangeably.

Regardless of the chosen tier, we recommend setting the Unstructured API key as the environment variable UNSTRUCTURED_API_KEY:

export UNSTRUCTURED_API_KEY=your_api_key
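Before creating the converter, it can help to fail early with a clear message if the variable is missing. The following is a small sketch under that assumption; the `read_api_key` helper is hypothetical and not part of the integration, which reads the variable on its own.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "UNSTRUCTURED_API_KEY"


def read_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before creating the converter.")
    return key


# Passing a mapping explicitly keeps the example independent of the real environment.
print(read_api_key({"UNSTRUCTURED_API_KEY": "your_api_key"}))  # your_api_key
```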

On its own

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# The API key is read from the UNSTRUCTURED_API_KEY environment variable.
converter = UnstructuredFileConverter()
# `paths` can mix individual files and directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
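Because `paths` can contain both files and directories, it is sometimes useful to preview which files such a list refers to before sending anything to the API. The sketch below does this with `pathlib`; the `expand_paths` helper is hypothetical and only approximates what a converter might do (it looks one level deep into directories and silently skips paths that do not exist), so treat it as an illustration rather than the component's own logic.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List, Union


def expand_paths(paths: Iterable[Union[str, os.PathLike]]) -> List[Path]:
    """Resolve a mixed list of files and directories into a sorted list of files."""
    files = set()
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            files.add(path)
        elif path.is_dir():
            # Only direct children are considered; subdirectories are not walked.
            files.update(p for p in path.iterdir() if p.is_file())
    return sorted(files)


# Demonstration on a temporary directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "a.txt").write_text("a")
    (root / "single.pdf").write_text("b")
    found = expand_paths([root / "single.pdf", root / "docs", root / "missing"])
    print([p.name for p in found])  # ['a.txt', 'single.pdf']
```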

In a pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

document_store = InMemoryDocumentStore()

# Convert files with Unstructured, then write the resulting documents to the store.
indexing = Pipeline()
indexing.add_component("converter", UnstructuredFileConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

# `paths` can mix individual files and directories.
indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})

With Docker

To use UnstructuredFileConverter with an Unstructured API running locally in Docker, first start the container:

docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

When initializing the component, specify the localhost URL:

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")
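If the container is mapped to a different host port, the URL has to change accordingly. As a small convenience sketch, the hypothetical `local_api_url` helper below builds the URL from a host and port, assuming the same `/general/v0/general` path used in the example above.

```python
from urllib.parse import urlunsplit

# Path taken from the URLs shown in this page.
API_PATH = "/general/v0/general"


def local_api_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of a locally running Unstructured API container."""
    return urlunsplit(("http", f"{host}:{port}", API_PATH, "", ""))


print(local_api_url())  # http://localhost:8000/general/v0/general
```

The result can then be passed as `api_url` when initializing the component.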

Related Links

Check out the API reference in the GitHub repo or in our docs: