
Unstructured integration for Haystack

Module haystack_integrations.components.converters.unstructured.converter

UnstructuredFileConverter

A component for converting files to Haystack Documents using the Unstructured API (hosted or running locally).

For the supported file types and the available API parameters, see the Unstructured documentation.

Usage example:

from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# Either set the environment variable UNSTRUCTURED_API_KEY (hosted API),
# or run the Unstructured API locally and pass its URL as api_url:
# docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest \
#   --port 8000 --host 0.0.0.0

converter = UnstructuredFileConverter()
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]

UnstructuredFileConverter.__init__

def __init__(api_url: str = UNSTRUCTURED_HOSTED_API_URL,
             api_key: Optional[Secret] = Secret.from_env_var(
                 "UNSTRUCTURED_API_KEY", strict=False),
             document_creation_mode: Literal[
                 "one-doc-per-file", "one-doc-per-page",
                 "one-doc-per-element"] = "one-doc-per-file",
             separator: str = "\n\n",
             unstructured_kwargs: Optional[Dict[str, Any]] = None,
             progress_bar: bool = True)

Arguments:

  • api_url: URL of the Unstructured API. Defaults to the URL of the hosted version. If you run the API locally, specify the URL of your local API (e.g. "http://localhost:8000/general/v0/general").
  • api_key: API key for the Unstructured API. It can be passed explicitly or read from the environment variable UNSTRUCTURED_API_KEY (recommended). It is not needed if you run the API locally.
  • document_creation_mode: How to create Haystack Documents from the elements returned by Unstructured.
      • "one-doc-per-file": One Haystack Document per file. All elements are concatenated into one text field.
      • "one-doc-per-page": One Haystack Document per page. All elements on a page are concatenated into one text field.
      • "one-doc-per-element": One Haystack Document per element. Each element is converted to a Haystack Document.
  • separator: Separator between elements when concatenating them into one text field.
  • unstructured_kwargs: Additional parameters that are passed to the Unstructured API. For the available parameters, see Unstructured API docs.
  • progress_bar: Whether to show a progress bar during the conversion.
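
The three document_creation_mode values and the separator argument can be illustrated with a small pure-Python sketch. It is not the component's actual code: the element shape (a dict with "text" and "page_number" keys) and the helper name group_elements are assumptions made for this example only.

```python
from typing import Any, Dict, List

# Hypothetical element shape for illustration: {"text": ..., "page_number": ...}
Element = Dict[str, Any]


def group_elements(elements: List[Element], mode: str, separator: str = "\n\n") -> List[str]:
    """Return the text of each document that `mode` would yield for one file."""
    if mode == "one-doc-per-file":
        # All elements of the file are joined into a single text.
        return [separator.join(e["text"] for e in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, then joined per page.
        pages: Dict[int, List[str]] = {}
        for e in elements:
            pages.setdefault(e["page_number"], []).append(e["text"])
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # Every element becomes its own document; the separator is unused.
        return [e["text"] for e in elements]
    raise ValueError(f"Unknown document_creation_mode: {mode}")


elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]

per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")
```

In this sketch the same three elements yield one, two, and three texts respectively, which is the practical trade-off between the modes: fewer, longer Documents versus more, finer-grained ones.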

UnstructuredFileConverter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

UnstructuredFileConverter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "UnstructuredFileConverter"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.
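
The to_dict / from_dict pair is meant to round-trip a component's configuration. The sketch below shows that pattern with a hypothetical stand-in class, ConverterSketch; it does not import the real component, and the "type" / "init_parameters" layout of the dictionary is an assumption for illustration (the handling of api_key as a Secret is left out entirely).

```python
from typing import Any, Dict, Optional


class ConverterSketch:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(
        self,
        api_url: str,
        document_creation_mode: str = "one-doc-per-file",
        separator: str = "\n\n",
        unstructured_kwargs: Optional[Dict[str, Any]] = None,
        progress_bar: bool = True,
    ):
        self.api_url = api_url
        self.document_creation_mode = document_creation_mode
        self.separator = separator
        self.unstructured_kwargs = unstructured_kwargs or {}
        self.progress_bar = progress_bar

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type identifier plus the init arguments.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "api_url": self.api_url,
                "document_creation_mode": self.document_creation_mode,
                "separator": self.separator,
                "unstructured_kwargs": self.unstructured_kwargs,
                "progress_bar": self.progress_bar,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterSketch":
        # Rebuild the object from the stored init arguments.
        return cls(**data["init_parameters"])


original = ConverterSketch(
    api_url="http://localhost:8000/general/v0/general",
    document_creation_mode="one-doc-per-page",
)
restored = ConverterSketch.from_dict(original.to_dict())
```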

UnstructuredFileConverter.run

@component.output_types(documents=List[Document])
def run(paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Convert files to Haystack Documents using the Unstructured API.

Arguments:

  • paths: List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
  • meta: Optional metadata to attach to the Documents. This value can be either a single dictionary or a list of dictionaries. If it is a single dictionary, its content is added to the metadata of all produced Documents. If it is a list, its length must match the number of paths, because the two lists are zipped. If paths contains directories, meta can only be a single dictionary (the same metadata for all files).

Raises:

  • ValueError: If meta is a list and paths contains directories.

Returns:

A dictionary with the following key:

  • documents: List of Haystack Documents.
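
The rules for paths and meta described above can be sketched in plain Python. This is an illustration of the documented behavior, not the component's implementation: the helper names expand_paths and normalize_meta are hypothetical, and sorting the directory listing is a choice made here for deterministic output.

```python
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

PathInput = Union[str, os.PathLike]
Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def expand_paths(paths: List[PathInput]) -> List[Path]:
    """Replace each directory by the files directly inside it (no recursion)."""
    files: List[Path] = []
    for p in map(Path, paths):
        if p.is_dir():
            # Subdirectories are ignored; sorting is an assumption of this sketch.
            files.extend(sorted(c for c in p.iterdir() if c.is_file()))
        else:
            files.append(p)
    return files


def normalize_meta(paths: List[PathInput], meta: Meta) -> List[Dict[str, Any]]:
    """Return one metadata dict per file, following the documented rules."""
    if isinstance(meta, list):
        if any(Path(p).is_dir() for p in paths):
            raise ValueError("meta must be a single dictionary when paths contains directories")
        if len(meta) != len(paths):
            raise ValueError("the length of meta must match the number of paths")
        return meta
    # A single dictionary (or None) applies to every file.
    return [dict(meta or {}) for _ in expand_paths(paths)]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a")
    (root / "b.txt").write_text("b")
    (root / "sub").mkdir()
    (root / "sub" / "c.txt").write_text("c")

    found = [p.name for p in expand_paths([root])]
    shared = normalize_meta([root], {"source": "demo"})
    try:
        normalize_meta([root], [{"source": "demo"}])
        list_with_dir_rejected = False
    except ValueError:
        list_with_dir_rejected = True
```

Here the file inside sub is skipped, the single dictionary is copied to both files, and combining a meta list with a directory path fails, mirroring the ValueError listed under Raises.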