Unstructured integration for Haystack
Module haystack_integrations.components.converters.unstructured.converter
UnstructuredFileConverter
A component for converting files to Haystack Documents using the Unstructured API (hosted or running locally).
For the supported file types and the specific API parameters, see Unstructured docs.
Usage example:
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
# make sure to either set the environment variable UNSTRUCTURED_API_KEY
# or run the Unstructured API locally:
# docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest
# --port 8000 --host 0.0.0.0
converter = UnstructuredFileConverter()
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
UnstructuredFileConverter.__init__
def __init__(api_url: str = UNSTRUCTURED_HOSTED_API_URL,
             api_key: Optional[Secret] = Secret.from_env_var(
                 "UNSTRUCTURED_API_KEY", strict=False),
             document_creation_mode: Literal[
                 "one-doc-per-file", "one-doc-per-page",
                 "one-doc-per-element"] = "one-doc-per-file",
             separator: str = "\n\n",
             unstructured_kwargs: Optional[Dict[str, Any]] = None,
             progress_bar: bool = True)
Arguments:
api_url
: URL of the Unstructured API. Defaults to the URL of the hosted version. If you run the API locally, specify the URL of your local API (e.g. "http://localhost:8000/general/v0/general").
api_key
: API key for the Unstructured API. It can be passed explicitly or read from the environment variable UNSTRUCTURED_API_KEY (recommended). If you run the API locally, it is not needed.
document_creation_mode
: How to create Haystack Documents from the elements returned by Unstructured.
  "one-doc-per-file"
  : One Haystack Document per file. All elements are concatenated into one text field.
  "one-doc-per-page"
  : One Haystack Document per page. All elements on a page are concatenated into one text field.
  "one-doc-per-element"
  : One Haystack Document per element. Each element is converted to a Haystack Document.
separator
: Separator between elements when concatenating them into one text field.
unstructured_kwargs
: Additional parameters passed to the Unstructured API. For the available parameters, see the Unstructured API docs.
progress_bar
: Whether to show a progress bar during the conversion.
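To make the three document_creation_mode values concrete, here is a minimal pure-Python sketch of the grouping they describe. It is not the converter's implementation: the element shape (a dict with "text" and "page_number" keys) and the helper name group_elements are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_elements(elements: List[Dict[str, Any]], mode: str, separator: str = "\n\n") -> List[str]:
    """Return the text of each document that the given mode would produce."""
    if mode == "one-doc-per-file":
        # All elements of the file are joined into a single text.
        return [separator.join(el["text"] for el in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, then each page is joined into one text.
        pages: Dict[int, List[str]] = defaultdict(list)
        for el in elements:
            pages[el.get("page_number", 1)].append(el["text"])
        return [separator.join(texts) for _, texts in sorted(pages.items())]
    if mode == "one-doc-per-element":
        # Each element becomes its own document; the separator is unused.
        return [el["text"] for el in elements]
    raise ValueError(f"Unknown document_creation_mode: {mode}")


# Hypothetical elements, as if extracted from a two-page file.
elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]

per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")
print(len(per_file), len(per_page), len(per_element))  # 1 2 3
```

Note that separator only matters for the first two modes, where several elements share one text field.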
UnstructuredFileConverter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
UnstructuredFileConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "UnstructuredFileConverter"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
UnstructuredFileConverter.run
@component.output_types(documents=List[Document])
def run(paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Convert files to Haystack Documents using the Unstructured API.
Arguments:
paths
: List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
meta
: Optional metadata to attach to the Documents. This value can be either a single dictionary or a list of dictionaries. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of paths, because the two lists are zipped. If paths contains directories, meta can only be a single dictionary (same metadata for all files).
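The directory handling described for paths can be pictured with the following sketch. It is a hypothetical helper (expand_paths is not part of the integration) that assumes "all files in the directory" means the direct children only, which matches the statement that subdirectories are ignored; the converter's own file matching may differ in detail.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Union


def expand_paths(paths: List[Union[str, os.PathLike]]) -> List[Path]:
    """Expand directories to their direct child files; keep file paths as given."""
    expanded: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Only direct children that are files; subdirectories are not descended into.
            expanded.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            expanded.append(path)
    return expanded


# Build a throwaway directory tree to demonstrate the behavior.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a")
    (root / "b.txt").write_text("b")
    (root / "nested").mkdir()
    (root / "nested" / "c.txt").write_text("c")

    found = [p.name for p in expand_paths([root])]

print(found)  # ['a.txt', 'b.txt']
```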
Raises:
ValueError
: If meta is a list and paths contains directories.
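The rules for meta can be summarized in a short sketch. The helper pair_paths_with_meta is hypothetical and only mirrors the documented behavior: a single dictionary is shared by all paths, a list is zipped with the paths, and a list combined with a directory raises ValueError. Raising ValueError on a length mismatch is an assumption of this sketch, not something the reference above states.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union

Meta = Dict[str, Any]


def pair_paths_with_meta(
    paths: List[Path], meta: Optional[Union[Meta, List[Meta]]] = None
) -> List[Tuple[Path, Meta]]:
    """Pair each path with its metadata following the documented rules."""
    if meta is None:
        return [(path, {}) for path in paths]
    if isinstance(meta, dict):
        # A single dictionary is copied for every path.
        return [(path, dict(meta)) for path in paths]
    # From here on, meta is a list, so directories are not allowed.
    if any(path.is_dir() for path in paths):
        raise ValueError("meta must be a single dictionary when paths contains directories.")
    if len(meta) != len(paths):
        # Assumption of this sketch: a mismatch is reported as ValueError.
        raise ValueError("The length of meta must match the number of paths.")
    return list(zip(paths, meta))


# Hypothetical file paths; they do not need to exist for this demonstration.
files = [Path("report.pdf"), Path("notes.txt")]
shared = pair_paths_with_meta(files, {"source": "demo"})
zipped = pair_paths_with_meta(files, [{"lang": "en"}, {"lang": "de"}])

# A list of metadata together with a directory is rejected.
with tempfile.TemporaryDirectory() as tmp:
    try:
        pair_paths_with_meta([Path(tmp)], [{"lang": "en"}])
        raised = False
    except ValueError:
        raised = True

print(shared[1][1], zipped[1][1], raised)  # {'source': 'demo'} {'lang': 'de'} True
```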
Returns:
A dictionary with the following key:
documents
: List of Haystack Documents.