Various converters to transform data from one format to another.

Module azure

AzureOCRDocumentConverter

@component
class AzureOCRDocumentConverter()

A component for converting files to Documents using Azure's Document Intelligence service. Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. Follow the steps described in the Azure documentation to set up your resource.

Usage example:

from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/document_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

AzureOCRDocumentConverter.__init__

def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read")

Create an AzureOCRDocumentConverter component.

Arguments:

  • endpoint: The endpoint of your Azure resource.
  • api_key: The key of your Azure resource.
  • model_id: The model ID of the model you want to use. Please refer to Azure documentation for a list of available models. Default: "prebuilt-read".

AzureOCRDocumentConverter.run

@component.output_types(documents=List[Document],
                        raw_azure_response=List[Dict])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[List[Dict[str, Any]]] = None)

Convert a list of files to Documents using Azure's Document Intelligence service.

Arguments:

  • sources: List of file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

  • documents: List of created Documents
  • raw_azure_response: List of raw Azure responses used to create the Documents

AzureOCRDocumentConverter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

AzureOCRDocumentConverter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOCRDocumentConverter"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.
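
The to_dict and from_dict methods are meant to round-trip a component's configuration. The following is a minimal, library-free sketch of that pattern. ToyConverter is a hypothetical stand-in, and the "type" / "init_parameters" layout of the dictionary is an assumption made for illustration; the actual AzureOCRDocumentConverter may store more (for example, how the API key secret is referenced).

```python
from typing import Any, Dict


class ToyConverter:
    """Hypothetical stand-in for a component; not part of Haystack."""

    def __init__(self, endpoint: str, model_id: str = "prebuilt-read"):
        self.endpoint = endpoint
        self.model_id = model_id

    def to_dict(self) -> Dict[str, Any]:
        # Store the class path plus the arguments needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"endpoint": self.endpoint, "model_id": self.model_id},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyConverter":
        # Rebuild the object from the stored init parameters.
        return cls(**data["init_parameters"])


original = ToyConverter(endpoint="<url>")
restored = ToyConverter.from_dict(original.to_dict())
```

Rebuilding from the init parameters only works when every piece of state can be derived from the constructor arguments, which is why the sketch stores nothing else.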

Module html

HTMLToDocument

@component
class HTMLToDocument()

Converts an HTML file to a Document.

Usage example:

from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'

HTMLToDocument.__init__

def __init__(extractor_type: Literal[
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "LargestContentExtractor",
    "CanolaExtractor",
    "KeepEverythingExtractor",
    "NumWordsRulesExtractor",
] = "DefaultExtractor")

Create an HTMLToDocument component.

Arguments:

  • extractor_type: Name of the extractor class to use. Defaults to DefaultExtractor. For more information on the different types of extractors, see the boilerpy3 documentation.

HTMLToDocument.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HTMLToDocument.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HTMLToDocument"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

HTMLToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts a list of HTML files to Documents.

Arguments:

  • sources: List of HTML file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
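
The meta handling described above (one dictionary shared by all sources, or one dictionary per source) can be sketched as follows. normalize_meta is a hypothetical helper written for this illustration, not the function the converters actually call; it only mirrors the behavior stated in the argument description.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]],
    sources_count: int,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so every source gets the same content.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return list(meta)


sources = ["a.html", "b.html"]
shared = normalize_meta({"lang": "en"}, len(sources))
per_source = normalize_meta([{"page": 1}, {"page": 2}], len(sources))
# Sources and metadata are paired positionally, hence the length check above.
paired = list(zip(sources, per_source))
```

Because zip silently stops at the shorter input, checking the lengths up front is what prevents a source from losing its metadata without an error.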

Returns:

A dictionary with the following keys:

  • documents: Created Documents

Module markdown

MarkdownToDocument

@component
class MarkdownToDocument()

Converts a Markdown file into a text Document.

Usage example:

from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'

MarkdownToDocument.__init__

def __init__(table_to_single_line: bool = False, progress_bar: bool = True)

Create a MarkdownToDocument component.

Arguments:

  • table_to_single_line: If True, converts table contents into a single line.
  • progress_bar: If True, shows a progress bar when running.

MarkdownToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts a list of Markdown files to Documents.

Arguments:

  • sources: List of file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

  • documents: List of created Documents

Module pypdf

PyPDFConverter

class PyPDFConverter(Protocol)

A protocol that defines a converter which takes a PdfReader object and converts it into a Document object.

DefaultConverter

class DefaultConverter()

The default converter class that extracts text from a PdfReader object's pages and returns a Document.

DefaultConverter.convert

def convert(reader: "PdfReader") -> Document

Extract text from the PDF and return a Document object with the text content.
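
To illustrate what conforming to such a protocol means, here is a small sketch using typing.Protocol. All names in it (DocumentStandIn, PageStandIn, ReaderStandIn, ConverterProtocol, UppercaseConverter) are hypothetical stand-ins so the example runs without pypdf or Haystack; only the general shape, a convert method that takes a reader and returns a document, comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class DocumentStandIn:
    """Stand-in for a Document (assumption for this sketch)."""
    content: str


class PageStandIn:
    """Stand-in for a PDF page exposing extract_text()."""

    def __init__(self, text: str):
        self._text = text

    def extract_text(self) -> str:
        return self._text


class ReaderStandIn:
    """Stand-in for a PdfReader exposing a list of pages."""

    def __init__(self, pages: List[PageStandIn]):
        self.pages = pages


@runtime_checkable
class ConverterProtocol(Protocol):
    def convert(self, reader: ReaderStandIn) -> DocumentStandIn: ...


class UppercaseConverter:
    """Conforms to the protocol structurally; no inheritance needed."""

    def convert(self, reader: ReaderStandIn) -> DocumentStandIn:
        # Join the text of all pages, then apply a custom transformation.
        text = "\n".join(page.extract_text() for page in reader.pages)
        return DocumentStandIn(content=text.upper())


reader = ReaderStandIn([PageStandIn("first page"), PageStandIn("second page")])
document = UppercaseConverter().convert(reader)
```

Any class with a matching convert method satisfies the protocol, so a custom converter does not need to subclass anything.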

PyPDFToDocument

@component
class PyPDFToDocument()

Converts PDF files to Documents.

Uses pypdf compatible converters to convert PDF files to Documents. A default text extraction converter is used if one is not provided.

Usage example:

from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

PyPDFToDocument.__init__

def __init__(converter_name: str = "default")

Create a PyPDFToDocument component.

Arguments:

  • converter_name: Name of the registered converter to use.

PyPDFToDocument.to_dict

def to_dict()

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PyPDFToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts PDF files to Documents.

Arguments:

  • sources: List of PDF file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

  • documents: Created Documents

Module tika

TikaDocumentConverter

@component
class TikaDocumentConverter()

Converts files of different types to Documents.

This component uses Apache Tika for parsing the files and, therefore, requires a running Tika server. For more options on running Tika, see the official documentation.

Usage example:

from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'

TikaDocumentConverter.__init__

def __init__(tika_url: str = "http://localhost:9998/tika")

Create a TikaDocumentConverter component.

Arguments:

  • tika_url: Tika server URL.

TikaDocumentConverter.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts files to Documents.

Arguments:

  • sources: List of file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

  • documents: Created Documents

Module txt

TextFileToDocument

@component
class TextFileToDocument()

Converts text files to Documents.

Usage example:

from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'

TextFileToDocument.__init__

def __init__(encoding: str = "utf-8")

Create a TextFileToDocument component.

Arguments:

  • encoding: The encoding of the text files. Note that if the encoding is specified in the metadata of a source ByteStream, it will override this value.
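
The precedence rule for the encoding can be sketched in plain Python. decode_source is a hypothetical helper, and the "encoding" metadata key is an assumption based on the wording above; the sketch only shows that a value found in the source's metadata wins over the component-level default.

```python
from typing import Any, Dict


def decode_source(data: bytes, meta: Dict[str, Any], default_encoding: str = "utf-8") -> str:
    """Hypothetical helper: prefer the encoding stored in the source's metadata."""
    encoding = meta.get("encoding", default_encoding)
    return data.decode(encoding)


latin1_bytes = "café".encode("latin-1")
# The metadata entry overrides the utf-8 default, so the bytes decode correctly.
from_meta = decode_source(latin1_bytes, {"encoding": "latin-1"})
# Without a metadata entry, the default encoding is used.
from_default = decode_source("café".encode("utf-8"), {})
```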

TextFileToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts text files to Documents.

Arguments:

  • sources: List of text file paths or ByteStream objects.
  • meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

  • documents: Created Documents

Module output_adapter

OutputAdaptationException

class OutputAdaptationException(Exception)

Exception raised when there is an error during output adaptation.

OutputAdapter

@component
class OutputAdapter()

Adapts output of a Component using Jinja templates.

Usage example:

from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"

OutputAdapter.__init__

def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: Optional[Dict[str, Callable]] = None)

Create an OutputAdapter component.

Arguments:

  • template: A Jinja template that defines how to adapt the input data. The variables in the template define the input of this instance. For example, with this template:
{{ documents[0].content }}

the component input will be documents.

  • output_type: The type of output this instance will return.
  • custom_filters: A dictionary of custom Jinja filters used in the template.

OutputAdapter.run

def run(**kwargs)

Renders the Jinja template with the provided inputs.

Arguments:

  • kwargs: Must contain all variables used in the template string.

Raises:

  • OutputAdaptationException: If template rendering fails.

Returns:

A dictionary with the following keys:

  • output: Rendered Jinja template.

OutputAdapter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OutputAdapter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OutputAdapter"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

Module openapi_functions

OpenAPIServiceToFunctions

@component
class OpenAPIServiceToFunctions()

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher. It can be specified in JSON or YAML format. Each function must have:

  • unique operationId
  • description
  • requestBody and/or parameters
  • schema for the requestBody and/or parameters

For more details on OpenAPI specification see the official documentation. For more details on OpenAI function calling see the official documentation.

Usage example:

from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]

OpenAPIServiceToFunctions.__init__

def __init__()

Create an OpenAPIServiceToFunctions component.

OpenAPIServiceToFunctions.run

@component.output_types(functions=List[Dict[str, Any]],
                        openapi_specs=List[Dict[str, Any]])
def run(sources: List[Union[str, Path, ByteStream]]) -> Dict[str, Any]

Converts OpenAPI definitions into the OpenAI function calling format.

Arguments:

  • sources: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

Raises:

  • RuntimeError: If the OpenAPI definitions cannot be downloaded or processed.
  • ValueError: If the source type is not recognized or no functions are found in the OpenAPI definitions.

Returns:

A dictionary with the following keys:

  • functions: Function definitions in JSON object format
  • openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references
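
As a rough illustration of the kind of mapping involved, the sketch below turns a tiny in-memory OpenAPI definition into function definitions with name, description, and parameters. operations_to_functions is a hypothetical, heavily simplified helper: it handles only operation-level parameters and ignores requestBody and reference resolution, so it should not be read as the component's implementation.

```python
from typing import Any, Dict, List

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def operations_to_functions(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Simplified sketch: build one function definition per described operation."""
    functions = []
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue
            name = operation.get("operationId")
            description = operation.get("description")
            if not name or not description:
                continue  # Skip operations that lack the required fields.
            properties, required = {}, []
            for param in operation.get("parameters", []):
                properties[param["name"]] = dict(param.get("schema", {}))
                if param.get("required"):
                    required.append(param["name"])
            functions.append({
                "name": name,
                "description": description,
                "parameters": {"type": "object", "properties": properties, "required": required},
            })
    return functions


spec = {
    "openapi": "3.0.0",
    "paths": {
        "/greet/{name}": {
            "get": {
                "operationId": "greet",
                "description": "Greet a person by name.",
                "parameters": [
                    {"name": "name", "in": "path", "required": True, "schema": {"type": "string"}}
                ],
            }
        }
    },
}
functions = operations_to_functions(spec)
```

The operationId becomes the function name, which is one reason each operation needs a unique operationId.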