Module azure

AzureOCRDocumentConverter

Convert files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. Follow the steps described in the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api) to set up your resource.

Usage example:

from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

AzureOCRDocumentConverter.__init__

def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: Optional[float] = 0.05)

Create an AzureOCRDocumentConverter component.

Arguments:

endpoint: The endpoint of your Azure resource.
api_key: The key of your Azure resource.
model_id: The model ID of the model you want to use. Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature) for a list of available models. Default: "prebuilt-read".
preceding_context_len: Number of lines before a table to extract as preceding context (will be returned as part of metadata).
following_context_len: Number of lines after a table to extract as subsequent context (will be returned as part of metadata).
merge_multiple_column_headers: Some tables contain more than one row as a column header (i.e., column description). This parameter lets you choose whether to merge multiple column header rows into a single row.
page_layout: The type of reading order to follow. If "natural" is chosen, the natural reading order determined by Azure is used. If "single_column" is chosen, all lines with the same height on the page are grouped together based on a threshold determined by threshold_y.
threshold_y: The threshold to determine if two recognized elements in a PDF should be grouped into a single line. This is especially relevant for section headers or numbers which may be spatially separated on the horizontal axis from the remaining text. The threshold is specified in units of inches. This is only relevant if "single_column" is chosen for page_layout.
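
The grouping that threshold_y controls can be pictured with a small, self-contained sketch. This is not the component's actual implementation: the `Element` tuple layout and the `group_into_lines` helper are hypothetical names introduced only to illustrate the idea of merging elements whose vertical positions differ by no more than the threshold.

```python
from typing import List, Tuple

# Hypothetical input format (an assumption for this sketch):
# (text, x, y) with x and y measured in inches from the top-left corner of the page.
Element = Tuple[str, float, float]


def group_into_lines(elements: List[Element], threshold_y: float = 0.05) -> List[str]:
    """Group elements whose y-coordinates lie within threshold_y of a line's first element."""
    lines: List[List[Element]] = []
    for element in sorted(elements, key=lambda e: e[2]):
        # Compare against the first element of the most recent line.
        if lines and abs(element[2] - lines[-1][0][2]) <= threshold_y:
            lines[-1].append(element)
        else:
            lines.append([element])
    # Within each line, read from left to right.
    return [" ".join(text for text, _, _ in sorted(line, key=lambda e: e[1])) for line in lines]


elements = [
    ("Introduction", 1.5, 1.02),  # header text, slightly lower than its number
    ("1.", 1.0, 1.00),            # section number
    ("This section describes the setup.", 1.0, 1.40),
]
lines = group_into_lines(elements, threshold_y=0.05)
print(lines)
```

With a smaller threshold, such as 0.01, the section number and its header would no longer be merged into one line.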

AzureOCRDocumentConverter.run

@component.output_types(documents=List[Document],
                        raw_azure_response=List[Dict])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[List[Dict[str, Any]]] = None)

Convert a list of files to Documents using Azure's Document Intelligence service.

Arguments:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: List of created Documents
raw_azure_response: List of raw Azure responses used to create the Documents

AzureOCRDocumentConverter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

AzureOCRDocumentConverter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOCRDocumentConverter"

Deserializes the component from a dictionary.

Arguments:

data: The dictionary to deserialize from.

Returns:

The deserialized component.

Module html

HTMLToDocument

Converts an HTML file to a Document.

Usage example:

from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'

HTMLToDocument.__init__

def __init__(extraction_kwargs: Optional[Dict[str, Any]] = None)

Create an HTMLToDocument component.

Arguments:

extraction_kwargs: A dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura extract function. For the full list of available arguments, see the Trafilatura documentation.

HTMLToDocument.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HTMLToDocument.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HTMLToDocument"

Deserializes the component from a dictionary.

Arguments:

data: The dictionary to deserialize from.

Returns:

The deserialized component.

HTMLToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
        extraction_kwargs: Optional[Dict[str, Any]] = None)

Converts a list of HTML files to Documents.

Arguments:

sources: List of HTML file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
extraction_kwargs: Additional keyword arguments to customize the extraction process.

Returns:

A dictionary with the following keys:

documents: Created Documents
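
Because extraction_kwargs can be given both at construction time and per call to run, it helps to be explicit about how the two could combine. The sketch below assumes that run-level values take precedence over init-level ones; that precedence is an assumption of this example, not something this reference states. The helper name `merge_extraction_kwargs` is hypothetical; `include_tables` and `include_links` are keyword arguments of Trafilatura's extract function.

```python
from typing import Any, Dict, Optional


def merge_extraction_kwargs(
    init_kwargs: Optional[Dict[str, Any]],
    run_kwargs: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Combine two kwargs dictionaries; keys in run_kwargs win (assumed precedence)."""
    return {**(init_kwargs or {}), **(run_kwargs or {})}


merged = merge_extraction_kwargs(
    {"include_tables": True, "include_links": False},  # set once on the component
    {"include_links": True},                           # overridden for a single run
)
print(merged)
```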

Module markdown

MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

from datetime import datetime

from haystack.components.converters import MarkdownToDocument

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'

MarkdownToDocument.__init__

def __init__(table_to_single_line: bool = False, progress_bar: bool = True)

Create a MarkdownToDocument component.

Arguments:

table_to_single_line: If True, converts table contents into a single line.
progress_bar: If True, shows a progress bar when running.

MarkdownToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts a list of Markdown files to Documents.

Arguments:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: List of created Documents
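
The rule for the meta argument (a single dictionary applies to every source, a list must match the sources one to one) can be sketched in plain Python. The `normalize_meta` function below is a hypothetical illustration of that rule, not the library's own helper.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(item) for item in meta]


sources = ["a.md", "b.md"]
shared = normalize_meta({"language": "en"}, len(sources))
per_source = normalize_meta([{"page": 1}, {"page": 2}], len(sources))
# The two lists are zipped, so each source is paired with its own metadata.
paired = list(zip(sources, per_source))
print(shared)
print(paired)
```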

Module pdfminer

PDFMinerToDocument

Converts PDF files to Documents.

Uses pdfminer-compatible converters to convert PDF files to Documents. See https://pdfminersix.readthedocs.io/en/latest/ for details.

Usage example:

from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

PDFMinerToDocument.__init__

def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: Optional[float] = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False) -> None

Create a PDFMinerToDocument component.

Arguments:

line_overlap: This parameter determines whether two characters are considered to be on the same line based on the amount of overlap between them. The overlap is calculated relative to the minimum height of both characters.
char_margin: Determines whether two characters are part of the same line based on the distance between them. If the distance is less than the margin specified, the characters are considered to be on the same line. The margin is calculated relative to the width of the character.
line_margin: This parameter determines whether two lines are part of the same paragraph based on the distance between them. If the distance is less than the margin specified, the lines are considered to be part of the same paragraph. The margin is calculated relative to the height of a line.
word_margin: Determines whether two characters on the same line are part of the same word based on the distance between them. If the distance is greater than the margin specified, an intermediate space will be added between them to make the text more readable. The margin is calculated relative to the width of the character.
boxes_flow: This parameter determines the importance of horizontal and vertical position when determining the order of text boxes. A value between -1.0 and +1.0 can be set, with -1.0 indicating that only horizontal position matters and +1.0 indicating that only vertical position matters. Setting the value to None disables advanced layout analysis, and text boxes are ordered based on the position of their bottom left corner.
detect_vertical: This parameter determines whether vertical text should be considered during layout analysis.
all_texts: Whether layout analysis should be performed on text in figures.
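
To make the role of word_margin more concrete, here is a deliberately simplified sketch of the decision it describes: a space is inserted when the gap between two neighbouring characters exceeds word_margin times the character width. The `Char` tuple and the `join_chars` function are hypothetical and only illustrate the rule; pdfminer's real layout analysis is more involved.

```python
from typing import List, Optional, Tuple

# Hypothetical character record (an assumption for this sketch):
# (character, x0, x1), the left and right edges of the glyph on one text line.
Char = Tuple[str, float, float]


def join_chars(chars: List[Char], word_margin: float = 0.1) -> str:
    """Concatenate characters left to right, adding a space where the gap is large."""
    text = ""
    previous: Optional[Char] = None
    for char in sorted(chars, key=lambda c: c[1]):
        if previous is not None:
            gap = char[1] - previous[2]
            width = char[2] - char[1]
            # The margin is relative to the width of the character.
            if gap > word_margin * width:
                text += " "
        text += char[0]
        previous = char
    return text


chars = [("H", 0.0, 1.0), ("i", 1.02, 1.5), ("y", 2.5, 3.5), ("o", 3.55, 4.5)]
joined = join_chars(chars, word_margin=0.1)
print(joined)
```

Raising word_margin makes the sketch more reluctant to split words, so the same characters can end up joined without a space.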

PDFMinerToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts PDF files to Documents.

Arguments:

sources: List of PDF file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents

Module pypdf

PyPDFConverter

A protocol that defines a converter which takes a PdfReader object and converts it into a Document object.

DefaultConverter

The default converter class that extracts text from a PdfReader object's pages and returns a Document.

DefaultConverter.convert

def convert(reader: "PdfReader") -> Document

Extract text from the PDF and return a Document object with the text content.

DefaultConverter.to_dict

def to_dict()

Serialize the converter to a dictionary.

DefaultConverter.from_dict

@classmethod
def from_dict(cls, data)

Deserialize the converter from a dictionary.
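
A custom converter only needs to provide the same three methods as DefaultConverter. The sketch below shows one possible shape under several stated assumptions: `Document`, `FakePage`, and `FakeReader` are stand-ins so the example runs without Haystack or pypdf installed, `PageSeparatedConverter` is a hypothetical class, and the dictionary layout used by `to_dict` is an assumption rather than a documented format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol


@dataclass
class Document:
    """Stand-in for haystack.Document (assumption for this sketch)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class FakePage:
    """Stand-in for a pypdf page; only extract_text() is used here."""
    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self) -> str:
        return self._text


class FakeReader:
    """Stand-in for pypdf.PdfReader; exposes a pages list like the real reader."""
    def __init__(self, pages: List[str]) -> None:
        self.pages = [FakePage(text) for text in pages]


class ConverterProtocol(Protocol):
    """The methods a converter is expected to offer, mirroring DefaultConverter."""
    def convert(self, reader: Any) -> Document: ...
    def to_dict(self) -> Dict[str, Any]: ...


class PageSeparatedConverter:
    """Hypothetical converter that joins page texts with a configurable separator."""
    def __init__(self, separator: str = "\f") -> None:
        self.separator = separator

    def convert(self, reader: Any) -> Document:
        text = self.separator.join(page.extract_text() for page in reader.pages)
        return Document(content=text)

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type name plus the constructor parameters.
        return {"type": "PageSeparatedConverter", "init_parameters": {"separator": self.separator}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PageSeparatedConverter":
        return cls(**data["init_parameters"])


converter: ConverterProtocol = PageSeparatedConverter(separator="\n---\n")
document = converter.convert(FakeReader(["Page one", "Page two"]))
restored = PageSeparatedConverter.from_dict(converter.to_dict())
print(document.content)
```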

PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses converters compatible with the PyPDF library. If no converter is provided, it uses a default text extraction converter. You can attach metadata to the resulting documents.

Usage example:

from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

PyPDFToDocument.__init__

def __init__(converter: Optional[PyPDFConverter] = None)

Create a PyPDFToDocument component.

Arguments:

converter: An instance of a PyPDFConverter compatible class.

PyPDFToDocument.to_dict

def to_dict()

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PyPDFToDocument.from_dict

@classmethod
def from_dict(cls, data)

Deserializes the component from a dictionary.

Arguments:

data: Dictionary with serialized data.

Returns:

Deserialized component.

PyPDFToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts PDF files to documents.

Arguments:

sources: List of file paths or ByteStream objects to convert.
meta: Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources, as they are zipped together. For ByteStream objects, their meta is added to the output documents.

Returns:

A dictionary with the following keys:

documents: A list of converted documents.

Module pptx

PPTXToDocument

Converts PPTX files to Documents.

Usage example:

from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'

PPTXToDocument.__init__

def __init__()

Create a PPTXToDocument component.

PPTXToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts PPTX files to Documents.

Arguments:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents

Module tika

XHTMLParser

Custom parser to extract pages from Tika XHTML content.

XHTMLParser.handle_starttag

def handle_starttag(tag: str, attrs: List[tuple])

Identify the start of a page div.

XHTMLParser.handle_endtag

def handle_endtag(tag: str)

Identify the end of a page div.

XHTMLParser.handle_data

def handle_data(data: str)

Populate the page content.
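
The three handlers above follow the callback pattern of Python's standard html.parser.HTMLParser. The sketch below shows how such a parser could collect page texts, assuming pages are wrapped in `<div class="page">` elements. The `PageParser` class is a hypothetical illustration, not the library's XHTMLParser, and it ignores nested divs for brevity.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PageParser(HTMLParser):
    """Collect the text of every <div class="page"> element (assumed page marker)."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False
        self.pages: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # Identify the start of a page div.
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True
            self.pages.append("")

    def handle_endtag(self, tag: str) -> None:
        # Identify the end of a page div (simplification: nested divs are not tracked).
        if tag == "div" and self.in_page:
            self.in_page = False

    def handle_data(self, data: str) -> None:
        # Populate the content of the current page.
        if self.in_page:
            self.pages[-1] += data


xhtml = (
    '<html><body>'
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)
parser = PageParser()
parser.feed(xhtml)
parser.close()
pages = [page.strip() for page in parser.pages]
print(pages)
```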

TikaDocumentConverter

Converts files of different types to Documents.

This component uses Apache Tika for parsing the files and, therefore, requires a running Tika server. For more options on running Tika, see the official documentation.

Usage example:

from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'

TikaDocumentConverter.__init__

def __init__(tika_url: str = "http://localhost:9998/tika")

Create a TikaDocumentConverter component.

Arguments:

tika_url: Tika server URL.

TikaDocumentConverter.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts files to Documents.

Arguments:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents

Module txt

TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding. It can attach metadata to the resulting documents.

Usage example:

from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'

TextFileToDocument.__init__

def __init__(encoding: str = "utf-8")

Creates a TextFileToDocument component.

Arguments:

encoding: The encoding of the text files to convert. If the encoding is specified in the metadata of a source ByteStream, it overrides this value.
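
The override rule for encoding can be sketched as a simple lookup: use the encoding from the ByteStream's metadata if present, otherwise fall back to the component's value. The `ByteStream` dataclass below is a stand-in so the example runs on its own, and `decode_stream` is a hypothetical helper; the metadata key name "encoding" is an assumption based on the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ByteStream:
    """Stand-in for Haystack's ByteStream (assumption for this sketch)."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def decode_stream(stream: ByteStream, default_encoding: str = "utf-8") -> str:
    """Decode the stream, letting the metadata's encoding override the default."""
    encoding = stream.meta.get("encoding", default_encoding)
    return stream.data.decode(encoding)


latin_stream = ByteStream(data="café".encode("latin-1"), meta={"encoding": "latin-1"})
utf8_stream = ByteStream(data="café".encode("utf-8"))

latin_text = decode_stream(latin_stream)  # uses latin-1 from the metadata
utf8_text = decode_stream(utf8_stream)    # falls back to utf-8
print(latin_text, utf8_text)
```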

TextFileToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts text files to documents.

Arguments:

sources: List of text file paths or ByteStream objects to convert.
meta: Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output documents.

Returns:

A dictionary with the following keys:

documents: A list of converted documents.

Module output_adapter

OutputAdaptationException

Exception raised when there is an error during output adaptation.

OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"

OutputAdapter.__init__

def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: Optional[Dict[str, Callable]] = None)

Create an OutputAdapter component.

Arguments:

template: A Jinja template that defines how to adapt the input data. The variables in the template define the input of this instance. For example, with this template:

{{ documents[0].content }}

The Component input will be documents.

output_type: The type of output this instance will return.
custom_filters: A dictionary of custom Jinja filters used in the template.

OutputAdapter.run

def run(**kwargs)

Renders the Jinja template with the provided inputs.

Arguments:

kwargs: Must contain all variables used in the template string.

Raises:

OutputAdaptationException: If template rendering fails.

Returns:

A dictionary with the following keys:

output: Rendered Jinja template.

OutputAdapter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OutputAdapter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OutputAdapter"

Deserializes the component from a dictionary.

Arguments:

data: The dictionary to deserialize from.

Returns:

The deserialized component.

Module openapi_functions

OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher. It can be specified in JSON or YAML format. Each function must have:

- unique operationId
- description
- requestBody and/or parameters
- schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the official documentation. For more details on OpenAI function calling, see the official documentation.

Usage example:

from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]

OpenAPIServiceToFunctions.__init__

def __init__()

Create an OpenAPIServiceToFunctions component.

OpenAPIServiceToFunctions.run

@component.output_types(functions=List[Dict[str, Any]],
                        openapi_specs=List[Dict[str, Any]])
def run(sources: List[Union[str, Path, ByteStream]]) -> Dict[str, Any]

Converts OpenAPI definitions into the OpenAI function calling format.

Arguments:

sources: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

Raises:

RuntimeError: If the OpenAPI definitions cannot be downloaded or processed.
ValueError: If the source type is not recognized or no functions are found in the OpenAPI definitions.

Returns:

A dictionary with the following keys:

functions: Function definitions in JSON object format
openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references
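
The kind of mapping this component performs can be illustrated with a minimal sketch that turns each operation's operationId, description, and parameters into a function definition. This is a simplified illustration under stated assumptions, not the component's implementation: `operations_to_functions` is a hypothetical helper, the example spec is invented, and requestBody handling and reference resolution are left out.

```python
from typing import Any, Dict, List

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def operations_to_functions(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one function definition per operation found under the spec's paths."""
    functions: List[Dict[str, Any]] = []
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "summary"
            properties: Dict[str, Any] = {}
            required: List[str] = []
            for param in operation.get("parameters", []):
                properties[param["name"]] = param.get("schema", {})
                if param.get("required"):
                    required.append(param["name"])
            functions.append(
                {
                    "name": operation["operationId"],
                    "description": operation.get("description", ""),
                    "parameters": {"type": "object", "properties": properties, "required": required},
                }
            )
    return functions


# Invented example definition, already parsed from JSON or YAML into a dictionary.
spec = {
    "openapi": "3.0.0",
    "paths": {
        "/weather": {
            "get": {
                "operationId": "getWeather",
                "description": "Get the current weather for a city.",
                "parameters": [
                    {"name": "city", "in": "query", "required": True, "schema": {"type": "string"}}
                ],
            }
        }
    },
}
functions = operations_to_functions(spec)
print(functions)
```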

Module docx

DOCXMetadata

Describes the metadata of a DOCX file.

Arguments:

author: The author
category: The category
comments: The comments
content_status: The content status
created: The creation date
identifier: The identifier
keywords: Available keywords
language: The language of the document
last_modified_by: The user who last modified the document
last_printed: The last printed date
modified: The last modification date
revision: The revision number
subject: The subject
title: The title
version: The version

DOCXToDocument

Converts DOCX files to Documents.

Uses python-docx library to convert the DOCX file to a document. This component does not preserve page breaks in the original document.

Usage example:

from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument

converter = DOCXToDocument()
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'

DOCXToDocument.__init__

def __init__()

Create a DOCXToDocument component.

DOCXToDocument.run

@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)

Converts DOCX files to Documents.

Arguments:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents
