Various converters to transform data from one format to another.
Module azure
AzureOCRDocumentConverter
@component
class AzureOCRDocumentConverter()
A component for converting files to Documents using Azure's Document Intelligence service. Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. Follow the steps described in the Azure documentation to set up your resource.
Usage example:
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret
converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/document_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
AzureOCRDocumentConverter.__init__
def __init__(endpoint: str,
api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
model_id: str = "prebuilt-read")
Create an AzureOCRDocumentConverter component.
Arguments:
endpoint: The endpoint of your Azure resource.
api_key: The key of your Azure resource.
model_id: The ID of the model you want to use. Refer to the Azure documentation for a list of available models. Default: "prebuilt-read".
AzureOCRDocumentConverter.run
@component.output_types(documents=List[Document],
raw_azure_response=List[Dict])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[List[Dict[str, Any]]] = None)
Convert a list of files to Documents using Azure's Document Intelligence service.
Arguments:
sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: List of created Documents
raw_azure_response: List of raw Azure responses used to create the Documents
AzureOCRDocumentConverter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
AzureOCRDocumentConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOCRDocumentConverter"
Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
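The to_dict and from_dict pair is meant to round-trip a component through a plain dictionary. The sketch below illustrates that idea with a hypothetical ToyConverter class that is not part of Haystack. The dictionary layout (a "type" key plus an "init_parameters" key) is an assumption made for illustration, not a statement of what the real component emits.

```python
from typing import Any, Dict


class ToyConverter:
    """Hypothetical stand-in for a component; not part of Haystack."""

    def __init__(self, model_id: str = "prebuilt-read"):
        self.model_id = model_id

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type name plus the constructor arguments.
        return {"type": "ToyConverter", "init_parameters": {"model_id": self.model_id}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyConverter":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyConverter(model_id="custom-model")
restored = ToyConverter.from_dict(original.to_dict())
print(restored.model_id)
```

Storing only constructor arguments keeps the dictionary easy to write to JSON or YAML, which is the usual reason for this kind of method pair.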
Module html
HTMLToDocument
@component
class HTMLToDocument()
Converts an HTML file to a Document.
Usage example:
from haystack.components.converters import HTMLToDocument
converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
HTMLToDocument.__init__
def __init__(extractor_type: Literal[
"DefaultExtractor",
"ArticleExtractor",
"ArticleSentencesExtractor",
"LargestContentExtractor",
"CanolaExtractor",
"KeepEverythingExtractor",
"NumWordsRulesExtractor",
] = "DefaultExtractor")
Create an HTMLToDocument component.
Arguments:
extractor_type: Name of the extractor class to use. Defaults to DefaultExtractor. For more information on the different types of extractors, see the boilerpy3 documentation.
HTMLToDocument.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HTMLToDocument.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HTMLToDocument"
Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
HTMLToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts a list of HTML files to Documents.
Arguments:
sources: List of HTML file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: Created Documents
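The two accepted shapes of the meta argument can be sketched in plain Python. The helper name normalize_meta below is hypothetical and only mirrors the behavior described in the argument text: a single dictionary is applied to every source, while a list must have one entry per source so it can be zipped with them. It is not the library's own implementation.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: Meta, sources_count: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the meta list must match the number of sources.")
    return list(meta)


sources = ["a.html", "b.html"]
for source, source_meta in zip(sources, normalize_meta({"lang": "en"}, len(sources))):
    print(source, source_meta)
```

Copying the single dictionary per source avoids sharing one mutable object between Documents, which is a design choice of this sketch.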
Module markdown
MarkdownToDocument
@component
class MarkdownToDocument()
Converts a Markdown file into a text Document.
Usage example:
from datetime import datetime
from haystack.components.converters import MarkdownToDocument
converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
MarkdownToDocument.__init__
def __init__(table_to_single_line: bool = False, progress_bar: bool = True)
Create a MarkdownToDocument component.
Arguments:
table_to_single_line: If True, converts table contents into a single line.
progress_bar: If True, shows a progress bar when running.
MarkdownToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts a list of Markdown files to Documents.
Arguments:
sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: List of created Documents
Module pypdf
PyPDFConverter
class PyPDFConverter(Protocol)
A protocol that defines a converter which takes a PdfReader object and converts it into a Document object.
DefaultConverter
class DefaultConverter()
The default converter class that extracts text from a PdfReader object's pages and returns a Document.
DefaultConverter.convert
def convert(reader: "PdfReader") -> Document
Extract text from the PDF and return a Document object with the text content.
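A protocol of this kind is satisfied structurally: any class with a matching convert method qualifies, without inheriting from anything. The sketch below shows the pattern with typing.Protocol. FakePage, FakeReader, and FakeDocument are hypothetical stand-ins for pypdf's page and PdfReader objects and for Document, so the example runs without either library installed. UppercaseConverter is likewise invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class FakePage:
    """Hypothetical stand-in for a pypdf page."""
    text: str

    def extract_text(self) -> str:
        return self.text


@dataclass
class FakeReader:
    """Hypothetical stand-in for pypdf's PdfReader."""
    pages: List[FakePage] = field(default_factory=list)


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a Document."""
    content: str


class Converter(Protocol):
    def convert(self, reader: FakeReader) -> FakeDocument: ...


class UppercaseConverter:
    """Satisfies the protocol structurally; no inheritance needed."""

    def convert(self, reader: FakeReader) -> FakeDocument:
        # Join the text of every page, then transform it.
        text = "\n".join(page.extract_text() for page in reader.pages)
        return FakeDocument(content=text.upper())


def run_converter(converter: Converter, reader: FakeReader) -> FakeDocument:
    return converter.convert(reader)


doc = run_converter(UppercaseConverter(), FakeReader([FakePage("first"), FakePage("second")]))
print(doc.content)
```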
PyPDFToDocument
@component
class PyPDFToDocument()
Converts PDF files to Documents.
Uses pypdf compatible converters to convert PDF files to Documents.
A default text extraction converter is used if one is not provided.
Usage example:
from datetime import datetime
from haystack.components.converters.pypdf import PyPDFToDocument
converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
PyPDFToDocument.__init__
def __init__(converter_name: str = "default")
Create a PyPDFToDocument component.
Arguments:
converter_name: Name of the registered converter to use.
PyPDFToDocument.to_dict
def to_dict()
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
PyPDFToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts PDF files to Documents.
Arguments:
sources: List of PDF file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: Created Documents
Module tika
TikaDocumentConverter
@component
class TikaDocumentConverter()
Converts files of different types to Documents.
This component uses Apache Tika for parsing the files and, therefore, requires a running Tika server. For more options on running Tika, see the official documentation.
Usage example:
from datetime import datetime
from haystack.components.converters.tika import TikaDocumentConverter
converter = TikaDocumentConverter()
results = converter.run(
sources=["sample.docx", "my_document.rtf", "archive.zip"],
meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
TikaDocumentConverter.__init__
def __init__(tika_url: str = "http://localhost:9998/tika")
Create a TikaDocumentConverter component.
Arguments:
tika_url: Tika server URL.
TikaDocumentConverter.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts files to Documents.
Arguments:
sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: Created Documents
Module txt
TextFileToDocument
@component
class TextFileToDocument()
Converts text files to Documents.
Usage example:
from haystack.components.converters.txt import TextFileToDocument
converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
TextFileToDocument.__init__
def __init__(encoding: str = "utf-8")
Create a TextFileToDocument component.
Arguments:
encoding: The encoding of the text files. Note that if the encoding is specified in the metadata of a source ByteStream, it will override this value.
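The override rule can be sketched as a metadata lookup with a fallback. FakeByteStream is a hypothetical stand-in for ByteStream, and the metadata key name "encoding" is an assumption made for this sketch. The source only states that an encoding found in the ByteStream metadata takes precedence over the constructor value.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for ByteStream: raw bytes plus metadata."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def decode(stream: FakeByteStream, default_encoding: str = "utf-8") -> str:
    # Assumption: the encoding lives under the "encoding" metadata key.
    encoding = stream.meta.get("encoding", default_encoding)
    return stream.data.decode(encoding)


text = "caf\u00e9"
plain = FakeByteStream(text.encode("utf-8"))
latin = FakeByteStream(text.encode("latin-1"), meta={"encoding": "latin-1"})
print(decode(plain) == decode(latin))
```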
TextFileToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts text files to Documents.
Arguments:
sources: List of text file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents: Created Documents
Module output_adapter
OutputAdaptationException
class OutputAdaptationException(Exception)
Exception raised when there is an error during output adaptation.
OutputAdapter
@component
class OutputAdapter()
Adapts output of a Component using Jinja templates.
Usage example:
from haystack import Document
from haystack.components.converters import OutputAdapter
adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)
assert result["output"] == "Test content"
OutputAdapter.__init__
def __init__(template: str,
output_type: TypeAlias,
custom_filters: Optional[Dict[str, Callable]] = None)
Create an OutputAdapter component.
Arguments:
template: A Jinja template that defines how to adapt the input data. The variables in the template define the input of this instance. For example, with this template:
{{ documents[0].content }}
The Component input will be documents.
output_type: The type of output this instance will return.
custom_filters: A dictionary of custom Jinja filters used in the template.
OutputAdapter.run
def run(**kwargs)
Renders the Jinja template with the provided inputs.
Arguments:
kwargs: Must contain all variables used in the template string.
Raises:
OutputAdaptationException: If template rendering fails.
Returns:
A dictionary with the following keys:
output: Rendered Jinja template.
OutputAdapter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OutputAdapter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OutputAdapter"
Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
Module openapi_functions
OpenAPIServiceToFunctions
@component
class OpenAPIServiceToFunctions()
Converts OpenAPI service definitions to a format suitable for OpenAI function calling.
The definition must respect OpenAPI specification 3.0.0 or higher. It can be specified in JSON or YAML format. Each function must have:
- unique operationId
- description
- requestBody and/or parameters
- schema for the requestBody and/or parameters
For more details on the OpenAPI specification, see the official documentation. For more details on OpenAI function calling, see the official documentation.
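The requirements listed above can be expressed as a small pre-check on a single operation. The function missing_requirements and the sample operations are hypothetical and not part of the component. The sketch covers only the presence of operationId, description, and requestBody or parameters. It does not verify that operationId values are unique across the definition or that schemas are present.

```python
from typing import Any, Dict, List


def missing_requirements(operation: Dict[str, Any]) -> List[str]:
    """Hypothetical check: list which required pieces an operation lacks."""
    missing = []
    if not operation.get("operationId"):
        missing.append("operationId")
    if not operation.get("description"):
        missing.append("description")
    if not (operation.get("requestBody") or operation.get("parameters")):
        missing.append("requestBody or parameters")
    return missing


complete = {
    "operationId": "getWeather",
    "description": "Return the weather for a city.",
    "parameters": [{"name": "city", "in": "query", "schema": {"type": "string"}}],
}
incomplete = {"operationId": "ping"}

print(missing_requirements(complete))
print(missing_requirements(incomplete))
```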
Usage example:
from haystack.components.converters import OpenAPIServiceToFunctions
converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
OpenAPIServiceToFunctions.__init__
def __init__()
Create an OpenAPIServiceToFunctions component.
OpenAPIServiceToFunctions.run
@component.output_types(functions=List[Dict[str, Any]],
openapi_specs=List[Dict[str, Any]])
def run(sources: List[Union[str, Path, ByteStream]]) -> Dict[str, Any]
Converts OpenAPI definitions into OpenAI function calling format.
Arguments:
sources: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).
Raises:
RuntimeError: If the OpenAPI definitions cannot be downloaded or processed.
ValueError: If the source type is not recognized or no functions are found in the OpenAPI definitions.
Returns:
A dictionary with the following keys:
- functions: Function definitions in JSON object format
- openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references
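To give a rough idea of what a function definition in the functions list may look like, the sketch below maps one OpenAPI operation to a dictionary with name, description, and parameters keys, which is the general shape OpenAI function calling expects. The helper operation_to_function is hypothetical and greatly simplified: it handles only the parameters list and ignores requestBody and reference resolution. It is not the component's actual conversion logic.

```python
from typing import Any, Dict


def operation_to_function(operation: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical, simplified mapping of one operation to a function definition."""
    properties: Dict[str, Any] = {}
    required = []
    for param in operation.get("parameters", []):
        # Each parameter's schema becomes a property of the arguments object.
        properties[param["name"]] = param.get("schema", {})
        if param.get("required"):
            required.append(param["name"])
    return {
        "name": operation["operationId"],
        "description": operation["description"],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


operation = {
    "operationId": "getWeather",
    "description": "Return the weather for a city.",
    "parameters": [
        {"name": "city", "in": "query", "required": True, "schema": {"type": "string"}},
        {"name": "units", "in": "query", "schema": {"type": "string"}},
    ],
}
function = operation_to_function(operation)
print(function["name"], function["parameters"]["required"])
```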
