Version: 2.25-unstable

Converters

azure

AzureOCRDocumentConverter

Converts files to documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see Azure documentation.

Usage example

python
import os
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)
results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

init

python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
)

Creates an AzureOCRDocumentConverter component.

Parameters:

  • endpoint (str) – The endpoint of your Azure resource.
  • api_key (Secret) – The API key of your Azure resource.
  • model_id (str) – The ID of the model you want to use. For a list of available models, see [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
  • preceding_context_len (int) – Number of lines before a table to include as preceding context (this is added to the metadata).
  • following_context_len (int) – Number of lines after a table to include as following context (this is added to the metadata).
  • merge_multiple_column_headers (bool) – If True, merges multiple column header rows into a single row.
  • page_layout (Literal['natural', 'single_column']) – The type of reading order to follow. Possible options:
      • natural: Uses the natural reading order determined by Azure.
      • single_column: Groups all lines with the same height on the page based on a threshold determined by threshold_y.
  • threshold_y (float | None) – Only relevant if page_layout is set to single_column. The threshold, in inches, that determines whether two recognized PDF elements are grouped into a single line. This is crucial for section headers or numbers, which may be spatially separated from the rest of the text on the horizontal axis.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
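As a rough illustration of the single_column grouping described above, here is a stdlib sketch (not Haystack's internal code) that merges recognized lines whose vertical positions fall within threshold_y of each other:

```python
# Illustrative sketch: group (y_position_in_inches, text) pairs into visual
# lines when their vertical positions differ by less than threshold_y.
def group_lines(lines, threshold_y=0.05):
    rows = []
    for y, text in sorted(lines, key=lambda item: item[0]):
        if rows and abs(y - rows[-1][0]) < threshold_y:
            rows[-1][1].append(text)  # same visual line: join horizontally
        else:
            rows.append((y, [text]))
    return [" ".join(parts) for _, parts in rows]

# A section number and its header sit on the same visual line (0.02 in apart),
# so they are merged; the body text 0.3 in lower starts a new line.
lines = [(1.00, "1."), (1.02, "Introduction"), (1.30, "Body text")]
print(group_lines(lines))
# ['1. Introduction', 'Body text']
```

This is why threshold_y matters for section headers and numbers: without grouping, the horizontally separated "1." and "Introduction" would be emitted as separate lines.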

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Convert a list of files to Documents using Azure's Document Intelligence service.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • A dictionary with the following keys:
      • documents: List of created Documents
      • raw_azure_response: List of raw Azure responses used to create the Documents
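The meta handling described above (one dict shared by all sources, or one dict per source, zipped) can be sketched in plain Python. This is an illustrative approximation of the documented behavior, not the component's actual code:

```python
# Sketch: normalize the `meta` argument so each source gets its own dict,
# per the semantics described in the run() documentation.
def normalize_meta(sources, meta):
    if meta is None:
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # A single dict applies to every produced Document.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        raise ValueError("Length of meta list must match number of sources.")
    return meta

sources = ["a.pdf", "b.pdf"]
pairs = list(zip(sources, normalize_meta(sources, {"lang": "en"})))
print(pairs)
# [('a.pdf', {'lang': 'en'}), ('b.pdf', {'lang': 'en'})]
```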

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> AzureOCRDocumentConverter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – The dictionary to deserialize from.

Returns:

  • AzureOCRDocumentConverter – The deserialized component.

csv

CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files but you can also set a custom encoding. It can attach metadata to the resulting documents.

Usage example

python
from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'

init

python
__init__(
    encoding: str = "utf-8",
    store_full_path: bool = False,
    *,
    conversion_mode: Literal["file", "row"] = "file",
    delimiter: str = ",",
    quotechar: str = '"'
)

Creates a CSVToDocument component.

Parameters:

  • encoding (str) – The encoding of the CSV files to convert. If the encoding is specified in the metadata of a source ByteStream, it overrides this value.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
  • conversion_mode (Literal['file', 'row']) – The conversion mode. Possible options:
      • "file" (default): one Document per CSV file whose content is the raw CSV text.
      • "row": convert each CSV row to its own Document (requires content_column in run()).
  • delimiter (str) – CSV delimiter used when parsing in row mode (passed to csv.DictReader).
  • quotechar (str) – CSV quote character used when parsing in row mode (passed to csv.DictReader).

run

python
run(
    sources: list[str | Path | ByteStream],
    *,
    content_column: str | None = None,
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
)

Converts CSV files to a Document (file mode) or to one Document per row (row mode).

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • content_column (str | None) – Required when conversion_mode="row". The column name whose values become Document.content for each row. The column must exist in the CSV header.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output documents.

Returns:

  • A dictionary with the following keys:
      • documents: Created documents
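Since the documentation states that row mode parses with csv.DictReader, the idea behind it can be sketched with the stdlib alone. This is a conceptual approximation, not CSVToDocument's source: each row becomes one record whose content comes from content_column, with the remaining columns available as metadata:

```python
import csv
import io

# Sketch of row-mode conversion: one record per CSV row, content taken
# from content_column, other columns kept as metadata.
def rows_to_docs(csv_text, content_column, delimiter=",", quotechar='"'):
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter, quotechar=quotechar)
    if content_column not in (reader.fieldnames or []):
        raise ValueError(f"Column '{content_column}' not found in CSV header.")
    return [
        {"content": row[content_column],
         "meta": {k: v for k, v in row.items() if k != content_column}}
        for row in reader
    ]

docs = rows_to_docs("title,body\nGreeting,Hello world\n", content_column="body")
print(docs)
# [{'content': 'Hello world', 'meta': {'title': 'Greeting'}}]
```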

docx

DOCXMetadata

Describes the metadata of a DOCX file.

Parameters:

  • author (str) – The author
  • category (str) – The category
  • comments (str) – The comments
  • content_status (str) – The content status
  • created (str | None) – The creation date (ISO formatted string)
  • identifier (str) – The identifier
  • keywords (str) – Available keywords
  • language (str) – The language of the document
  • last_modified_by (str) – User who last modified the document
  • last_printed (str | None) – The last printed date (ISO formatted string)
  • modified (str | None) – The last modification date (ISO formatted string)
  • revision (int) – The revision number
  • subject (str) – The subject
  • title (str) – The title
  • version (str) – The version

DOCXTableFormat

Bases: Enum

Supported formats for storing DOCX tabular data in a Document.

from_str

python
from_str(string: str) -> DOCXTableFormat

Convert a string to a DOCXTableFormat enum.

DOCXLinkFormat

Bases: Enum

Supported formats for storing DOCX link information in a Document.

from_str

python
from_str(string: str) -> DOCXLinkFormat

Convert a string to a DOCXLinkFormat enum.

DOCXToDocument

Converts DOCX files to Documents.

Uses python-docx library to convert the DOCX file to a document. This component does not preserve page breaks in the original document.

Usage example:

python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'

init

python
__init__(
    table_format: str | DOCXTableFormat = DOCXTableFormat.CSV,
    link_format: str | DOCXLinkFormat = DOCXLinkFormat.NONE,
    store_full_path: bool = False,
)

Create a DOCXToDocument component.

Parameters:

  • table_format (str | DOCXTableFormat) – The format for table output. Can be either DOCXTableFormat.MARKDOWN, DOCXTableFormat.CSV, "markdown", or "csv".
  • link_format (str | DOCXLinkFormat) – The format for link output. Possible options:
      • DOCXLinkFormat.MARKDOWN or "markdown" to get [text](address).
      • DOCXLinkFormat.PLAIN or "plain" to get text (address).
      • DOCXLinkFormat.NONE or "none" to get text without links.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
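To make the two table formats concrete, here is a small stdlib sketch of what the same table looks like rendered as Markdown versus CSV (an illustration of the output styles, not python-docx or Haystack code):

```python
import csv
import io

rows = [["Name", "Role"], ["Ada", "Engineer"]]

def to_markdown(rows):
    # Header row, separator row, then body rows, pipe-delimited.
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

print(to_markdown(rows))
# | Name | Role |
# | --- | --- |
# | Ada | Engineer |
print(to_csv(rows))
# Name,Role
# Ada,Engineer
```

Markdown tables keep visual structure for LLM prompts; CSV is more compact and easier to re-parse downstream.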

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> DOCXToDocument

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – The dictionary to deserialize from.

Returns:

  • DOCXToDocument – The deserialized component.

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts DOCX files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • A dictionary with the following keys:
      • documents: Created Documents

file_to_file_content

FileToFileContent

Converts files to FileContent objects to be included in ChatMessage objects.

Usage example

python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "video.mp4"]

file_contents = converter.run(sources=sources)["file_contents"]
print(file_contents)

# [FileContent(base64_data='...',
# mime_type='application/pdf',
# filename='document.pdf',
# extra={}),
# ...]

run

python
run(
    sources: list[str | Path | ByteStream],
    *,
    extra: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[FileContent]]

Converts files to FileContent objects.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • extra (dict[str, Any] | list[dict[str, Any]] | None) – Optional extra information to attach to the FileContent objects. Can be used to store provider-specific information. To avoid serialization issues, values should be JSON serializable. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the extra of all produced FileContent objects. If it's a list, its length must match the number of sources as they're zipped together.

Returns:

  • dict[str, list[FileContent]] – A dictionary with the following keys:
      • file_contents: A list of FileContent objects.
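The essence of such a conversion can be sketched with the stdlib: read the bytes, base64-encode them, and guess the MIME type from the filename. This mirrors the fields shown in the usage example above but is an assumed approximation, not FileToFileContent's source:

```python
import base64
import mimetypes
import pathlib
import tempfile

# Sketch: build a FileContent-like dict from a file on disk.
def file_to_content(path):
    data = pathlib.Path(path).read_bytes()
    mime_type, _ = mimetypes.guess_type(path)  # e.g. 'application/pdf' for .pdf
    return {
        "base64_data": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
        "filename": pathlib.Path(path).name,
        "extra": {},
    }

# Demo with a throwaway file standing in for a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
    f.write(b"%PDF-1.4 demo")
    tmp = f.name

content = file_to_content(tmp)
print(content["mime_type"])
# application/pdf
```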

html

HTMLToDocument

Converts an HTML file to a Document.

Usage example:

python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'

init

python
__init__(
    extraction_kwargs: dict[str, Any] | None = None,
    store_full_path: bool = False,
)

Create an HTMLToDocument component.

Parameters:

  • extraction_kwargs (dict[str, Any] | None) – A dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura extract function. For the full list of available arguments, see the Trafilatura documentation.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> HTMLToDocument

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – The dictionary to deserialize from.

Returns:

  • HTMLToDocument – The deserialized component.

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    extraction_kwargs: dict[str, Any] | None = None,
)

Converts a list of HTML files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of HTML file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
  • extraction_kwargs (dict[str, Any] | None) – Additional keyword arguments to customize the extraction process.

Returns:

  • A dictionary with the following keys:
      • documents: Created Documents

image/document_to_image

DocumentToImageContent

Converts documents sourced from PDF and image files into ImageContents.

This component processes a list of documents and extracts visual content from supported file formats, converting them into ImageContents that can be used for multimodal AI tasks. It handles both direct image files and PDF documents by extracting specific pages as images.

Documents are expected to have metadata containing:

  • The file_path_meta_field key with a valid file path that exists when combined with root_path
  • A supported image format (MIME type must be one of the supported image types)
  • For PDF files, a page_number key specifying which page to extract

Usage example

```python
from haystack import Document
from haystack.components.converters.image.document_to_image import DocumentToImageContent

converter = DocumentToImageContent(
    file_path_meta_field="file_path",
    root_path="/data/files",
    detail="high",
    size=(800, 600)
)

documents = [
    Document(content="Optional description of image.jpg", meta={"file_path": "image.jpg"}),
    Document(content="Text content of page 1 of doc.pdf", meta={"file_path": "doc.pdf", "page_number": 1}),
]

result = converter.run(documents)
image_contents = result["image_contents"]
# [ImageContent(
# base64_image='/9j/4A...', mime_type='image/jpeg', detail='high', meta={'file_path': 'image.jpg'}
# ),
# ImageContent(
# base64_image='/9j/4A...', mime_type='image/jpeg', detail='high',
# meta={'page_number': 1, 'file_path': 'doc.pdf'}
# )]
```

init

python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)

Initialize the DocumentToImageContent component.

Parameters:

  • file_path_meta_field (str) – The metadata field in the Document that contains the file path to the image or PDF.
  • root_path (str | None) – The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
  • detail (Literal['auto', 'high', 'low'] | None) – Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This will be passed to the created ImageContent objects.
  • size (tuple[int, int] | None) – If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
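The arithmetic behind fit-within resizing with a preserved aspect ratio can be shown in a few lines. This is an illustrative sketch of the idea described for size; assuming (as is common but not stated here) that images smaller than the target are left untouched:

```python
# Sketch: scale (width, height) to fit within (max_w, max_h) while
# keeping the aspect ratio. The cap at 1.0 (never upscale) is an
# assumption for this illustration.
def fit_within(width, height, max_w, max_h):
    scale = min(max_w / width, max_h / height, 1.0)
    return int(width * scale), int(height * scale)

print(fit_within(1600, 1200, 800, 600))  # (800, 600)
print(fit_within(1000, 400, 800, 600))   # (800, 320): width is the binding constraint
```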

run

python
run(documents: list[Document]) -> dict[str, list[ImageContent | None]]

Convert documents with image or PDF sources into ImageContent objects.

This method processes the input documents, extracting images from supported file formats and converting them into ImageContent objects.

Parameters:

  • documents (list[Document]) – A list of documents to process. Each document should have metadata containing, at minimum, the key named by file_path_meta_field. PDF documents additionally require a page_number key to specify which page to convert.

Returns:

  • dict[str, list[ImageContent | None]] – Dictionary containing one key:
      • "image_contents": ImageContents created from the processed documents. These contain base64-encoded image data and metadata. The order corresponds to the order of the input documents.

Raises:

  • ValueError – If any document is missing the required metadata keys, has an invalid file path, or has an unsupported MIME type. The error message will specify which document and what information is missing or incorrect.

image/file_to_document

ImageFileToDocument

Converts image file references into empty Document objects with associated metadata.

This component is useful in pipelines where image file paths need to be wrapped in Document objects to be processed by downstream components such as the SentenceTransformersImageDocumentEmbedder.

It does not extract any content from the image files; instead, it creates Document objects with None as their content and attaches metadata such as the file path and any user-provided values.

Usage example

python
from haystack.components.converters.image import ImageFileToDocument

converter = ImageFileToDocument()

sources = ["image.jpg", "another_image.png"]

result = converter.run(sources=sources)
documents = result["documents"]

print(documents)

# [Document(id=..., meta: {'file_path': 'image.jpg'}),
# Document(id=..., meta: {'file_path': 'another_image.png'})]

init

python
__init__(*, store_full_path: bool = False)

Initialize the ImageFileToDocument component.

Parameters:

  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

run

python
run(
    *,
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]

Convert image files into empty Document objects with metadata.

This method accepts image file references (as file paths or ByteStreams) and creates Document objects without content. These documents are enriched with metadata derived from the input source and optional user-provided metadata.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources, as they are zipped together. For ByteStream objects, their meta is added to the output documents.

Returns:

  • dict[str, list[Document]] – A dictionary containing:
      • documents: A list of Document objects with empty content and associated metadata.

image/file_to_image

ImageFileToImageContent

Converts image files to ImageContent objects.

Usage example

python
from haystack.components.converters.image import ImageFileToImageContent

converter = ImageFileToImageContent()

sources = ["image.jpg", "another_image.png"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
# mime_type='image/jpeg',
# detail=None,
# meta={'file_path': 'image.jpg'}),
# ...]

init

python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
)

Create the ImageFileToImageContent component.

Parameters:

  • detail (Literal['auto', 'high', 'low'] | None) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". This will be passed to the created ImageContent objects.
  • size (tuple[int, int] | None) – If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None
) -> dict[str, list[ImageContent]]

Converts files to ImageContent objects.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the ImageContent objects. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output ImageContent objects.
  • detail (Literal['auto', 'high', 'low'] | None) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". This will be passed to the created ImageContent objects. If not provided, the detail level will be the one set in the constructor.
  • size (tuple[int, int] | None) – If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services. If not provided, the size value will be the one set in the constructor.

Returns:

  • dict[str, list[ImageContent]] – A dictionary with the following keys:
      • image_contents: A list of ImageContent objects.

image/pdf_to_image

PDFToImageContent

Converts PDF files to ImageContent objects.

Usage example

python
from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
# mime_type='application/pdf',
# detail=None,
# meta={'file_path': 'file.pdf', 'page_number': 1}),
# ...]

init

python
__init__(
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
)

Create the PDFToImageContent component.

Parameters:

  • detail (Literal['auto', 'high', 'low'] | None) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". This will be passed to the created ImageContent objects.
  • size (tuple[int, int] | None) – If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
  • page_range (list[str | int] | None) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1. If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages) will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will convert pages 1, 2, 3, 5, 8, 10, 11, 12.
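The page_range syntax described above (integers plus printable range strings) can be parsed in a few lines. This is a sketch under the stated semantics, not Haystack's implementation, and it omits the out-of-range page check the component performs:

```python
# Sketch: expand a page_range like ["1-3", "5", "8", "10-12"] or [1, 3]
# into an explicit list of 1-based page numbers.
def expand_page_range(page_range):
    pages = []
    for item in page_range:
        if isinstance(item, int):
            pages.append(item)
        elif "-" in item:  # printable range string, e.g. "10-12"
            start, end = item.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:              # single page given as a string, e.g. "5"
            pages.append(int(item))
    return pages

print(expand_page_range(["1-3", "5", "8", "10-12"]))
# [1, 2, 3, 5, 8, 10, 11, 12]
```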

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
    *,
    detail: Literal["auto", "high", "low"] | None = None,
    size: tuple[int, int] | None = None,
    page_range: list[str | int] | None = None
) -> dict[str, list[ImageContent]]

Converts files to ImageContent objects.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the ImageContent objects. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced ImageContent objects. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output ImageContent objects.
  • detail (Literal['auto', 'high', 'low'] | None) – Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low". This will be passed to the created ImageContent objects. If not provided, the detail level will be the one set in the constructor.
  • size (tuple[int, int] | None) – If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services. If not provided, the size value will be the one set in the constructor.
  • page_range (list[str | int] | None) – List of page numbers and/or page ranges to convert to images. Page numbers start at 1. If None, all pages in the PDF will be converted. Pages outside the valid range (1 to number of pages) will be skipped with a warning. For example, page_range=[1, 3] will convert only the first and third pages of the document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will convert pages 1, 2, 3, 5, 8, 10, 11, 12. If not provided, the page_range value will be the one set in the constructor.

Returns:

  • dict[str, list[ImageContent]] – A dictionary with the following keys:
      • image_contents: A list of ImageContent objects.

json

JSONConverter

Converts one or more JSON files into a text document.

Usage examples

python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'

Optionally, you can also provide a jq_schema string to filter the JSON source files and extra_meta_fields to extract from the filtered data:

python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}

init

python
__init__(
    jq_schema: str | None = None,
    content_key: str | None = None,
    extra_meta_fields: set[str] | Literal["*"] | None = None,
    store_full_path: bool = False,
)

Creates a JSONConverter component.

An optional jq_schema can be provided to extract nested data in the JSON source files. See the official jq documentation for more info on the filters syntax. If jq_schema is not set, whole JSON source files will be used to extract content.

Optionally, you can provide a content_key to specify which key in the extracted object must be set as the document's content.

If both jq_schema and content_key are set, the component will search for the content_key in the JSON object extracted by jq_schema. If the extracted data is not a JSON object, it will be skipped.

If only jq_schema is set, the extracted data must be a scalar value. If it's a JSON object or array, it will be skipped.

If only content_key is set, the source JSON file must be a JSON object, else it will be skipped.

extra_meta_fields can be either a set of strings or the literal string "*". If it's a set of strings, it specifies which fields of the extracted objects to store in the created documents' metadata. If a field is not found, its meta value is set to None. If set to "*", all fields in the filtered JSON object except content_key are saved as metadata.

Initialization will fail if neither jq_schema nor content_key are set.
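The content_key and extra_meta_fields behavior described above can be sketched with the stdlib, leaving out the jq filtering step. This is an illustrative approximation of the documented rules, not JSONConverter's source:

```python
import json

# Sketch: pull content and metadata out of one extracted JSON object,
# following the skip rules described above.
def extract(obj, content_key, extra_meta_fields=None):
    if not isinstance(obj, dict) or content_key not in obj:
        return None  # not a JSON object with the content key: skipped
    if extra_meta_fields == "*":
        meta = {k: v for k, v in obj.items() if k != content_key}
    else:
        # Missing fields get None, as the documentation specifies.
        meta = {k: obj.get(k) for k in (extra_meta_fields or set())}
    return {"content": obj[content_key], "meta": meta}

doc = extract(json.loads('{"text": "hello", "author": "Rita"}'),
              content_key="text", extra_meta_fields={"author"})
print(doc)
# {'content': 'hello', 'meta': {'author': 'Rita'}}
```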

Parameters:

  • jq_schema (str | None) – Optional jq filter string to extract content. If not specified, whole JSON object will be used to extract information.
  • content_key (str | None) – Optional key to extract document content. If jq_schema is specified, the content_key will be extracted from that object.
  • extra_meta_fields (set[str] | Literal['*'] | None) – An optional set of meta keys to extract from the content. If jq_schema is specified, all keys will be extracted from that object.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> JSONConverter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • JSONConverter – Deserialized component.

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts a list of JSON files to documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – A list of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, the length of the list must match the number of sources. If sources contain ByteStream objects, their meta will be added to the output documents.

Returns:

  • – A dictionary with the following keys:
  • documents: A list of created documents.
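The handling of the meta argument described above can be sketched in plain Python. The helper below is illustrative, not the component's actual implementation:

```python
# A single dict is copied onto every document; a list is zipped
# one-to-one with the sources and must match their number.
sources = ["a.json", "b.json"]

def normalize_meta(meta, sources):
    if meta is None:
        return [{} for _ in sources]
    if isinstance(meta, dict):
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        raise ValueError("Length of meta must match the number of sources.")
    return meta

per_source = normalize_meta({"lang": "en"}, sources)
print(per_source)  # [{'lang': 'en'}, {'lang': 'en'}]
```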

markdown

MarkdownToDocument

Converts a Markdown file into a text Document.

Usage example:

python
from haystack.components.converters import MarkdownToDocument
from datetime import datetime

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'

init

python
__init__(
table_to_single_line: bool = False,
progress_bar: bool = True,
store_full_path: bool = False,
)

Create a MarkdownToDocument component.

Parameters:

  • table_to_single_line (bool) – If True, converts table contents into a single line.
  • progress_bar (bool) – If True, shows a progress bar when running.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts a list of Markdown files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • – A dictionary with the following keys:
  • documents: List of created Documents

msg

MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, subject) and body content from .msg files and converts them into structured Haystack Documents. Additionally, any file attachments within the .msg file are extracted as ByteStream objects.

Example Usage

python
from haystack.components.converters.msg import MSGToDocument
from datetime import datetime

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)

init

python
__init__(store_full_path: bool = False) -> None

Creates an MSGToDocument component.

Parameters:

  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document] | list[ByteStream]]

Converts MSG files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • dict[str, list[Document] | list[ByteStream]] – A dictionary with the following keys:
  • documents: Created Documents.
  • attachments: Created ByteStream objects from file attachments.

multi_file_converter

MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

  • CSV
  • DOCX
  • HTML
  • JSON
  • MD
  • TEXT
  • PDF (no OCR)
  • PPTX
  • XLSX

Usage example:

python
from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})

init

python
__init__(encoding: str = 'utf-8', json_content_key: str = 'content') -> None

Initialize the MultiFileConverter.

Parameters:

  • encoding (str) – The encoding to use when reading files.
  • json_content_key (str) – The key to use for the content field of a document when converting JSON files.

openapi_functions

OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must respect OpenAPI specification 3.0.0 or higher. It can be specified in JSON or YAML format. Each function must have:

  • a unique operationId
  • a description
  • a requestBody and/or parameters
  • a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the official documentation. For more details on OpenAI function calling, see the official documentation.

Usage example:

python
from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]

init

python
__init__()

Create an OpenAPIServiceToFunctions component.

run

python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]

Converts OpenAPI definitions into OpenAI function-calling format.

Parameters:

  • sources (list[str | Path | ByteStream]) – File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

Returns:

  • dict[str, Any] – A dictionary with the following keys:
  • functions: Function definitions in JSON object format
  • openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references

Raises:

  • RuntimeError – If the OpenAPI definitions cannot be downloaded or processed.
  • ValueError – If the source type is not recognized or no functions are found in the OpenAPI definitions.
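The target format can be illustrated by hand-building one function definition from a minimal OpenAPI operation. The operation below is invented; the field names follow OpenAI's function-calling schema:

```python
# A minimal, hypothetical OpenAPI operation.
openapi_operation = {
    "operationId": "getWeather",
    "description": "Get the current weather for a city.",
    "parameters": [
        {"name": "city", "in": "query", "required": True,
         "schema": {"type": "string"}},
    ],
}

# The corresponding OpenAI function definition: name from operationId,
# parameters gathered into a single JSON Schema object.
function_definition = {
    "name": openapi_operation["operationId"],
    "description": openapi_operation["description"],
    "parameters": {
        "type": "object",
        "properties": {
            p["name"]: p["schema"] for p in openapi_operation["parameters"]
        },
        "required": [
            p["name"] for p in openapi_operation["parameters"] if p.get("required")
        ],
    },
}
print(function_definition["name"])  # getWeather
```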

output_adapter

OutputAdaptationException

Bases: Exception

Exception raised when there is an error during output adaptation.

OutputAdapter

Adapts output of a Component using Jinja templates.

Usage example:

python
from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"

init

python
__init__(
template: str,
output_type: TypeAlias,
custom_filters: dict[str, Callable] | None = None,
unsafe: bool = False,
) -> None

Create an OutputAdapter component.

Parameters:

  • template (str) – A Jinja template that defines how to adapt the input data. The variables in the template define the input of this instance. e.g. With this template:
{{ documents[0].content }}

The Component input will be documents.

  • output_type (TypeAlias) – The type of output this instance will return.
  • custom_filters (dict[str, Callable] | None) – A dictionary of custom Jinja filters used in the template.
  • unsafe (bool) – Enable execution of arbitrary code in the Jinja template. This should only be used if you trust the source of the template, as it can lead to remote code execution.
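A custom filter is just a named callable. The sketch below shows the idea with jinja2 directly; with OutputAdapter you would pass the same callable as custom_filters={"shout": shout}. The filter name and behavior are invented for illustration:

```python
from jinja2 import Environment

def shout(value: str) -> str:
    """Hypothetical filter: upper-case the value and add an exclamation mark."""
    return value.upper() + "!"

env = Environment()
env.filters["shout"] = shout  # register the filter under the name "shout"

template = env.from_string("{{ text | shout }}")
result = template.render(text="hello")
print(result)  # HELLO!
```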

run

python
run(**kwargs)

Renders the Jinja template with the provided inputs.

Parameters:

  • kwargs – Must contain all variables used in the template string.

Returns:

  • – A dictionary with the following keys:
  • output: Rendered Jinja template.

Raises:

  • OutputAdaptationException – If template rendering fails.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> OutputAdapter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – The dictionary to deserialize from.

Returns:

  • OutputAdapter – The deserialized component.

pdfminer

PDFMinerToDocument

Converts PDF files to Documents.

Uses pdfminer-compatible converters to convert PDF files to Documents. https://pdfminersix.readthedocs.io/en/latest/

Usage example:

python
from haystack.components.converters.pdfminer import PDFMinerToDocument
from datetime import datetime

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

init

python
__init__(
line_overlap: float = 0.5,
char_margin: float = 2.0,
line_margin: float = 0.5,
word_margin: float = 0.1,
boxes_flow: float | None = 0.5,
detect_vertical: bool = True,
all_texts: bool = False,
store_full_path: bool = False,
) -> None

Create a PDFMinerToDocument component.

Parameters:

  • line_overlap (float) – This parameter determines whether two characters are considered to be on the same line based on the amount of overlap between them. The overlap is calculated relative to the minimum height of both characters.
  • char_margin (float) – Determines whether two characters are part of the same line based on the distance between them. If the distance is less than the margin specified, the characters are considered to be on the same line. The margin is calculated relative to the width of the character.
  • word_margin (float) – Determines whether two characters on the same line are part of the same word based on the distance between them. If the distance is greater than the margin specified, an intermediate space will be added between them to make the text more readable. The margin is calculated relative to the width of the character.
  • line_margin (float) – This parameter determines whether two lines are part of the same paragraph based on the distance between them. If the distance is less than the margin specified, the lines are considered to be part of the same paragraph. The margin is calculated relative to the height of a line.
  • boxes_flow (float | None) – This parameter determines the importance of horizontal and vertical position when determining the order of text boxes. A value between -1.0 and +1.0 can be set, with -1.0 indicating that only horizontal position matters and +1.0 indicating that only vertical position matters. Setting the value to 'None' will disable advanced layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
  • detect_vertical (bool) – This parameter determines whether vertical text should be considered during layout analysis.
  • all_texts (bool) – If layout analysis should be performed on text in figures.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

detect_undecoded_cid_characters

python
detect_undecoded_cid_characters(text: str) -> dict[str, Any]

Looks for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This is useful to detect if the text extractor is not able to extract the text correctly, e.g. if the PDF uses non-standard fonts.

A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor needs. If that map is not available the text extractor cannot decode the CID characters and will return them as is.

see: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

Parameters:

  • text (str) – The text to check for undecoded CID characters

Returns:

  • dict[str, Any] – A dictionary containing detection results
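A minimal sketch of such detection, not the component's actual algorithm: pdfminer emits undecoded characters as "(cid:<number>)" tokens, which can be counted with a regular expression. The result keys below are invented for illustration:

```python
import re

# Undecoded CID characters appear literally as "(cid:123)" in extracted text.
CID_PATTERN = re.compile(r"\(cid:\d+\)")

def detect_cid(text: str) -> dict:
    matches = CID_PATTERN.findall(text)
    return {"contains_cid": bool(matches), "cid_count": len(matches)}

report = detect_cid("Hello (cid:72)(cid:105) world")
print(report)  # {'contains_cid': True, 'cid_count': 2}
```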

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts PDF files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of PDF file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • – A dictionary with the following keys:
  • documents: Created Documents

pptx

PPTXToDocument

Converts PPTX files to Documents.

Usage example:

python
from haystack.components.converters.pptx import PPTXToDocument
from datetime import datetime

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'

init

python
__init__(
store_full_path: bool = False,
link_format: Literal["markdown", "plain", "none"] = "none",
)

Create a PPTXToDocument component.

Parameters:

  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
  • link_format (Literal['markdown', 'plain', 'none']) – The format for link output. Possible options:
  • "markdown": [text](url)
  • "plain": text (url)
  • "none": Only the text is extracted, link addresses are ignored.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts PPTX files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • – A dictionary with the following keys:
  • documents: Created Documents

pypdf

PyPDFExtractionMode

Bases: Enum

The mode to use for extracting text from a PDF.

from_str

python
from_str(string: str) -> PyPDFExtractionMode

Convert a string to a PyPDFExtractionMode enum.

PyPDFToDocument

Converts PDF files to documents your pipeline can query.

This component uses the PyPDF library. You can attach metadata to the resulting documents.

Usage example

python
from haystack.components.converters.pypdf import PyPDFToDocument
from datetime import datetime

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

init

python
__init__(
*,
extraction_mode: str | PyPDFExtractionMode = PyPDFExtractionMode.PLAIN,
plain_mode_orientations: tuple = (0, 90, 180, 270),
plain_mode_space_width: float = 200.0,
layout_mode_space_vertically: bool = True,
layout_mode_scale_weight: float = 1.25,
layout_mode_strip_rotated: bool = True,
layout_mode_font_height_weight: float = 1.0,
store_full_path: bool = False
)

Create a PyPDFToDocument component.

Parameters:

  • extraction_mode (str | PyPDFExtractionMode) – The mode to use for extracting text from a PDF. Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
  • plain_mode_orientations (tuple) – Tuple of orientations to look for when extracting text from a PDF in plain mode. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
  • plain_mode_space_width (float) – Forces default space width if not extracted from font. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
  • layout_mode_space_vertically (bool) – Whether to include blank lines inferred from y distance + font height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
  • layout_mode_scale_weight (float) – Multiplier for string length when calculating weighted average character width. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
  • layout_mode_strip_rotated (bool) – Layout mode does not support rotated text. Set to False to include rotated text anyway. If rotated text is discovered, layout will be degraded and a warning will be logged. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
  • layout_mode_font_height_weight (float) – Multiplier for font height when calculating blank line height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

to_dict

python
to_dict()

Serializes the component to a dictionary.

Returns:

  • – Dictionary with serialized data.

from_dict

python
from_dict(data)

Deserializes the component from a dictionary.

Parameters:

  • data – Dictionary with serialized data.

Returns:

  • – Deserialized component.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts PDF files to documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources, as they are zipped together. For ByteStream objects, their meta is added to the output documents.

Returns:

  • – A dictionary with the following keys:
  • documents: A list of converted documents.

tika

XHTMLParser

Bases: HTMLParser

Custom parser to extract pages from Tika XHTML content.

handle_starttag

python
handle_starttag(tag: str, attrs: list[tuple])

Identify the start of a page div.

handle_endtag

python
handle_endtag(tag: str)

Identify the end of a page div.

handle_data

python
handle_data(data: str)

Populate the page content.

TikaDocumentConverter

Converts files of different types to Documents.

This component uses Apache Tika for parsing the files and, therefore, requires a running Tika server. For more options on running Tika, see the official documentation.

Usage example:

python
from haystack.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
sources=["sample.docx", "my_document.rtf", "archive.zip"],
meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'

init

python
__init__(
tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
)

Create a TikaDocumentConverter component.

Parameters:

  • tika_url (str) – Tika server URL.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.

Returns:

  • – A dictionary with the following keys:
  • documents: Created Documents

txt

TextFileToDocument

Converts text files to documents your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding. It can attach metadata to the resulting documents.

Usage example

python
from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'

init

python
__init__(encoding: str = 'utf-8', store_full_path: bool = False)

Creates a TextFileToDocument component.

Parameters:

  • encoding (str) – The encoding of the text files to convert. If the encoding is specified in the metadata of a source ByteStream, it overrides this value.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
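The encoding rule above can be sketched in plain Python. The helper is illustrative, not the component's actual implementation:

```python
# The component's default encoding applies unless a source ByteStream
# carries its own encoding in its meta, which then takes precedence.
default_encoding = "utf-8"

def decode(source_bytes: bytes, meta: dict) -> str:
    encoding = meta.get("encoding", default_encoding)
    return source_bytes.decode(encoding)

utf8_text = decode("héllo".encode("utf-8"), meta={})
latin1_text = decode("héllo".encode("latin-1"), meta={"encoding": "latin-1"})
print(utf8_text, latin1_text)  # héllo héllo
```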

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
)

Converts text files to documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of text file paths or ByteStream objects to convert.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output documents.

Returns:

  • – A dictionary with the following keys:
  • documents: A list of converted documents.

xlsx

XLSXToDocument

Converts XLSX (Excel) files into Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

Usage example:

python
from haystack.components.converters.xlsx import XLSXToDocument
from datetime import datetime

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ',A,B\n1,col_a,col_b\n2,1.5,test\n'

init

python
__init__(
table_format: Literal["csv", "markdown"] = "csv",
sheet_name: str | int | list[str | int] | None = None,
read_excel_kwargs: dict[str, Any] | None = None,
table_format_kwargs: dict[str, Any] | None = None,
*,
link_format: Literal["markdown", "plain", "none"] = "none",
store_full_path: bool = False
)

Creates an XLSXToDocument component.

Parameters:

  • table_format (Literal['csv', 'markdown']) – The format to convert the Excel file to.
  • sheet_name (str | int | list[str | int] | None) – The name or index of the sheet(s) to read. If None, all sheets are read.
  • read_excel_kwargs (dict[str, Any] | None) – Additional arguments passed to pandas.read_excel when reading the Excel file.
  • table_format_kwargs (dict[str, Any] | None) – Additional keyword arguments passed to the table-format conversion function.
  • link_format (Literal['markdown', 'plain', 'none']) – The format for link output. Possible options:
  • "markdown": [text](url)
  • "plain": text (url)
  • "none": Only the text is extracted, link addresses are ignored.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

run

python
run(
sources: list[str | Path | ByteStream],
meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Converts XLSX files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output documents.

Returns:

  • dict[str, list[Document]] – A dictionary with the following keys:
  • documents: Created documents
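The sheet handling described above can be sketched conceptually in plain Python. The workbook data, the helper, and the meta layout are invented for illustration; the content is built with the csv module:

```python
import csv
import io

# When all sheets are read, each sheet becomes its own document whose
# content is the sheet's table rendered as CSV.
workbook = {
    "Sheet1": [["col_a", "col_b"], [1.5, "test"]],
    "Sheet2": [["x"], [42]],
}

def sheet_to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

documents = [
    {"content": sheet_to_csv(rows), "meta": {"sheet_name": name}}
    for name, rows in workbook.items()
]
print(len(documents))  # 2
```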