Various converters to transform data from one format to another.
Module azure
AzureOCRDocumentConverter
Converts files to documents using Azure's Document Intelligence service.
Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see Azure documentation.
Usage example
from datetime import datetime
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret
converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
AzureOCRDocumentConverter.__init__
def __init__(endpoint: str,
api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
model_id: str = "prebuilt-read",
preceding_context_len: int = 3,
following_context_len: int = 3,
merge_multiple_column_headers: bool = True,
page_layout: Literal["natural", "single_column"] = "natural",
threshold_y: Optional[float] = 0.05,
store_full_path: bool = True)
Creates an AzureOCRDocumentConverter component.
Arguments:
endpoint
: The endpoint of your Azure resource.
api_key
: The API key of your Azure resource.
model_id
: The ID of the model you want to use. For a list of available models, see the Azure documentation (https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
preceding_context_len
: Number of lines before a table to include as preceding context (this will be added to the metadata).
following_context_len
: Number of lines after a table to include as subsequent context (this will be added to the metadata).
merge_multiple_column_headers
: If True, merges multiple column header rows into a single row.
page_layout
: The type of reading order to follow. Possible options:
- natural: Uses the natural reading order determined by Azure.
- single_column: Groups all lines with the same height on the page based on a threshold determined by threshold_y.
threshold_y
: Only relevant if page_layout is set to single_column. The threshold, in inches, to determine if two recognized PDF elements are grouped into a single line. This is crucial for section headers or numbers which may be spatially separated from the remaining text on the horizontal axis.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
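The single_column grouping controlled by threshold_y can be pictured with a minimal sketch. This is not the component's implementation: the function `group_by_height` and the `(y, x, text)` element layout are hypothetical, and the sketch only assumes that elements whose vertical positions (in inches) differ by at most threshold_y belong to the same line.

```python
from typing import List, Tuple

# Hypothetical element layout: (y position in inches, x position in inches, text).
Element = Tuple[float, float, str]


def group_by_height(elements: List[Element], threshold_y: float = 0.05) -> List[str]:
    """Merge elements whose y positions differ by at most threshold_y into one line."""
    lines: List[List[Element]] = []
    for element in sorted(elements):  # top to bottom, then left to right
        # Compare against the first element of the most recent line.
        if lines and abs(element[0] - lines[-1][0][0]) <= threshold_y:
            lines[-1].append(element)
        else:
            lines.append([element])
    # Within each line, order the pieces by their horizontal position.
    return [" ".join(text for _, _, text in sorted(line, key=lambda e: e[1])) for line in lines]


# A section number and its header sit at almost the same height but far apart horizontally.
elements = [(1.00, 1.5, "Introduction"), (1.02, 0.5, "1."), (1.50, 0.5, "Body text")]
grouped = group_by_height(elements)
print(grouped)
```

With the default threshold, the section number and the header end up on one line, which is the case the threshold_y description calls out.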
AzureOCRDocumentConverter.run
@component.output_types(documents=List[Document],
raw_azure_response=List[Dict])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[List[Dict[str, Any]]] = None)
Convert a list of files to Documents using Azure's Document Intelligence service.
Arguments:
sources
: List of file paths or ByteStream objects.
meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: List of created Documents
raw_azure_response
: List of raw Azure responses used to create the Documents
AzureOCRDocumentConverter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
AzureOCRDocumentConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOCRDocumentConverter"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
Module csv
CSVToDocument
Converts CSV files to Documents.
By default, it uses UTF-8 encoding when converting files but you can also set a custom encoding.
It can attach metadata to the resulting documents.
Usage example
```python
from datetime import datetime
from haystack.components.converters.csv import CSVToDocument
converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'
```
CSVToDocument.__init__
def __init__(encoding: str = "utf-8", store_full_path: bool = True)
Creates a CSVToDocument component.
Arguments:
encoding
: The encoding of the CSV files to convert. If the encoding is specified in the metadata of a source ByteStream, it overrides this value.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
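The store_full_path option, which appears on most converters in this reference, can be pictured with a minimal sketch. The helper `file_path_meta` is hypothetical and is not part of Haystack; it only mirrors the documented choice between the full path and the file name.

```python
from pathlib import PurePosixPath


def file_path_meta(path: str, store_full_path: bool) -> str:
    """Hypothetical helper: return the value that would be kept in the metadata."""
    # Full path when store_full_path is True, otherwise only the final path component.
    return path if store_full_path else PurePosixPath(path).name


full = file_path_meta("docs/reports/q1.csv", store_full_path=True)
name_only = file_path_meta("docs/reports/q1.csv", store_full_path=False)
print(full, name_only)
```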
CSVToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts a CSV file to a Document.
Arguments:
sources
: List of file paths or ByteStream objects.
meta
: Optional metadata to attach to the documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output documents.
Returns:
A dictionary with the following keys:
documents
: Created documents
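The two accepted shapes of meta can be illustrated with a small sketch. The helper `expand_meta` is hypothetical and only follows the rules described above (a single dictionary applies to every source, a list must match the sources one to one); it is not the library's own code.

```python
from typing import Any, Dict, List, Optional, Union


def expand_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]],
    sources_count: int,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: expand meta to exactly one dictionary per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the meta list must match the number of sources.")
    return [dict(item) for item in meta]


sources = ["a.csv", "b.csv"]
shared = expand_meta({"language": "en"}, len(sources))
per_source = expand_meta([{"page": 1}, {"page": 2}], len(sources))
# Zipping pairs each source with its own metadata dictionary.
paired = list(zip(sources, per_source))
print(shared, paired)
```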
Module docx
DOCXMetadata
Describes the metadata of a DOCX file.
Arguments:
author
: The author
category
: The category
comments
: The comments
content_status
: The content status
created
: The creation date (ISO formatted string)
identifier
: The identifier
keywords
: Available keywords
language
: The language of the document
last_modified_by
: User who last modified the document
last_printed
: The last printed date (ISO formatted string)
modified
: The last modification date (ISO formatted string)
revision
: The revision number
subject
: The subject
title
: The title
version
: The version
DOCXTableFormat
Supported formats for storing DOCX tabular data in a Document.
DOCXTableFormat.from_str
@staticmethod
def from_str(string: str) -> "DOCXTableFormat"
Convert a string to a DOCXTableFormat enum.
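A string-to-enum conversion of this kind can be sketched as follows. `TableFormat` is a hypothetical stand-in, not the real DOCXTableFormat; the member values "markdown" and "csv" are taken from the strings accepted by table_format, and the error handling is an assumption.

```python
from enum import Enum


class TableFormat(Enum):
    """Hypothetical stand-in used only to illustrate from_str."""

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self) -> str:
        return self.value

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Map each member's string value back to the member itself.
        mapping = {fmt.value: fmt for fmt in TableFormat}
        fmt = mapping.get(string)
        if fmt is None:
            # Assumption: unknown strings are rejected with a ValueError.
            raise ValueError(f"Unknown table format '{string}'. Supported formats: {list(mapping)}")
        return fmt


chosen = TableFormat.from_str("csv")
print(chosen)
```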
DOCXToDocument
Converts DOCX files to Documents.
Uses the python-docx library to convert the DOCX file to a document.
This component does not preserve page breaks in the original document.
Usage example:
from datetime import datetime
from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat
converter = DOCXToDocument(table_format=DOCXTableFormat.CSV)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'
DOCXToDocument.__init__
def __init__(table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.CSV,
store_full_path: bool = True)
Create a DOCXToDocument component.
Arguments:
table_format
: The format for table output. Can be either DOCXTableFormat.MARKDOWN, DOCXTableFormat.CSV, "markdown", or "csv". Defaults to DOCXTableFormat.CSV.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
DOCXToDocument.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
DOCXToDocument.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DOCXToDocument"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
DOCXToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts DOCX files to Documents.
Arguments:
sources
: List of file paths or ByteStream objects.
meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: Created Documents
Module html
HTMLToDocument
Converts an HTML file to a Document.
Usage example:
from haystack.components.converters import HTMLToDocument
converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
HTMLToDocument.__init__
def __init__(extraction_kwargs: Optional[Dict[str, Any]] = None,
store_full_path: bool = True)
Create an HTMLToDocument component.
Arguments:
extraction_kwargs
: A dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura extract function. For the full list of available arguments, see the Trafilatura documentation.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
HTMLToDocument.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HTMLToDocument.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HTMLToDocument"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
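The to_dict and from_dict pair is meant to round-trip a component's configuration. The sketch below shows the idea on a hypothetical class, `SketchConverter`; the dictionary layout with "type" and "init_parameters" keys is an assumption made for illustration and should be checked against the actual output of to_dict.

```python
from typing import Any, Dict


class SketchConverter:
    """Hypothetical component used only to illustrate the serialization round trip."""

    def __init__(self, store_full_path: bool = True) -> None:
        self.store_full_path = store_full_path

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: the class path plus the arguments needed to rebuild the instance.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"store_full_path": self.store_full_path},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchConverter":
        # Rebuild the instance from the stored init parameters.
        return cls(**data.get("init_parameters", {}))


serialized = SketchConverter(store_full_path=False).to_dict()
restored = SketchConverter.from_dict(serialized)
print(serialized["init_parameters"], restored.store_full_path)
```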
HTMLToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
extraction_kwargs: Optional[Dict[str, Any]] = None)
Converts a list of HTML files to Documents.
Arguments:
sources
: List of HTML file paths or ByteStream objects.
meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
extraction_kwargs
: Additional keyword arguments to customize the extraction process.
Returns:
A dictionary with the following keys:
documents
: Created Documents
Module json
JSONConverter
Converts one or more JSON files into a text document.
Usage examples
import json
from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream
source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))
converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'
Optionally, you can also provide a jq_schema string to filter the JSON source files and extra_meta_fields to extract from the filtered data:
import json
from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream
data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
JSONConverter.__init__
def __init__(jq_schema: Optional[str] = None,
content_key: Optional[str] = None,
extra_meta_fields: Optional[Union[Set[str], Literal["*"]]] = None,
store_full_path: bool = True)
Creates a JSONConverter component.
An optional jq_schema can be provided to extract nested data in the JSON source files. See the official jq documentation for more info on the filters syntax. If jq_schema is not set, whole JSON source files will be used to extract content.
Optionally, you can provide a content_key to specify which key in the extracted object must be set as the document's content.
If both jq_schema and content_key are set, the component will search for the content_key in the JSON object extracted by jq_schema. If the extracted data is not a JSON object, it will be skipped.
If only jq_schema is set, the extracted data must be a scalar value. If it's a JSON object or array, it will be skipped.
If only content_key is set, the source JSON file must be a JSON object, else it will be skipped.
extra_meta_fields can either be set to a set of strings or a literal "*" string. If it's a set of strings, it must specify fields in the extracted objects that must be set in the extracted documents. If a field is not found, the meta value will be None. If set to "*", all fields that are not content_key found in the filtered JSON object will be saved as metadata.
Initialization will fail if neither jq_schema nor content_key is set.
Arguments:
jq_schema
: Optional jq filter string to extract content. If not specified, whole JSON object will be used to extract information.
content_key
: Optional key to extract document content. If jq_schema is specified, the content_key will be extracted from that object.
extra_meta_fields
: An optional set of meta keys to extract from the content. If jq_schema is specified, all keys will be extracted from that object.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
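The interplay of content_key and extra_meta_fields can be sketched in plain Python, leaving the jq filtering step aside. The helper `split_content_and_meta` is hypothetical; it follows the rules described above for one already-extracted JSON value, and its treatment of objects that lack the content key is an assumption.

```python
from typing import Any, Dict, Optional, Set, Tuple, Union


def split_content_and_meta(
    obj: Any,
    content_key: str,
    extra_meta_fields: Optional[Union[Set[str], str]] = None,
) -> Optional[Tuple[Any, Dict[str, Any]]]:
    """Hypothetical helper: return (content, meta), or None when the value is skipped."""
    if not isinstance(obj, dict):
        return None  # extracted data that is not a JSON object is skipped
    if content_key not in obj:
        return None  # assumption: objects without the content key are skipped as well
    if extra_meta_fields == "*":
        # Every field except the content key becomes metadata.
        meta = {key: value for key, value in obj.items() if key != content_key}
    elif extra_meta_fields:
        # Requested fields that are missing get the value None.
        meta = {field: obj.get(field) for field in extra_meta_fields}
    else:
        meta = {}
    return obj[content_key], meta


laureate = {"firstname": "Enrico", "surname": "Fermi", "motivation": "for his demonstrations"}
selected = split_content_and_meta(laureate, "motivation", {"firstname", "born"})
everything = split_content_and_meta(laureate, "motivation", "*")
print(selected, everything)
```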
JSONConverter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
JSONConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "JSONConverter"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
JSONConverter.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts a list of JSON files to documents.
Arguments:
sources
: A list of file paths or ByteStream objects.
meta
: Optional metadata to attach to the documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, the length of the list must match the number of sources. If sources contains ByteStream objects, their meta will be added to the output documents.
Returns:
A dictionary with the following keys:
documents
: A list of created documents.
Module markdown
MarkdownToDocument
Converts a Markdown file into a text Document.
Usage example:
from haystack.components.converters import MarkdownToDocument
from datetime import datetime
converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'
MarkdownToDocument.__init__
def __init__(table_to_single_line: bool = False,
progress_bar: bool = True,
store_full_path: bool = True)
Create a MarkdownToDocument component.
Arguments:
table_to_single_line
: If True, converts table contents into a single line.
progress_bar
: If True, shows a progress bar when running.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
MarkdownToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts a list of Markdown files to Documents.
Arguments:
sources
: List of file paths or ByteStream objects.
meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: List of created Documents
Module openapi_functions
OpenAPIServiceToFunctions
Converts OpenAPI service definitions to a format suitable for OpenAI function calling.
The definition must respect OpenAPI specification 3.0.0 or higher. It can be specified in JSON or YAML format. Each function must have:
- unique operationId
- description
- requestBody and/or parameters
- schema for the requestBody and/or parameters
For more details on OpenAPI specification see the official documentation. For more details on OpenAI function calling see the official documentation.
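The per-function requirements can be checked up front with a short sketch. The helper `missing_requirements` is hypothetical and is not part of the component; it inspects one OpenAPI operation object given as a plain dictionary and does not validate the schemas or the uniqueness of operationId across operations.

```python
from typing import Any, Dict, List


def missing_requirements(operation: Dict[str, Any]) -> List[str]:
    """Hypothetical check: list the required pieces that an operation object lacks."""
    missing = []
    if not operation.get("operationId"):
        missing.append("operationId")
    if not operation.get("description"):
        missing.append("description")
    # Either a request body or a parameter list is enough.
    if not (operation.get("requestBody") or operation.get("parameters")):
        missing.append("requestBody and/or parameters")
    return missing


complete = {
    "operationId": "search",
    "description": "Search the web",
    "parameters": [{"name": "q", "in": "query", "schema": {"type": "string"}}],
}
incomplete = {"operationId": "ping"}
print(missing_requirements(complete), missing_requirements(incomplete))
```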
Usage example:
from haystack.components.converters import OpenAPIServiceToFunctions
converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]
OpenAPIServiceToFunctions.__init__
def __init__()
Create an OpenAPIServiceToFunctions component.
OpenAPIServiceToFunctions.run
@component.output_types(functions=List[Dict[str, Any]],
openapi_specs=List[Dict[str, Any]])
def run(sources: List[Union[str, Path, ByteStream]]) -> Dict[str, Any]
Converts OpenAPI definitions to the OpenAI function calling format.
Arguments:
sources
: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).
Raises:
RuntimeError
: If the OpenAPI definitions cannot be downloaded or processed.
ValueError
: If the source type is not recognized or no functions are found in the OpenAPI definitions.
Returns:
A dictionary with the following keys:
- functions: Function definitions in JSON object format
- openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references
Module output_adapter
OutputAdaptationException
Exception raised when there is an error during output adaptation.
OutputAdapter
Adapts output of a Component using Jinja templates.
Usage example:
from haystack import Document
from haystack.components.converters import OutputAdapter
adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)
assert result["output"] == "Test content"
OutputAdapter.__init__
def __init__(template: str,
output_type: TypeAlias,
custom_filters: Optional[Dict[str, Callable]] = None,
unsafe: bool = False)
Create an OutputAdapter component.
Arguments:
template
: A Jinja template that defines how to adapt the input data. The variables in the template define the input of this instance. e.g. With this template:
{{ documents[0].content }}
The Component input will be documents
.
output_type
: The type of output this instance will return.custom_filters
: A dictionary of custom Jinja filters used in the template.unsafe
: Enable execution of arbitrary code in the Jinja template. This should only be used if you trust the source of the template as it can be lead to remote code execution.
OutputAdapter.run
def run(**kwargs)
Renders the Jinja template with the provided inputs.
Arguments:
kwargs
: Must contain all variables used in the template string.
Raises:
OutputAdaptationException
: If template rendering fails.
Returns:
A dictionary with the following keys:
output
: Rendered Jinja template.
OutputAdapter.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OutputAdapter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OutputAdapter"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
Module pdfminer
PDFMinerToDocument
Converts PDF files to Documents.
Uses pdfminer-compatible converters to convert PDF files to Documents. See https://pdfminersix.readthedocs.io/en/latest/
Usage example:
from datetime import datetime
from haystack.components.converters.pdfminer import PDFMinerToDocument
converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
PDFMinerToDocument.__init__
def __init__(line_overlap: float = 0.5,
char_margin: float = 2.0,
line_margin: float = 0.5,
word_margin: float = 0.1,
boxes_flow: Optional[float] = 0.5,
detect_vertical: bool = True,
all_texts: bool = False,
store_full_path: bool = True) -> None
Create a PDFMinerToDocument component.
Arguments:
line_overlap
: This parameter determines whether two characters are considered to be on the same line based on the amount of overlap between them. The overlap is calculated relative to the minimum height of both characters.
char_margin
: Determines whether two characters are part of the same line based on the distance between them. If the distance is less than the margin specified, the characters are considered to be on the same line. The margin is calculated relative to the width of the character.
word_margin
: Determines whether two characters on the same line are part of the same word based on the distance between them. If the distance is greater than the margin specified, an intermediate space will be added between them to make the text more readable. The margin is calculated relative to the width of the character.
line_margin
: This parameter determines whether two lines are part of the same paragraph based on the distance between them. If the distance is less than the margin specified, the lines are considered to be part of the same paragraph. The margin is calculated relative to the height of a line.
boxes_flow
: This parameter determines the importance of horizontal and vertical position when determining the order of text boxes. A value between -1.0 and +1.0 can be set, with -1.0 indicating that only horizontal position matters and +1.0 indicating that only vertical position matters. Setting the value to None will disable advanced layout analysis, and text boxes will be ordered based on the position of their bottom left corner.
detect_vertical
: This parameter determines whether vertical text should be considered during layout analysis.
all_texts
: If layout analysis should be performed on text in figures.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
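Because the margins are relative rather than absolute, a small worked example may help. The functions `same_line` and `needs_space` are hypothetical and give a simplified reading of the char_margin and word_margin descriptions above; they are not pdfminer's layout code.

```python
def same_line(distance: float, char_width: float, char_margin: float = 2.0) -> bool:
    """Two characters share a line when their gap is below char_margin times the character width."""
    return distance < char_margin * char_width


def needs_space(distance: float, char_width: float, word_margin: float = 0.1) -> bool:
    """A space is inserted when the gap exceeds word_margin times the character width."""
    return distance > word_margin * char_width


# For characters 5.0 units wide, the defaults give a line limit of 10.0 and a word limit of 0.5.
print(same_line(8.0, 5.0), same_line(12.0, 5.0))
print(needs_space(8.0, 5.0), needs_space(0.3, 5.0))
```

Under this reading, raising char_margin joins more distant characters into one line, and raising word_margin makes inserted spaces rarer.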
PDFMinerToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts PDF files to Documents.
Arguments:
sources
: List of PDF file paths or ByteStream objects.meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. Ifsources
contains ByteStream objects, theirmeta
will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: Created Documents
Module pptx
PPTXToDocument
Converts PPTX files to Documents.
Usage example:
from datetime import datetime
from haystack.components.converters.pptx import PPTXToDocument
converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'
PPTXToDocument.__init__
def __init__(store_full_path: bool = True)
Create a PPTXToDocument component.
Arguments:
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
PPTXToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts PPTX files to Documents.
Arguments:
sources
: List of file paths or ByteStream objects.meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. Ifsources
contains ByteStream objects, theirmeta
will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: Created Documents
Module pypdf
PyPDFConverter
A protocol that defines a converter which takes a PdfReader object and converts it into a Document object.
This is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component.
PyPDFExtractionMode
The mode to use for extracting text from a PDF.
PyPDFExtractionMode.__str__
def __str__() -> str
Convert a PyPDFExtractionMode enum to a string.
PyPDFExtractionMode.from_str
@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"
Convert a string to a PyPDFExtractionMode enum.
PyPDFToDocument
Converts PDF files to documents your pipeline can query.
This component uses the PyPDF library. You can attach metadata to the resulting documents.
Usage example
from datetime import datetime
from haystack.components.converters.pypdf import PyPDFToDocument
converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
PyPDFToDocument.__init__
def __init__(converter: Optional[PyPDFConverter] = None,
*,
extraction_mode: Union[
str, PyPDFExtractionMode] = PyPDFExtractionMode.PLAIN,
plain_mode_orientations: tuple = (0, 90, 180, 270),
plain_mode_space_width: float = 200.0,
layout_mode_space_vertically: bool = True,
layout_mode_scale_weight: float = 1.25,
layout_mode_strip_rotated: bool = True,
layout_mode_font_height_weight: float = 1.0,
store_full_path: bool = True)
Create a PyPDFToDocument component.
Arguments:
converter
: An instance of a PyPDFConverter compatible class. This is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component.
All the following parameters are applied only if converter is None.
extraction_mode
: The mode to use for extracting text from a PDF. Layout mode is an experimental mode that adheres to the rendered layout of the PDF.
plain_mode_orientations
: Tuple of orientations to look for when extracting text from a PDF in plain mode. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
plain_mode_space_width
: Forces default space width if not extracted from font. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
layout_mode_space_vertically
: Whether to include blank lines inferred from y distance + font height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_scale_weight
: Multiplier for string length when calculating weighted average character width. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_strip_rotated
: Layout mode does not support rotated text. Set to False to include rotated text anyway. If rotated text is discovered, layout will be degraded and a warning will be logged. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_font_height_weight
: Multiplier for font height when calculating blank line height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
PyPDFToDocument.to_dict
def to_dict()
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
PyPDFToDocument.from_dict
@classmethod
def from_dict(cls, data)
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary with serialized data.
Returns:
Deserialized component.
PyPDFToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts PDF files to documents.
Arguments:
sources
: List of file paths or ByteStream objects to convert.
meta
: Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources, as they are zipped together. For ByteStream objects, their meta is added to the output documents.
Returns:
A dictionary with the following keys:
documents
: A list of converted documents.
Module tika
XHTMLParser
Custom parser to extract pages from Tika XHTML content.
XHTMLParser.handle_starttag
def handle_starttag(tag: str, attrs: List[tuple])
Identify the start of a page div.
XHTMLParser.handle_endtag
def handle_endtag(tag: str)
Identify the end of a page div.
XHTMLParser.handle_data
def handle_data(data: str)
Populate the page content.
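The three handlers follow the callback pattern of Python's html.parser.HTMLParser. The sketch below shows that pattern on a hypothetical class, `PageParser`; it assumes that each page is wrapped in a div with class "page" and that page divs contain no nested div elements, which may not hold for every Tika response.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PageParser(HTMLParser):
    """Hypothetical parser that collects the text of each <div class="page">."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False
        self.current = ""
        self.pages: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # A page starts at a div whose class attribute is exactly "page".
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True

    def handle_endtag(self, tag: str) -> None:
        # The closing div ends the current page (nested divs are not handled here).
        if self.in_page and tag == "div":
            self.in_page = False
            self.pages.append(self.current.strip())
            self.current = ""

    def handle_data(self, data: str) -> None:
        # Text is only kept while inside a page div.
        if self.in_page:
            self.current += data


parser = PageParser()
parser.feed(
    '<html><body><div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div></body></html>'
)
pages = parser.pages
print(pages)
```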
TikaDocumentConverter
Converts files of different types to Documents.
This component uses Apache Tika for parsing the files and, therefore, requires a running Tika server. For more options on running Tika, see the official documentation.
Usage example:
from datetime import datetime
from haystack.components.converters.tika import TikaDocumentConverter
converter = TikaDocumentConverter()
results = converter.run(
sources=["sample.docx", "my_document.rtf", "archive.zip"],
meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'
TikaDocumentConverter.__init__
def __init__(tika_url: str = "http://localhost:9998/tika",
store_full_path: bool = True)
Create a TikaDocumentConverter component.
Arguments:
tika_url
: Tika server URL.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
TikaDocumentConverter.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts files to Documents.
Arguments:
sources
: List of file paths or ByteStream objects.
meta
: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
A dictionary with the following keys:
documents
: Created Documents
Module txt
TextFileToDocument
Converts text files to documents your pipeline can query.
By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding. It can attach metadata to the resulting documents.
Usage example
from haystack.components.converters.txt import TextFileToDocument
converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'
TextFileToDocument.__init__
def __init__(encoding: str = "utf-8", store_full_path: bool = True)
Creates a TextFileToDocument component.
Arguments:
encoding
: The encoding of the text files to convert. If the encoding is specified in the metadata of a source ByteStream, it overrides this value.
store_full_path
: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
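The precedence between the component-level encoding and a ByteStream's own metadata can be sketched as follows. The function `decode_source` is hypothetical, and the metadata key name "encoding" is an assumption; the sketch only shows that a per-source value wins over the default.

```python
from typing import Any, Dict


def decode_source(data: bytes, meta: Dict[str, Any], default_encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes, preferring the encoding stored in the source's meta."""
    # Assumed key name: "encoding". Falls back to the component-level default.
    encoding = meta.get("encoding", default_encoding)
    return data.decode(encoding)


latin_bytes = "café".encode("latin-1")
overridden = decode_source(latin_bytes, {"encoding": "latin-1"})
defaulted = decode_source("plain text".encode("utf-8"), {})
print(overridden, defaulted)
```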
TextFileToDocument.run
@component.output_types(documents=List[Document])
def run(sources: List[Union[str, Path, ByteStream]],
meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None)
Converts text files to documents.
Arguments:
sources
: List of text file paths or ByteStream objects to convert.
meta
: Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output documents.
Returns:
A dictionary with the following keys:
documents
: A list of converted documents.