Azure Form Recognizer
haystack_integrations.components.converters.azure_form_recognizer.converter
AzureOCRDocumentConverter
Converts files to documents using Azure's Document Intelligence service.
Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see Azure documentation.
Usage example
python
import os
from datetime import datetime

from haystack_integrations.components.converters.azure_form_recognizer import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(
    endpoint=os.environ["CORE_AZURE_CS_ENDPOINT"],
    api_key=Secret.from_env_var("CORE_AZURE_CS_API_KEY"),
)

results = converter.run(
    sources=["test/test_files/pdf/react_paper.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'
__init__
python
__init__(
    endpoint: str,
    api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
    model_id: str = "prebuilt-read",
    preceding_context_len: int = 3,
    following_context_len: int = 3,
    merge_multiple_column_headers: bool = True,
    page_layout: Literal["natural", "single_column"] = "natural",
    threshold_y: float | None = 0.05,
    store_full_path: bool = False,
) -> None
Creates an AzureOCRDocumentConverter component.
Parameters:
- endpoint (str) – The endpoint of your Azure resource.
- api_key (Secret) – The API key of your Azure resource.
- model_id (str) – The ID of the model you want to use. For a list of available models, see [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
- preceding_context_len (int) – Number of lines before a table to include as preceding context. These lines are added to the metadata.
- following_context_len (int) – Number of lines after a table to include as subsequent context. These lines are added to the metadata.
- merge_multiple_column_headers (bool) – If True, merges multiple column header rows into a single row.
- page_layout (Literal['natural', 'single_column']) – The reading order to follow. Possible options:
  - natural: Uses the natural reading order determined by Azure.
  - single_column: Groups all lines with the same height on the page based on a threshold determined by threshold_y.
- threshold_y (float | None) – Only relevant if page_layout is set to single_column. The threshold, in inches, to determine if two recognized PDF elements are grouped into a single line. This is crucial for section headers or numbers, which may be spatially separated from the remaining text on the horizontal axis.
- store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
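The effect of threshold_y under the single_column layout can be pictured with a small standalone sketch. This is not the converter's actual implementation: the Element type, the group_into_lines helper, and the sample coordinates are hypothetical and exist only for illustration. The sketch assumes each recognized element is reduced to a text plus x and y positions in inches, and it merges elements whose vertical positions differ by at most the threshold.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical stand-in for a recognized text element (coordinates in inches)."""

    text: str
    x: float  # horizontal position of the left edge
    y: float  # vertical position of the top edge


def group_into_lines(elements: list[Element], threshold_y: float = 0.05) -> list[str]:
    """Merge elements whose vertical positions differ by at most threshold_y."""
    rows: list[list[Element]] = []
    for element in sorted(elements, key=lambda e: e.y):
        # Compare against the first element of the current row,
        # so a row cannot slowly drift down the page.
        if rows and abs(element.y - rows[-1][0].y) <= threshold_y:
            rows[-1].append(element)
        else:
            rows.append([element])
    # Within each row, read the elements from left to right.
    return [" ".join(e.text for e in sorted(row, key=lambda e: e.x)) for row in rows]


# A section number that sits far to the left of its header text.
elements = [
    Element("Introduction", x=1.5, y=2.02),
    Element("1", x=0.5, y=2.00),
    Element("Body text starts here.", x=0.5, y=2.40),
]

lines = group_into_lines(elements, threshold_y=0.05)
print(lines)
```

With the default threshold of 0.05 inches, the section number and its header end up on one line. A tighter threshold such as 0.01 would keep them as separate lines in this sketch.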
run
python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, Any]
Convert a list of files to Documents using Azure's Document Intelligence service.
Parameters:
- sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
- meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
dict[str, Any] – A dictionary with the following keys:
- documents: List of created Documents.
- raw_azure_response: List of raw Azure responses used to create the Documents.
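The two accepted shapes of meta can be illustrated with a minimal sketch in plain Python. The expand_meta helper and the file names below are hypothetical and are not part of the component's API. The sketch only mirrors the rules stated for run: a single dictionary is applied to every source, while a list must have one entry per source so the two can be zipped.

```python
from typing import Any, Union


def expand_meta(
    meta: Union[dict[str, Any], list[dict[str, Any]], None],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Return one metadata dictionary per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Copy the dictionary so that every source gets its own instance.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return list(meta)


sources = ["report.pdf", "invoice.png"]

# A single dictionary is shared by all sources.
shared = expand_meta({"date_added": "2024-01-01"}, len(sources))

# A list is paired with the sources position by position.
per_file = expand_meta([{"lang": "en"}, {"lang": "de"}], len(sources))
paired = list(zip(sources, per_file))
print(paired)
```

Passing a list whose length differs from the number of sources raises a ValueError in this sketch, which reflects the length requirement described above.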
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
AzureOCRDocumentConverter – The deserialized component.