Version: 3.0

AmazonTextractConverter

AmazonTextractConverter converts images and single-page PDFs to documents using AWS Textract. It supports plain text OCR, structured analysis of tables, forms, signatures, and layout, as well as natural-language queries over the document.


Most common position in a pipeline	Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory init variables	None. AWS credentials are resolved from `Secret` parameters or, if none are passed, from the default boto3 credential chain (environment variables, AWS config files, IAM roles).
Mandatory run variables	`sources`: A list of file paths or `ByteStream` objects
Output variables	`documents`: A list of documents; `raw_textract_response`: A list of raw responses from the Textract API
API reference	Amazon Textract
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract
Package name	`amazon-textract-haystack`

Overview

AmazonTextractConverter takes a list of file paths or ByteStream objects as input and uses AWS Textract to extract text from images and single-page PDFs. To attach metadata to the resulting documents, pass it through the optional meta input parameter. This integration requires an active AWS account with access to the Textract service. Refer to the AWS Textract documentation to set up your AWS credentials and to confirm that Textract is available in your selected region.

Supported input formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
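
A local pre-flight check can catch unsupported files before any request is sent. The sketch below is one way to apply the limits listed above. `is_supported_source` is a hypothetical helper, not part of the integration. The extension spellings, and the reading of "10 MB" as 10 MiB applied to every format, are assumptions. It does not check the PDF page count.

```python
from pathlib import Path

# Formats taken from the list above; the extension spellings are an assumption.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: "10 MB" is read as 10 MiB and applied to every format.
MAX_BYTES = 10 * 1024 * 1024


def is_supported_source(path: str) -> bool:
    """Return True if the file exists, has a listed extension, and fits the size limit."""
    p = Path(path)
    if not p.is_file():
        return False
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        return False
    return p.stat().st_size <= MAX_BYTES
```

Filtering the input with this helper before calling run() keeps obviously invalid files out of the request, for example `sources = [s for s in candidates if is_supported_source(s)]`.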

By default, the component uses the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, AWS_PROFILE) for authentication. You can also pass these as Secret objects at initialization. The component falls back to the default boto3 credential chain if no explicit credentials are provided, which makes it work with IAM roles when running on AWS infrastructure.

Operation modes

The component switches between two Textract APIs depending on how you configure it:

Plain text OCR (DetectDocumentText) – Used when feature_types is not set. This is the fastest and cheapest option, extracting raw text from the document.
Structured analysis (AnalyzeDocument) – Used when feature_types is set. You can pass any combination of "TABLES", "FORMS", "SIGNATURES", and "LAYOUT" to extract richer structural information from the document.
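
The selection rule above can be sketched as a small pure function. `choose_textract_api` is a hypothetical helper written for illustration and is not how the integration is necessarily implemented. It also assumes that passing queries routes the call to AnalyzeDocument, since queries add the QUERIES feature type.

```python
from typing import List, Optional, Tuple


def choose_textract_api(
    feature_types: Optional[List[str]] = None,
    queries: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    """Return the Textract operation name and the effective feature types."""
    features = list(feature_types or [])
    # Assumption: non-empty queries implicitly add the QUERIES feature type.
    if queries and "QUERIES" not in features:
        features.append("QUERIES")
    # No feature types at all means plain text OCR.
    operation = "AnalyzeDocument" if features else "DetectDocumentText"
    return operation, features
```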

Natural-language queries

You can pass a list of natural-language questions through the queries parameter on run(). When queries are provided, the QUERIES feature type is added automatically and Textract returns the extracted answers in the raw response. This is useful for pulling specific fields out of forms, invoices, or receipts without writing custom parsing logic.
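
Because the answers arrive in the raw response, reading them takes a little traversal. The sketch below assumes each entry of raw_textract_response is an AnalyzeDocument response dictionary in which QUERY blocks point to QUERY_RESULT blocks through an ANSWER relationship. `extract_query_answers` is a hypothetical helper, not part of the integration.

```python
from typing import Any, Dict


def extract_query_answers(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each query text to its first answer text in one response."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, str] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block.get("Query", {}).get("Text", "")
        # A query without an ANSWER relationship has no extracted answer.
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                result = by_id.get(answer_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    # Keep only the first answer found for each question.
                    answers.setdefault(question, result.get("Text", ""))
    return answers
```

Queries that Textract could not answer are simply absent from the returned dictionary, so callers can use `answers.get(question)` to handle missing fields.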

Usage

You need to install the amazon-textract-haystack integration to use AmazonTextractConverter:

shell

pip install amazon-textract-haystack

On its own

Basic usage with plain text OCR:

python

from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(sources=["document.png"])
documents = result["documents"]

Extracting tables and forms with AnalyzeDocument:

python

from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
result = converter.run(sources=["invoice.pdf"])
documents = result["documents"]
raw_responses = result["raw_textract_response"]
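
To turn the raw response into rows and columns, one option is to walk the TABLE, CELL, and WORD blocks. The sketch below assumes each entry of raw_responses is an AnalyzeDocument response dictionary with 1-based RowIndex and ColumnIndex on CELL blocks. `extract_tables` is a hypothetical helper, not part of the integration, and it ignores merged cells.

```python
from typing import Any, Dict, List


def extract_tables(response: Dict[str, Any]) -> List[List[List[str]]]:
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: Dict[str, Any]) -> List[str]:
        # Collect the ids referenced by CHILD relationships, in order.
        ids: List[str] = []
        for rel in block.get("Relationships", []):
            if rel.get("Type") == "CHILD":
                ids.extend(rel.get("Ids", []))
        return ids

    def cell_text(cell: Dict[str, Any]) -> str:
        # Join the WORD children of a cell with single spaces.
        words = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(w.get("Text", "") for w in words if w.get("BlockType") == "WORD")

    tables: List[List[List[str]]] = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells = [
            by_id[i]
            for i in child_ids(block)
            if i in by_id and by_id[i].get("BlockType") == "CELL"
        ]
        if not cells:
            tables.append([])
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append(grid)
    return tables
```

With the result from the example above, `tables = [extract_tables(r) for r in raw_responses]` would give one list of tables per source file.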

Using natural-language queries to extract specific fields:

python

from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(
    sources=["receipt.png"],
    queries=["What is the patient name?", "What is the total due?"],
)
documents = result["documents"]
raw_responses = result["raw_textract_response"]

Passing AWS credentials explicitly:

python

from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=Secret.from_env_var("AWS_SECRET_ACCESS_KEY"),
    aws_region_name=Secret.from_token("us-east-1"),
)
result = converter.run(sources=["document.png"])

In a pipeline

Here's an example of an indexing pipeline that uses Textract to extract text from images and writes the resulting documents to a Document Store:

python

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", AmazonTextractConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["document.png", "invoice.pdf"]
pipeline.run({"converter": {"sources": file_names}})
