Version: 2.29

AmazonTextractConverter

AmazonTextractConverter converts images and single-page PDFs to documents using AWS Textract. It supports plain text OCR, structured analysis of tables, forms, signatures, and layout, as well as natural-language queries over the document.

Most common position in a pipeline: Before PreProcessors, or right at the beginning of an indexing pipeline
Mandatory init variables: AWS credentials are resolved via Secret parameters or the default boto3 credential chain (environment variables, AWS config files, IAM roles).
Mandatory run variables: sources: A list of file paths or ByteStream objects
Output variables: documents: A list of documents; raw_textract_response: A list of raw responses from the Textract API
API reference: Amazon Textract
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract
Package name: amazon-textract-haystack

Overview

AmazonTextractConverter takes a list of file paths or ByteStream objects as input and uses AWS Textract to extract text from images and single-page PDFs. Optionally, metadata can be attached to the documents through the meta input parameter. You need an active AWS account with access to the Textract service to use this integration. Refer to the AWS Textract documentation to set up your AWS credentials and ensure Textract is available in your selected region.

Supported input formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
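
A pre-flight check can catch unsupported files before any Textract call is made. The following is a minimal sketch, not part of the integration: split_supported, SUPPORTED_SUFFIXES, and MAX_BYTES are hypothetical names, the extension set is inferred from the format list above, and applying the 10 MB limit to every format (not only PDFs) is an assumption. It does not check the page count of PDFs.

```python
from pathlib import Path

# Assumption: extensions inferred from the supported-format list above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: the 10 MB limit is applied to every format, not only PDFs.
MAX_BYTES = 10 * 1024 * 1024


def split_supported(paths):
    """Return (accepted, rejected) path lists based on suffix and file size."""
    accepted, rejected = [], []
    for raw in paths:
        path = Path(raw)
        ok = (
            path.suffix.lower() in SUPPORTED_SUFFIXES
            and path.is_file()
            and path.stat().st_size <= MAX_BYTES
        )
        (accepted if ok else rejected).append(str(path))
    return accepted, rejected
```

The accepted list can then be passed as sources, while the rejected list can be logged or routed to another converter.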

By default, the component uses the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, AWS_PROFILE) for authentication. You can also pass these as Secret objects at initialization. The component falls back to the default boto3 credential chain if no explicit credentials are provided, which makes it work with IAM roles when running on AWS infrastructure.

Operation modes

The component switches between two Textract APIs depending on how you configure it:

  • Plain text OCR (DetectDocumentText) – Used when feature_types is not set. This is the fastest and cheapest option, extracting raw text from the document.
  • Structured analysis (AnalyzeDocument) – Used when feature_types is set. You can pass any combination of "TABLES", "FORMS", "SIGNATURES", and "LAYOUT" to extract richer structural information from the document.
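
The switching rule above can be sketched in a few lines. This is an illustration of the documented behavior, not the integration's actual code; choose_textract_operation is a hypothetical helper, and the handling of queries follows the description in the "Natural-language queries" section.

```python
def choose_textract_operation(feature_types=None, queries=None):
    """Return (operation_name, effective_feature_types) for a request.

    Hypothetical helper that mirrors the documented rule: no feature
    types means DetectDocumentText, anything else means AnalyzeDocument.
    """
    features = list(feature_types or [])
    # Providing queries implies the QUERIES feature type.
    if queries and "QUERIES" not in features:
        features.append("QUERIES")
    if features:
        return "AnalyzeDocument", features
    return "DetectDocumentText", []


print(choose_textract_operation())
print(choose_textract_operation(["TABLES", "FORMS"]))
print(choose_textract_operation(queries=["What is the total due?"]))
```

Because AnalyzeDocument is the richer of the two calls, leaving feature_types unset is the simplest way to stay on the plain OCR path.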

Natural-language queries

You can pass a list of natural-language questions through the queries parameter on run(). When queries are provided, the QUERIES feature type is added automatically and Textract returns the extracted answers in the raw response. This is useful for pulling specific fields out of forms, invoices, or receipts without writing custom parsing logic.
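
Since the answers arrive in the raw response, some post-processing is needed to read them. Below is a minimal sketch, assuming each entry of raw_textract_response is a dict shaped like a Textract API response, where QUERY blocks point to QUERY_RESULT blocks through ANSWER relationships. extract_query_answers is a hypothetical helper, not part of the integration, and the sample response is hand-written for illustration.

```python
def extract_query_answers(response):
    """Map each query text to (answer_text, confidence), or None if unanswered."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        # Collect the ids of all QUERY_RESULT blocks linked to this query.
        answer_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for block_id in rel["Ids"]
        ]
        results = [by_id[i] for i in answer_ids if i in by_id]
        if results:
            # Keep the most confident answer when several are returned.
            best = max(results, key=lambda r: r.get("Confidence", 0.0))
            answers[question] = (best.get("Text", ""), best.get("Confidence"))
        else:
            answers[question] = None
    return answers


# Hand-written sample in the assumed response shape (not real Textract output).
sample = {
    "Blocks": [
        {
            "Id": "q1",
            "BlockType": "QUERY",
            "Query": {"Text": "What is the total due?"},
            "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
        },
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "$42.00", "Confidence": 98.5},
        {"Id": "q2", "BlockType": "QUERY", "Query": {"Text": "What is the patient name?"}},
    ]
}
print(extract_query_answers(sample))
```

Queries without an ANSWER relationship map to None, which makes missing fields easy to detect downstream.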

Usage

You need to install the amazon-textract-haystack integration to use AmazonTextractConverter:

shell
pip install amazon-textract-haystack

On its own

Basic usage with plain text OCR:

python
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(sources=["document.png"])
documents = result["documents"]

Extracting tables and forms with AnalyzeDocument:

python
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
converter.warm_up()
result = converter.run(sources=["invoice.pdf"])
documents = result["documents"]
raw_responses = result["raw_textract_response"]
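
The raw responses carry Textract's block structure, which can be turned back into rows and columns. The following is a minimal sketch, assuming each entry of raw_textract_response is a dict shaped like a Textract AnalyzeDocument response, with TABLE, CELL, and WORD blocks linked by CHILD relationships and 1-based RowIndex and ColumnIndex values on cells. tables_from_response is a hypothetical helper, not part of the integration; it ignores merged cells and selection elements, and the sample is hand-written.

```python
def tables_from_response(response):
    """Rebuild every TABLE block as a list of rows of cell text."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Flatten the ids of all CHILD relationships of a block.
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
        ]

    def cell_text(cell):
        children = [by_id[i] for i in child_ids(cell) if i in by_id]
        return " ".join(c["Text"] for c in children if c.get("BlockType") == "WORD")

    tables = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells = [
            by_id[i]
            for i in child_ids(block)
            if i in by_id and by_id[i].get("BlockType") == "CELL"
        ]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            # Textract indices start at 1, Python lists at 0.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = cell_text(cell)
        tables.append(grid)
    return tables


# Hand-written sample in the assumed response shape (not real Textract output).
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Total"},
        {"Id": "w2", "BlockType": "WORD", "Text": "42.00"},
        {"Id": "w3", "BlockType": "WORD", "Text": "USD"},
    ]
}
print(tables_from_response(sample_response))
```

With real output, the same function could be applied to each item of raw_responses in turn.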

Using natural-language queries to extract specific fields:

python
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(
    sources=["receipt.png"],
    queries=["What is the patient name?", "What is the total due?"],
)
documents = result["documents"]
raw_responses = result["raw_textract_response"]

Passing AWS credentials explicitly:

python
from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=Secret.from_env_var("AWS_SECRET_ACCESS_KEY"),
    aws_region_name=Secret.from_token("us-east-1"),
)
result = converter.run(sources=["document.png"])

In a pipeline

Here's an example of an indexing pipeline that uses Textract to extract text from images and writes the resulting documents to a Document Store:

python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", AmazonTextractConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["document.png", "invoice.pdf"]
pipeline.run({"converter": {"sources": file_names}})