AmazonTextractConverter
AmazonTextractConverter converts images and single-page PDFs to documents using AWS Textract. It supports plain text OCR, structured analysis of tables, forms, signatures, and layout, as well as natural-language queries over the document.
| Most common position in a pipeline | Before PreProcessors, or right at the beginning of an indexing pipeline |
| Mandatory init variables | AWS credentials are resolved via Secret parameters or the default boto3 credential chain (environment variables, AWS config files, IAM roles). |
| Mandatory run variables | sources: A list of file paths or ByteStream objects |
| Output variables | documents: A list of documents; raw_textract_response: A list of raw responses from the Textract API |
| API reference | Amazon Textract |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract |
| Package name | amazon-textract-haystack |
Overview
AmazonTextractConverter takes a list of file paths or ByteStream objects as input and uses AWS Textract to extract text from images and single-page PDFs. Optionally, metadata can be attached to the documents through the meta input parameter. You need an active AWS account with access to the Textract service to use this integration. Refer to the AWS Textract documentation to set up your AWS credentials and ensure Textract is available in your selected region.
Supported input formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
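A pre-flight check can catch unsupported files before any Textract call is made. The following is a minimal sketch that uses only the standard library; `check_source`, the extension set, and the reading of "10 MB" as 10 * 1024 * 1024 bytes applied to every format are assumptions for illustration, not part of the integration. It does not verify that a PDF has a single page.

```python
from pathlib import Path

# Assumption: file extensions mapped from the formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: the 10 MB limit is read as 10 MiB and applied to every format.
MAX_SIZE_BYTES = 10 * 1024 * 1024


def check_source(path):
    """Return a list of problems for a local file; an empty list means none were found."""
    file = Path(path)
    problems = []
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {file.suffix or '(none)'}")
    if not file.is_file():
        problems.append("file does not exist")
    elif file.stat().st_size > MAX_SIZE_BYTES:
        problems.append(f"file is larger than {MAX_SIZE_BYTES} bytes")
    return problems
```

Calling `check_source` on each path before passing the list as `sources` lets you skip or report files that would likely be rejected.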
By default, the component uses the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, AWS_PROFILE) for authentication. You can also pass these as Secret objects at initialization. The component falls back to the default boto3 credential chain if no explicit credentials are provided, which makes it work with IAM roles when running on AWS infrastructure.
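When authentication fails, it can help to see which of these environment variables are actually set in the current process. The sketch below is a small debugging aid, not part of the integration; `aws_env_status` is a hypothetical helper name. It reports only whether each variable is set and never returns the values themselves.

```python
import os

# The standard AWS environment variables named in the paragraph above.
AWS_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
    "AWS_PROFILE",
)


def aws_env_status(environ=None):
    """Map each variable name to True if it is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return {name: bool(env.get(name)) for name in AWS_ENV_VARS}
```

If every entry is False, authentication depends entirely on explicitly passed Secret objects or on the rest of the boto3 credential chain, such as config files or an IAM role.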
Operation modes
The component switches between two Textract APIs depending on how you configure it:
- Plain text OCR (DetectDocumentText) – Used when feature_types is not set. This is the fastest and cheapest option, extracting raw text from the document.
- Structured analysis (AnalyzeDocument) – Used when feature_types is set. You can pass any combination of "TABLES", "FORMS", "SIGNATURES", and "LAYOUT" to extract richer structural information from the document.
Natural-language queries
You can pass a list of natural-language questions through the queries parameter on run(). When queries are provided, the QUERIES feature type is added automatically and Textract returns the extracted answers in the raw response. This is useful for pulling specific fields out of forms, invoices, or receipts without writing custom parsing logic.
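The selection rules described in this section and the previous one can be summarized in a few lines. This is an illustrative sketch of the documented behavior, not the integration's actual code; `select_textract_operation` is a hypothetical name, and treating queries without `feature_types` as an AnalyzeDocument call is inferred from the automatic addition of QUERIES.

```python
# The feature types the text says can be passed to the converter.
ALLOWED_FEATURE_TYPES = {"TABLES", "FORMS", "SIGNATURES", "LAYOUT"}


def select_textract_operation(feature_types=None, queries=None):
    """Return (operation_name, effective_feature_types) for one request."""
    features = list(feature_types or [])
    unknown = [f for f in features if f not in ALLOWED_FEATURE_TYPES]
    if unknown:
        raise ValueError(f"unsupported feature types: {unknown}")
    # Queries imply the QUERIES feature type, which is added automatically.
    if queries and "QUERIES" not in features:
        features.append("QUERIES")
    # Any feature type means structured analysis; otherwise plain text OCR.
    if features:
        return "AnalyzeDocument", features
    return "DetectDocumentText", []
```

For example, calling it with no arguments yields the plain OCR operation, while passing only `queries` yields AnalyzeDocument with `["QUERIES"]`.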
Usage
You need to install the amazon-textract-haystack integration to use AmazonTextractConverter:
pip install amazon-textract-haystack
On its own
Basic usage with plain text OCR:
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(sources=["document.png"])
documents = result["documents"]
Extracting tables and forms with AnalyzeDocument:
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
converter.warm_up()
result = converter.run(sources=["invoice.pdf"])
documents = result["documents"]
raw_responses = result["raw_textract_response"]
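With "FORMS" enabled, the key-value pairs live in the raw response as KEY_VALUE_SET blocks that point to their words through relationships. The following sketch shows one way to flatten them into a dictionary. It assumes each entry of `raw_textract_response` is the dictionary returned by the Textract API, with a top-level "Blocks" list; `extract_form_fields` is a hypothetical helper, not part of the integration, and it ignores checkboxes (SELECTION_ELEMENT blocks).

```python
def _text_of(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of the given block."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
    return " ".join(parts)


def extract_form_fields(response):
    """Map each form key to its value text, based on KEY_VALUE_SET blocks."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    fields = {}
    for block in blocks_by_id.values():
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        # A KEY block points to its VALUE block(s) through a VALUE relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        fields[_text_of(block, blocks_by_id)] = " ".join(value_parts)
    return fields
```

Under the assumption above, `extract_form_fields(raw_responses[0])` would return the fields of the first source as a plain dictionary.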
Using natural-language queries to extract specific fields:
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter()
result = converter.run(
    sources=["receipt.png"],
    queries=["What is the patient name?", "What is the total due?"],
)
documents = result["documents"]
raw_responses = result["raw_textract_response"]
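Since the answers are returned in the raw response, a little parsing is needed to read them. In the Textract response, each question appears as a QUERY block linked to QUERY_RESULT blocks through an ANSWER relationship. The sketch below collects them; it assumes each entry of `raw_textract_response` is the Textract response dictionary with a "Blocks" list, and `extract_query_answers` is a hypothetical helper, not part of the integration.

```python
def extract_query_answers(response):
    """Map each query text to a list of (answer_text, confidence) tuples."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks_by_id.values():
        if block["BlockType"] != "QUERY":
            continue
        question = block["Query"]["Text"]
        # A query with no ANSWER relationship keeps an empty list.
        answers[question] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for answer_id in rel["Ids"]:
                found = blocks_by_id[answer_id]
                answers[question].append(
                    (found.get("Text", ""), found.get("Confidence"))
                )
    return answers
```

Under the assumption above, `extract_query_answers(raw_responses[0])` would return the answers for the first source. An empty list for a question means Textract returned no answer for it.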
Passing AWS credentials explicitly:
from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=Secret.from_env_var("AWS_SECRET_ACCESS_KEY"),
    aws_region_name=Secret.from_token("us-east-1"),
)
result = converter.run(sources=["document.png"])
In a pipeline
Here's an example of an indexing pipeline that uses Textract to extract text from images and writes the resulting documents to a Document Store:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.converters.amazon_textract import (
    AmazonTextractConverter,
)

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", AmazonTextractConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["document.png", "invoice.pdf"]
pipeline.run({"converter": {"sources": file_names}})