
S3Downloader

S3Downloader downloads files from AWS S3 buckets to the local filesystem and enriches documents with the local file path.

Most common position in a pipeline: Before File Converters or Routers that need local file paths.
Mandatory init variables: "file_root_path": Path where files will be downloaded. Can be set with the FILE_ROOT_PATH env var.
"aws_access_key_id": AWS access key ID. Can be set with the AWS_ACCESS_KEY_ID env var.
"aws_secret_access_key": AWS secret access key. Can be set with the AWS_SECRET_ACCESS_KEY env var.
"aws_region_name": AWS region name. Can be set with the AWS_DEFAULT_REGION env var.
Mandatory run variables: "documents": A list of documents, each with the name of the file to download in its metadata.
Output variables: "documents": A list of documents enriched with the local file path in meta['file_path'].
API reference: S3Downloader
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_bedrock

Overview

S3Downloader downloads files from AWS S3 buckets to your local filesystem and enriches Document objects with the local file path. This component is useful for pipelines that need to process files stored in S3, such as PDFs, images, or text files.

The component supports AWS authentication through environment variables by default. You can set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION environment variables. Alternatively, you can pass credentials directly at initialization using the Secret API:

from haystack.utils import Secret
from haystack_integrations.components.downloaders.s3 import S3Downloader

downloader = S3Downloader(
    aws_access_key_id=Secret.from_token("<your-access-key-id>"),
    aws_secret_access_key=Secret.from_token("<your-secret-access-key>"),
    aws_region_name=Secret.from_token("<your-region>"),
    file_root_path="/path/to/download/directory"
)
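
Note that Secret.from_token keeps the credential only in memory and cannot be serialized with a pipeline, so for anything beyond local experiments, prefer the environment-variable approach.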

The component downloads multiple files in parallel, controlled by the max_workers parameter (default: 32), to speed up processing of large document sets. Downloaded files are cached locally; when the cache exceeds max_cache_size (default: 100 files), the least recently accessed files are removed automatically. Files that are already present are touched to update their access time instead of being re-downloaded.
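
For example, here is a minimal sketch that tunes both settings (the parameter names come from the description above; the values are only illustrative):

from haystack_integrations.components.downloaders.s3 import S3Downloader

# Reduce parallelism and enlarge the local cache for a large corpus.
downloader = S3Downloader(
    file_root_path="/tmp/s3_downloads",
    max_workers=8,       # parallel downloads (default: 32)
    max_cache_size=500,  # files kept on disk before eviction (default: 100)
)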

πŸ“˜ Required Configuration

The component requires two settings:

  1. The file_root_path parameter or the FILE_ROOT_PATH environment variable: specifies where files are downloaded. The directory is created when warm_up() is called if it doesn't already exist.
  2. The S3_DOWNLOADER_BUCKET environment variable: specifies which S3 bucket to download files from.

The optional S3_DOWNLOADER_PREFIX environment variable prepends a prefix to all generated S3 keys.
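
For instance, both bucket variables can be set from Python before the component is created. The bucket name and prefix below are placeholders, and exactly how the prefix combines with each generated key depends on your bucket layout:

import os

# Hypothetical bucket and prefix; set these before initializing S3Downloader.
os.environ["S3_DOWNLOADER_BUCKET"] = "my-company-documents"
# Assumed behavior: the prefix is prepended to each generated key,
# e.g. "reports/2025/" + "report.pdf"
os.environ["S3_DOWNLOADER_PREFIX"] = "reports/2025/"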

File Extension Filtering

You can use the file_extensions parameter to download only specific file types, reducing unnecessary downloads and processing time. For example, file_extensions=[".pdf", ".txt"] downloads only PDF and TXT files while skipping others.

Custom S3 Key Generation

By default, the component uses the file_name from Document metadata as the S3 key. If your S3 file structure doesn't match the file names in metadata, you can provide an optional s3_key_generation_function to customize how S3 keys are generated from Document metadata.

Usage

S3Downloader ships as part of the Amazon Bedrock integration, so you need to install the amazon-bedrock-haystack package to use it:

pip install amazon-bedrock-haystack

On its own

Before running the examples, ensure you have set the required environment variables:

export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="<your-region>"
export S3_DOWNLOADER_BUCKET="<your-bucket-name>"

Here's how to use S3Downloader to download files from S3:

from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

# Create documents with file names in metadata
documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "data.txt"}),
]

# Initialize the downloader
downloader = S3Downloader(file_root_path="/tmp/s3_downloads")

# Warm up the component
downloader.warm_up()

# Download the files
result = downloader.run(documents=documents)

# Access the downloaded files
for doc in result["documents"]:
    print(f"File downloaded to: {doc.meta['file_path']}")

With file extension filtering:

from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "image.png"}),
    Document(meta={"file_name": "data.txt"}),
]

# Only download PDF files
downloader = S3Downloader(
    file_root_path="/tmp/s3_downloads",
    file_extensions=[".pdf"]
)

downloader.warm_up()

result = downloader.run(documents=documents)

# Only report.pdf is downloaded
print(f"Downloaded {len(result['documents'])} file(s)")
# Output: Downloaded 1 file(s)

With custom S3 key generation:

from haystack.dataclasses import Document
from haystack_integrations.components.downloaders.s3 import S3Downloader

def custom_s3_key_function(document: Document) -> str:
    """Generate S3 key from custom metadata."""
    folder = document.meta.get("folder", "default")
    file_name = document.meta.get("file_name")
    if not file_name:
        raise ValueError("Document must have 'file_name' in metadata")
    return f"{folder}/{file_name}"

documents = [
    Document(meta={"file_name": "report.pdf", "folder": "reports/2025"}),
]

downloader = S3Downloader(
    file_root_path="/tmp/s3_downloads",
    s3_key_generation_function=custom_s3_key_function
)

downloader.warm_up()
result = downloader.run(documents=documents)

In a pipeline

Here's an example of using S3Downloader in a document processing pipeline:

from haystack import Pipeline
from haystack.components.converters import PDFMinerToDocument
from haystack.components.routers import DocumentTypeRouter
from haystack.dataclasses import Document

from haystack_integrations.components.downloaders.s3 import S3Downloader

# Create a pipeline
pipe = Pipeline()

# Add S3Downloader to download files from S3
pipe.add_component(
    "downloader", 
    S3Downloader(
        file_root_path="/tmp/s3_downloads",
        file_extensions=[".pdf", ".txt"]
    )
)

# Route documents by file type
pipe.add_component(
    "router", 
    DocumentTypeRouter(
        file_path_meta_field="file_path",
        mime_types=["application/pdf", "text/plain"]
    )
)

# Convert PDFs to documents
pipe.add_component("pdf_converter", PDFMinerToDocument())

# Connect components
pipe.connect("downloader.documents", "router.documents")
pipe.connect("router.application/pdf", "pdf_converter.documents")

# Create documents with S3 file names
documents = [
    Document(meta={"file_name": "report.pdf"}),
    Document(meta={"file_name": "summary.txt"}),
]

# Run the pipeline
result = pipe.run({"downloader": {"documents": documents}})
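
The pipeline result contains the outputs of the leaf components, so the converted PDFs can be read from the converter's standard "documents" output:

# Inspect the documents produced by the PDF converter
for doc in result["pdf_converter"]["documents"]:
    print(doc.content[:100])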

For a more complex example with image processing and an LLM:

from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.converters.image import DocumentToImageContent
from haystack.components.routers import DocumentTypeRouter
from haystack.dataclasses import Document

from haystack_integrations.components.downloaders.s3 import S3Downloader
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator

# Create documents with file names
documents = [
    Document(meta={"file_name": "chart.png"}),
    Document(meta={"file_name": "report.pdf"}),
]

# Create pipeline
pipe = Pipeline()

# Download files from S3
pipe.add_component(
    "downloader",
    S3Downloader(file_root_path="/tmp/s3_downloads")
)

# Route by document type
pipe.add_component(
    "router",
    DocumentTypeRouter(
        file_path_meta_field="file_path",
        mime_types=["image/png", "application/pdf"]
    )
)

# Convert images for LLM
pipe.add_component("image_converter", DocumentToImageContent(detail="auto"))

# Create chat prompt with template
template = """{% message role="user" %}
Answer the question based on the provided images.

Question: {{ question }}

{% for image in image_contents %}
{{ image | templatize_part }}
{% endfor %}
{% endmessage %}"""

pipe.add_component(
    "prompt_builder",
    ChatPromptBuilder(template=template)
)

# Generate response
pipe.add_component(
    "llm",
    AmazonBedrockChatGenerator(model="anthropic.claude-3-haiku-20240307-v1:0")
)

# Connect components
pipe.connect("downloader.documents", "router.documents")
pipe.connect("router.image/png", "image_converter.documents")
pipe.connect("image_converter.image_contents", "prompt_builder.image_contents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# Run pipeline
result = pipe.run({
    "downloader": {"documents": documents},
    "prompt_builder": {"question": "What information is shown in the chart?"}
})
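
As with other Haystack chat generators, the model's answer can be read from the "replies" output:

# The chat generator returns a list of ChatMessage objects
print(result["llm"]["replies"][0].text)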