AmazonBedrockDocumentEmbedder

This component computes embeddings for Documents using models available through the Amazon Bedrock API.

Name: AmazonBedrockDocumentEmbedder
Path: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_bedrock
Most common position in a pipeline: Before a DocumentWriter in an indexing Pipeline
Mandatory input variables: “documents”: a list of Document objects to be embedded
Output variables: “documents”: a list of Document objects (enriched with embeddings)

Overview

Amazon Bedrock is a fully managed service that makes language models from leading AI startups and Amazon available for your use through a unified API.

Supported models are amazon.titan-embed-text-v1, cohere.embed-english-v3 and cohere.embed-multilingual-v3.

Batch Inference

Note that only Cohere models support batch inference, that is, computing embeddings for multiple Documents in a single request.
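To make the distinction concrete, here is a minimal sketch in plain Python of how texts could be grouped into requests. It does not call Bedrock and is not the component's actual implementation; `plan_requests`, `BATCH_CAPABLE_PREFIXES`, and the batch size used below are hypothetical names and values chosen for illustration.

```python
from typing import List

# Assumption for illustration: model IDs starting with "cohere." accept many
# texts per request, while all other models take one text per request.
BATCH_CAPABLE_PREFIXES = ("cohere.",)


def plan_requests(model: str, texts: List[str], batch_size: int = 32) -> List[List[str]]:
    """Group texts into the lists that would each be sent as one request."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not model.startswith(BATCH_CAPABLE_PREFIXES):
        # No batch inference: one request per text.
        return [[text] for text in texts]
    # Batch inference: slice the texts into chunks of at most batch_size.
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


texts = ["first", "second", "third"]
print(plan_requests("cohere.embed-english-v3", texts, batch_size=2))
# [['first', 'second'], ['third']]
print(plan_requests("amazon.titan-embed-text-v1", texts, batch_size=2))
# [['first'], ['second'], ['third']]
```

With a batch-capable model, three texts fit into two requests here; without batch inference, each text needs its own request, which matters for latency when indexing many Documents.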

This component should be used to embed a list of Documents. To embed a string, you should use the AmazonBedrockTextEmbedder.

Authentication

AmazonBedrockDocumentEmbedder uses AWS for authentication. You can either provide credentials as parameters directly to the component or use the AWS CLI and authenticate through your IAM. For more information on how to set up an IAM identity-based policy, see the official documentation.
To initialize AmazonBedrockDocumentEmbedder and authenticate by providing credentials, provide the model, as well as aws_access_key_id, aws_secret_access_key, and aws_region_name. Other parameters are optional. You can check them out in our API reference.
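Before creating the component, it can help to check which AWS settings are visible in the environment. The sketch below is plain Python and does not touch Haystack or AWS; `missing_aws_env_vars` and `REQUIRED_AWS_ENV_VARS` are hypothetical names, and the variable names are the ones used in the basic usage example on this page. A missing variable is not necessarily an error, since credentials may also come from the AWS CLI configuration.

```python
import os
from typing import List, Mapping

# Environment variable names taken from the basic usage example on this page.
REQUIRED_AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of the listed variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_ENV_VARS if not env.get(name)]


missing = missing_aws_env_vars()
if missing:
    print("Not set in the environment:", ", ".join(missing))
else:
    print("All AWS settings found in the environment.")
```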

Model-specific parameters

Although Haystack provides a unified interface, each model offered by Bedrock can accept its own specific parameters. You can pass these parameters at initialization.

For example, Cohere models support input_type and truncate, as described in the Bedrock documentation.

from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockDocumentEmbedder

# Model-specific parameters are passed as extra keyword arguments.
embedder = AmazonBedrockDocumentEmbedder(
    model="cohere.embed-english-v3",
    input_type="search_document",
    truncate="LEFT",
)

Embedding Metadata

Text Documents often come with a set of metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the text of the Document to improve retrieval.

You can do this easily by using the Document Embedder:

from haystack import Document
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockDocumentEmbedder

doc = Document(content="some text", meta={"title": "relevant title", "page number": 18})

# Only the "title" field is embedded together with the content.
embedder = AmazonBedrockDocumentEmbedder(
    model="cohere.embed-english-v3",
    meta_fields_to_embed=["title"],
)

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
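Conceptually, embedding metadata means building one string from the selected metadata values and the Document content, then embedding that string. The following plain-Python sketch shows the idea only; `text_to_embed`, the newline separator, and the "metadata first, content last" ordering are assumptions made for illustration rather than a description of the component's internals.

```python
from typing import Any, Dict, List


def text_to_embed(
    content: str,
    meta: Dict[str, Any],
    meta_fields_to_embed: List[str],
    separator: str = "\n",
) -> str:
    """Join the selected metadata values and the content into one string."""
    selected = [
        str(meta[key])
        for key in meta_fields_to_embed
        if key in meta and meta[key] is not None
    ]
    return separator.join(selected + [content])


meta = {"title": "relevant title", "page number": 18}
print(repr(text_to_embed("some text", meta, ["title"])))
# 'relevant title\nsome text'
```

Fields that are not listed, such as the page number here, do not influence the embedding, which is why only distinctive and semantically meaningful fields are worth including.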

Usage

Installation

You need to install the amazon-bedrock-haystack package to use the AmazonBedrockDocumentEmbedder:

pip install amazon-bedrock-haystack

On its own

Basic usage:

import os
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockDocumentEmbedder
from haystack.dataclasses import Document

# Set the AWS credentials and region through environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1" # just an example

doc = Document(content="I love pizza!")

embedder = AmazonBedrockDocumentEmbedder(
    model="cohere.embed-english-v3",
    input_type="search_document",
)

# run() returns a dictionary with the embedded Documents under "documents".
result = embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]

In a pipeline

In an indexing pipeline, followed by a query pipeline that retrieves the indexed Documents:

from haystack import Document
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.amazon_bedrock import (
    AmazonBedrockDocumentEmbedder,
    AmazonBedrockTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [Document(content="My name is Wolfgang and I live in Berlin"),
             Document(content="I saw a black horse running"),
             Document(content="Germany has many big cities")]

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", AmazonBedrockDocumentEmbedder(model="cohere.embed-english-v3"))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", AmazonBedrockTextEmbedder(model="cohere.embed-english-v3"))
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who lives in Berlin?"

result = query_pipeline.run({"text_embedder":{"text": query}})

print(result['retriever']['documents'][0])

# Document(id=..., content: 'My name is Wolfgang and I live in Berlin')
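The document store above is configured with a cosine similarity function. As a rough illustration of what that ranking step amounts to, the sketch below scores a few texts against a query using cosine similarity in plain Python. The two-dimensional vectors are made-up toy values, not real Bedrock embeddings, and `cosine_similarity` and `rank` are hypothetical helpers, not Haystack functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(
    query_embedding: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[float, str]]:
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors invented for this example only.
toy_docs = [
    ("My name is Wolfgang and I live in Berlin", [0.9, 0.1]),
    ("I saw a black horse running", [0.0, 1.0]),
    ("Germany has many big cities", [0.6, 0.4]),
]
ranked = rank([1.0, 0.0], toy_docs)
print(ranked[0][1])
# My name is Wolfgang and I live in Berlin
```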

Related Links

Check out the API reference in the GitHub repo or in our docs: