
AstraEmbeddingRetriever

This is an embedding-based Retriever compatible with the Astra Document Store.

Most common position in a pipeline:
1. After a Text Embedder and before a PromptBuilder in a RAG pipeline
2. The last component in the semantic search pipeline
3. After a Text Embedder and before an ExtractiveReader in an extractive QA pipeline
Mandatory init variables: "document_store": An instance of AstraDocumentStore
Mandatory run variables: "query_embedding": A list of floats
Output variables: "documents": A list of documents
API reference: Astra
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/astra

Overview

AstraEmbeddingRetriever compares the query embedding with the document embeddings and fetches the documents most similar to the query from the AstraDocumentStore.
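As a rough illustration of what embedding-based retrieval means, the following sketch ranks a few toy vectors by cosine similarity in plain Python. It is not the integration's implementation: the real comparison runs inside Astra DB using the similarity metric configured for the collection, and the function names and 3-dimensional vectors here are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, doc_embeddings, top_k=2):
    """Return (doc_id, score) pairs, most similar first, cut off at top_k."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings; real embedding models produce far larger vectors.
toy_docs = {
    "languages": [0.9, 0.1, 0.0],
    "elephants": [0.1, 0.9, 0.0],
    "waves": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs, top_k=2)
print([doc_id for doc_id, _ in ranking])
```

The document whose vector points in nearly the same direction as the query vector ranks first, which is the behavior the Retriever relies on.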

When using the AstraEmbeddingRetriever in your NLP system, make sure it has the query and document embeddings available. You can do so by adding a Document Embedder to your indexing pipeline and a Text Embedder to your query pipeline.

In addition to the query_embedding, the AstraEmbeddingRetriever accepts other optional parameters, including top_k (the maximum number of documents to retrieve) and filters to narrow down the search space.
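One way to supply these optional parameters at query time is through the inputs passed to the pipeline, sketched below. The helper build_run_inputs and the metadata field "language" are hypothetical, the component names "text_embedder" and "retriever" are assumed to match the names used when the pipeline was built, and the filter uses Haystack's field/operator/value format; which operators are supported depends on the Document Store.

```python
def build_run_inputs(query, top_k=None, filters=None):
    """Hypothetical helper: assemble the inputs dictionary for Pipeline.run()."""
    retriever_inputs = {}
    if top_k is not None:
        retriever_inputs["top_k"] = top_k
    if filters is not None:
        retriever_inputs["filters"] = filters

    inputs = {"text_embedder": {"text": query}}
    # Only add a "retriever" entry when there is something to override.
    if retriever_inputs:
        inputs["retriever"] = retriever_inputs
    return inputs


# Assumes the documents were written with a "language" metadata field.
language_filter = {"field": "meta.language", "operator": "==", "value": "en"}

run_inputs = build_run_inputs(
    "How many languages are there?", top_k=2, filters=language_filter
)
print(run_inputs)

# With a pipeline that has these two components, the call would be:
# result = query_pipeline.run(run_inputs)
```

Values passed this way apply to a single run; defaults for every run can instead be set when the Retriever is created.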

Setup and installation

Once you have an AstraDB account and have created a database, install the astra-haystack integration:

pip install astra-haystack

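From the configuration in AstraDB's web UI, you need the database's API endpoint and a generated application token. These are the values for the ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables.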

You will additionally need a collection name and a namespace. When you create the collection, you also need to set the embedding dimensions and the similarity metric. The namespace organizes data in a database and is called a keyspace in Apache Cassandra.
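These settings could be collected in one place and handed to the Document Store, as in the sketch below. The keyword names (collection_name, embedding_dimension, similarity, namespace) are assumptions about the AstraDocumentStore constructor, so check them against the API reference for your installed version. The collection and namespace values are placeholders, and 768 is chosen to match the embedding model used in the usage example.

```python
# Placeholder values; replace them with the ones from your own database.
ASTRA_SETTINGS = {
    "collection_name": "haystack_docs",  # placeholder collection name
    "embedding_dimension": 768,          # must match the embedding model's output size
    "similarity": "cosine",              # similarity metric chosen for the collection
    "namespace": "default_keyspace",     # placeholder namespace (keyspace)
}


def make_document_store(settings=ASTRA_SETTINGS):
    """Create the store from the settings above (parameter names are assumptions)."""
    # Imported lazily so this snippet loads even where astra-haystack is not installed.
    from haystack_integrations.document_stores.astra import AstraDocumentStore

    return AstraDocumentStore(**settings)
```

The embedding dimension has to agree with the model that produces the embeddings, otherwise writes and queries against the collection are likely to fail.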

Optionally, to run the example below, also install sentence-transformers:

pip install sentence-transformers

Usage

We strongly encourage passing authentication data through environment variables: make sure to populate the environment variables ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN before running the following example.
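A small check like the following can catch a missing variable before any connection attempt. It uses only the standard library; the helper name missing_astra_env_vars is made up for this sketch.

```python
import os

REQUIRED_ENV_VARS = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_astra_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_astra_env_vars()
if missing:
    print("Set these before creating the AstraDocumentStore: " + ", ".join(missing))
```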

In a pipeline

Use this Retriever in a query pipeline like this:

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever
from haystack_integrations.document_stores.astra import AstraDocumentStore


# Credentials are read from ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN
document_store = AstraDocumentStore()

model = "sentence-transformers/all-mpnet-base-v2"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

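# Embed the documents with the same model that will embed the query
document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

# Write the embedded documents, skipping any that are already in the store
document_store.write_documents(documents_with_embeddings["documents"], policy=DuplicatePolicy.SKIP)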

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model))
query_pipeline.add_component("retriever", AstraEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result['retriever']['documents'][0])

The example output would be:

Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.8929937, embedding: vector of size 768)

Additional References

🧑‍🍳 Cookbook: Using AstraDB as a data store in your Haystack pipelines