API reference	Astra
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/astra

DataStax Astra DB is a serverless vector database built on Apache Cassandra that supports vector search and auto-scaling. You can deploy it on AWS, GCP, or Azure and expand to one or more regions within those clouds for multi-region availability, low-latency data access, and data sovereignty, and to avoid cloud vendor lock-in. For more information, see the DataStax documentation.

Initialization

Once you have an AstraDB account and have created a database, install the astra-haystack integration:

pip install astra-haystack

From the database's configuration in AstraDB’s web UI, you need the API endpoint and a generated application token.

You also need a collection name and a namespace. When you create the collection, you must also set its embedding dimension and similarity metric. The namespace organizes data within a database; in Apache Cassandra it is called a keyspace.
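These settings correspond to keyword arguments of the document store introduced in the next paragraphs. As a rough sketch, assuming the `collection_name`, `embedding_dimension`, `similarity`, and `namespace` parameters of `AstraDocumentStore` in your installed version, a small wrapper might look like the following. The name `make_store` and the example values in the final comment are hypothetical, and the import sits inside the function only so that the snippet loads where the integration is not installed.

```python
from typing import Optional


def make_store(
    collection_name: str,
    embedding_dimension: int,
    similarity: str = "cosine",
    namespace: Optional[str] = None,
):
    """Create an AstraDocumentStore with explicit collection settings.

    Credentials are still read from ASTRA_DB_API_ENDPOINT and
    ASTRA_DB_APPLICATION_TOKEN in the environment.
    """
    # Lazy import: only needed when a store is actually created.
    from haystack_integrations.document_stores.astra import AstraDocumentStore

    return AstraDocumentStore(
        collection_name=collection_name,
        embedding_dimension=embedding_dimension,
        similarity=similarity,
        namespace=namespace,
    )


# Hypothetical usage (requires valid credentials and network access):
# store = make_store("my_collection", 384, namespace="my_keyspace")
```

The embedding dimension should match the output size of whichever model produces your document and query embeddings.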

Then, in Haystack, initialize an AstraDocumentStore object that’s connected to the AstraDB instance, and write documents to it.

We strongly encourage passing authentication data through environment variables: make sure to populate the environment variables ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN before running the following example.

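from haystack import Document
from haystack_integrations.document_stores.astra import AstraDocumentStore

# Credentials are read from ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN
document_store = AstraDocumentStore()

# Write two documents to the collection
document_store.write_documents([
    Document(content="This is first"),
    Document(content="This is second"),
])

# Print the number of documents now in the store
print(document_store.count_documents())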
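Before running an example like the one above, it can help to confirm that both environment variables are actually set. The following is a minimal sketch using only the standard library. The helper name `missing_astra_vars` and the placeholder endpoint are illustrative assumptions, not part of the astra-haystack integration.

```python
import os
from typing import List, Mapping

# The two variables named in this section.
REQUIRED_VARS = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_astra_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstrate with an explicit mapping so no real credentials are involved.
example_env = {"ASTRA_DB_API_ENDPOINT": "https://example.invalid"}
print(missing_astra_vars(example_env))  # ['ASTRA_DB_APPLICATION_TOKEN']
```

Calling `missing_astra_vars()` with no argument inspects the real process environment; an empty list means both variables are present.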

Supported Retrievers

AstraEmbeddingRetriever: An embedding-based Retriever that fetches documents from the Document Store based on a query embedding provided to the Retriever.
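As a hedged sketch of how such a Retriever is typically called, the helper below runs any retriever that exposes `run(query_embedding=..., top_k=...)` and returns a `{"documents": [...]}` dictionary, which is the interface assumed here for `AstraEmbeddingRetriever`. The helper name `top_contents` is hypothetical, and the commented usage assumes a `document_store` created as in the Initialization section and a 768-dimensional collection.

```python
from typing import Any, List


def top_contents(retriever: Any, query_embedding: List[float], top_k: int = 3) -> List[str]:
    """Run an embedding retriever and return the content of each returned document."""
    result = retriever.run(query_embedding=query_embedding, top_k=top_k)
    return [doc.content for doc in result["documents"]]


# Hypothetical usage (requires the integration, valid credentials, and network access):
# from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever
# retriever = AstraEmbeddingRetriever(document_store=document_store)
# print(top_contents(retriever, query_embedding=[0.1] * 768))
```

In practice the query embedding would come from an embedder component using the same model as the stored documents, so that its length matches the collection's embedding dimension.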

Additional References

🧑‍🍳 Cookbook: Using AstraDB as a data store in your Haystack pipelines