AstraDocumentStore
API reference | Astra |
GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/astra |
DataStax Astra DB is a serverless vector database built on Apache Cassandra that supports vector-based search and auto-scaling. You can deploy it on AWS, GCP, or Azure and expand to one or more regions within those clouds for multi-region availability, low-latency data access, and data sovereignty, while avoiding cloud vendor lock-in. For more information, see the DataStax documentation.
Initialization
Once you have an Astra DB account and have created a database, install the astra-haystack integration:
pip install astra-haystack
From the database's configuration in the Astra DB web UI, you need the API endpoint and a generated application token.
You also need a collection name and a namespace. When you create the collection, you set its embedding dimension and similarity metric. The namespace organizes data within a database and corresponds to a keyspace in Apache Cassandra.
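As a minimal sketch of how these settings fit together, the helper below gathers them into keyword arguments and checks the similarity metric before anything touches the database. The parameter names (collection_name, namespace, embedding_dimension, similarity) and the list of accepted metrics are assumptions about the integration's constructor, so check the API reference before relying on them. Both astra_store_kwargs and build_store are hypothetical helpers, not part of astra-haystack.

```python
from typing import Any, Dict, Optional

# Assumed set of similarity metrics; verify against the Astra DB documentation.
ASSUMED_METRICS = ("cosine", "dot_product", "euclidean")


def astra_store_kwargs(
    collection_name: str,
    embedding_dimension: int,
    similarity: str = "cosine",
    namespace: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect collection settings into keyword arguments (hypothetical helper)."""
    if embedding_dimension <= 0:
        raise ValueError("embedding_dimension must be a positive integer")
    if similarity not in ASSUMED_METRICS:
        raise ValueError(f"unknown similarity metric: {similarity!r}")
    kwargs: Dict[str, Any] = {
        "collection_name": collection_name,
        "embedding_dimension": embedding_dimension,
        "similarity": similarity,
    }
    # Pass a namespace only when one is given, so the store's default applies otherwise.
    if namespace is not None:
        kwargs["namespace"] = namespace
    return kwargs


def build_store(**settings: Any):
    """Create the store; the import is lazy so the helper above works without the package."""
    from haystack_integrations.document_stores.astra import AstraDocumentStore

    return AstraDocumentStore(**astra_store_kwargs(**settings))
```

With illustrative values, build_store(collection_name="docs", embedding_dimension=384, namespace="default_keyspace") would connect using the credentials found in the environment. Whatever dimension you choose has to match the model that produces your document embeddings.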
Then, in Haystack, initialize an AstraDocumentStore object that's connected to the Astra DB instance, and write documents to it.
We strongly encourage passing authentication data through environment variables: make sure to populate the environment variables ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN before running the following example.
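from haystack import Document
from haystack_integrations.document_stores.astra import AstraDocumentStore

# With no arguments, the store reads its credentials from the
# ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables.
document_store = AstraDocumentStore()

document_store.write_documents([
    Document(content="This is first"),
    Document(content="This is second"),
])

# Print the number of documents now in the store.
print(document_store.count_documents())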
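Because the example above depends on both environment variables being set, it can help to check for them up front and fail with a readable message. The following is a minimal sketch using only the standard library; missing_astra_vars is a hypothetical helper name and not part of the integration.

```python
import os
from typing import List, Mapping

# The two variables named in this guide.
REQUIRED_VARS = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_astra_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_astra_vars()
if missing:
    print("Set these variables before creating the store:", ", ".join(missing))
```

Accepting the environment as a parameter keeps the check easy to exercise with a plain dictionary, without touching the real process environment.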
Supported Retrievers
AstraEmbeddingRetriever: An embedding-based Retriever that fetches documents from the Document Store based on a query embedding provided to the Retriever.
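To illustrate what retrieving by query embedding means, here is a toy sketch that ranks a few in-memory documents by cosine similarity to a query vector. It is not the integration's code: in practice the comparison runs inside Astra DB using the similarity metric configured for the collection, and the two-dimensional vectors, the third document, and the function names below are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the contents of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, d[1]), reverse=True)
    return [content for content, _ in ranked[:k]]


# Toy documents paired with made-up two-dimensional embeddings.
docs = [
    ("This is first", [1.0, 0.0]),
    ("This is second", [0.0, 1.0]),
    ("This is third", [0.7, 0.7]),
]

print(top_k([0.9, 0.1], docs, k=2))  # ['This is first', 'This is third']
```

In a real pipeline, an embedder component would produce the query embedding, and the Retriever would return the matching documents from the Document Store.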
Additional References
🧑‍🍳 Cookbook: Using AstraDB as a data store in your Haystack pipelines