ArangoEmbeddingRetriever
An embedding-based Retriever compatible with the ArangoDB Document Store.
| | |
| --- | --- |
| Most common position in a pipeline | 1. After a Text Embedder and before a PromptBuilder in a RAG pipeline 2. The last component in a semantic search pipeline |
| Mandatory init variables | document_store: An instance of an ArangoDocumentStore |
| Mandatory run variables | query_embedding: A vector representing the query (a list of floats) |
| Output variables | documents: A list of documents |
| API reference | ArangoDB |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/arangodb |
| Package name | arangodb-haystack |
Overview
The ArangoEmbeddingRetriever retrieves documents from an ArangoDocumentStore using ArangoDB's AQL vector functions. It compares the query embedding with document embeddings and returns the most similar documents.
In addition to query_embedding, the retriever accepts optional filters to narrow the search space and top_k to limit the number of results. Both can be set at initialization and overridden per call to run().
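The interplay between initialization-time defaults and per-call overrides can be sketched in plain Python. This is a minimal illustration, not the component's actual code: SketchRetriever and its resolve method are hypothetical stand-ins, the field name meta.topic is an invented example, and the sketch assumes a run-time value simply replaces the default. The filter dictionary follows the standard Haystack 2.x filter syntax.

```python
from typing import Any, Optional

# A comparison filter in Haystack 2.x syntax.
# "meta.topic" is a hypothetical metadata field used only for illustration.
filters = {"field": "meta.topic", "operator": "==", "value": "animals"}


class SketchRetriever:
    """Hypothetical stand-in, NOT the real ArangoEmbeddingRetriever."""

    def __init__(self, filters: Optional[dict] = None, top_k: int = 10):
        # Values given here act as defaults for every call.
        self.filters = filters
        self.top_k = top_k

    def resolve(
        self, filters: Optional[dict] = None, top_k: Optional[int] = None
    ) -> dict:
        # Assumption: a run-time value, when given, replaces the default.
        return {
            "filters": filters if filters is not None else self.filters,
            "top_k": top_k if top_k is not None else self.top_k,
        }


retriever = SketchRetriever(top_k=5)
defaults: dict = retriever.resolve()
overridden: dict = retriever.resolve(filters=filters, top_k=1)

print(defaults)
print(overridden)
```

With the real component, the corresponding call would pass the same arguments to run(), for example retriever.run(query_embedding=..., filters=filters, top_k=1).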
The embedding dimension and similarity function (cosine, dot_product, or l2) are configured on the ArangoDocumentStore at initialization time.
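For intuition about how the three similarity functions rank documents, the following plain-Python sketch computes each one for the two example embeddings used in the usage section. It is only an illustration: the helper functions are hypothetical and not part of arangodb-haystack, and the actual scoring happens inside ArangoDB.

```python
import math


def dot_product(a, b):
    # Sum of element-wise products; larger means more similar.
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product of the two vectors after scaling both to unit length.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def l2(a, b):
    # Euclidean distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0.1, 0.2, 0.3]
candidates = {
    "languages": [0.1, 0.2, 0.3],
    "elephants": [0.8, 0.1, 0.5],
}

# Print cosine, dot product, and L2 distance for each candidate.
for name, embedding in candidates.items():
    print(
        name,
        round(cosine(query, embedding), 3),
        round(dot_product(query, embedding), 3),
        round(l2(query, embedding), 3),
    )
```

In this sketch, cosine and l2 both rank the identical vector first, while the raw dot product scores the longer "elephants" vector higher. This is why dot_product is usually paired with normalized embeddings.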
Installation
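Install the arangodb-haystack package, for example with pip install arangodb-haystack. Then make sure ArangoDB 3.12 or later is running with the vector index enabled, for example via Docker: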
```shell
docker run -d -p 8529:8529 \
  -e ARANGO_ROOT_PASSWORD=test-password \
  arangodb:3.12 arangod --vector-index
```
Usage
On its own
```python
from haystack import Document
from haystack_integrations.document_stores.arangodb import ArangoDocumentStore
from haystack_integrations.components.retrievers.arangodb import (
    ArangoEmbeddingRetriever,
)

document_store = ArangoDocumentStore(
    host="http://localhost:8529",
    embedding_dimension=3,
    recreate_collection=True,
)

document_store.write_documents(
    [
        Document(
            content="There are over 7,000 languages spoken around the world today.",
            embedding=[0.1, 0.2, 0.3],
        ),
        Document(
            content="Elephants have been observed to recognize themselves in mirrors.",
            embedding=[0.8, 0.1, 0.5],
        ),
    ],
)

retriever = ArangoEmbeddingRetriever(document_store=document_store, top_k=1)
result = retriever.run(query_embedding=[0.1, 0.2, 0.3])
print(result["documents"][0].content)
```
In a pipeline
```python
from haystack import Document, Pipeline
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack_integrations.document_stores.arangodb import ArangoDocumentStore
from haystack_integrations.components.retrievers.arangodb import (
    ArangoEmbeddingRetriever,
)

# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
document_store = ArangoDocumentStore(
    host="http://localhost:8529",
    embedding_dimension=384,
    recreate_collection=True,
)

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to recognize themselves in mirrors.",
    ),
    Document(
        content="Bioluminescent waves can be seen in the Maldives and Puerto Rico.",
    ),
]

# Embed the documents and write them to the store.
document_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)
# Load the model explicitly; Haystack versions that do not load it
# automatically raise an error when run() is called without this.
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(
    documents_with_embeddings["documents"],
    policy=DuplicatePolicy.OVERWRITE,
)

# Build the query pipeline: embed the query text, then retrieve.
query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
query_pipeline.add_component(
    "retriever",
    ArangoEmbeddingRetriever(document_store=document_store, top_k=3),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run(
    {"text_embedder": {"text": "How many languages are there?"}},
)
print(result["retriever"]["documents"][0].content)
```