This Retriever combines embedding-based retrieval and BM25 text search to find matching documents in the search index and return more relevant results.


Most common position in a pipeline	1. After a TextEmbedder and before a `PromptBuilder` in a RAG pipeline; 2. The last component in a hybrid search pipeline; 3. After a TextEmbedder and before an `ExtractiveReader` in an extractive QA pipeline
Mandatory init variables	`document_store`: An instance of `AzureAISearchDocumentStore`
Mandatory run variables	`query`: A string; `query_embedding`: A list of floats
Output variables	`documents`: A list of documents matching the query
API reference	Azure AI Search
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_ai_search

Overview

The AzureAISearchHybridRetriever combines vector retrieval and BM25 text search to fetch relevant documents from the AzureAISearchDocumentStore. It processes both textual (keyword) queries and query embeddings in a single request, executing all subqueries in parallel. The results are merged and reordered using Reciprocal Rank Fusion (RRF) to create a unified result set.
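
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the service's implementation: the fusion itself happens inside Azure AI Search, and the function name, the document IDs, and the constant `k = 60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Documents are returned by descending total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the keyword (BM25) and vector subqueries.
bm25_ranking = ["doc_languages", "doc_elephants", "doc_waves"]
vector_ranking = ["doc_waves", "doc_languages"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc_languages', 'doc_waves', 'doc_elephants']
```

Documents that rank well in both lists accumulate the highest score, which is why `doc_waves` ends up ahead of `doc_elephants` even though the keyword ranking alone placed it last.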

Besides the query and query_embedding, the AzureAISearchHybridRetriever accepts optional parameters such as top_k (the maximum number of documents to retrieve) and filters to refine the search. Additional keyword arguments can also be passed during initialization for further customization.
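
As a rough illustration of how these optional parameters fit together, the sketch below assembles the arguments for a single retrieval call. The filter uses the general Haystack filter syntax; the field name `meta.language` and the 768-dimensional placeholder embedding are assumptions, and which fields can be filtered depends on how your index is defined.

```python
from typing import Any, Dict, List

# Assumption: the index stores 768-dimensional vectors; adjust to your index.
embedding_dim = 768
query = "How many languages are spoken around the world today?"
query_embedding: List[float] = [0.1] * embedding_dim  # placeholder, not a real embedding

# Haystack-style filter. "meta.language" is a hypothetical field used for illustration.
language_filter: Dict[str, Any] = {"field": "meta.language", "operator": "==", "value": "en"}

run_kwargs: Dict[str, Any] = {
    "query": query,
    "query_embedding": query_embedding,
    "top_k": 2,  # return at most two documents
    "filters": language_filter,
}

# With a configured retriever, the call would look like:
# result = retriever.run(**run_kwargs)
print(sorted(run_kwargs))
```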

If your search index includes a semantic configuration, you can enable semantic ranking to apply it to the Retriever's results. For more details, refer to the Azure AI documentation.

For purely keyword-based retrieval, you can use AzureAISearchBM25Retriever, and for embedding-based retrieval, AzureAISearchEmbeddingRetriever is available.

Usage

Installation

This integration requires you to have an active Azure subscription with a deployed Azure AI Search service.

To start using Azure AI Search with Haystack, install the package with:

pip install azure-ai-search-haystack

On its own

This Retriever needs an AzureAISearchDocumentStore that contains indexed documents to run.

from haystack import Document
from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

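# Assumes the service endpoint and API key are already configured, for example through
# the AZURE_AI_SEARCH_ENDPOINT and AZURE_AI_SEARCH_API_KEY environment variables
document_store = AzureAISearchDocumentStore(index_name="haystack_docs")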
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = AzureAISearchHybridRetriever(document_store=document_store)
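# Fake embedding to keep the example simple; its length must match the index's
# embedding dimension (assumed to be 768 here)
result = retriever.run(query="How many languages are spoken around the world today?", query_embedding=[0.1] * 768)
print(result["documents"])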

In a RAG pipeline

The following example demonstrates using the AzureAISearchHybridRetriever in a pipeline. An indexing pipeline is responsible for indexing and storing documents with embeddings in the AzureAISearchDocumentStore, while the query pipeline uses hybrid retrieval to fetch relevant documents based on a given query.

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.writers import DocumentWriter

from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

document_store = AzureAISearchDocumentStore(index_name="hybrid-retrieval-example")

model = "sentence-transformers/all-mpnet-base-v2"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a "
        "high level of self-awareness, such as recognizing themselves in mirrors."
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and "
        "San Diego, you can witness the phenomenon of bioluminescent waves."
    ),
]

document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=document_embedder, name="doc_embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="doc_writer")
indexing_pipeline.connect("doc_embedder", "doc_writer")

indexing_pipeline.run({"doc_embedder": {"documents": documents}})

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model))
query_pipeline.add_component("retriever", AzureAISearchHybridRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}, "retriever": {"query": query}})

print(result["retriever"]["documents"][0])
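
In a full RAG pipeline, the retrieved documents are typically passed on to a `PromptBuilder` and then to a generator. The sketch below shows the equivalent prompt-assembly step in plain Python so it can run without any services. The `RetrievedDoc` class is a hypothetical stand-in for a retrieved document, and the prompt wording is only an example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a retrieved document with content and a score."""

    content: str
    score: Optional[float] = None


def build_prompt(question: str, documents: List[RetrievedDoc]) -> str:
    """Join the retrieved contents into a context block and append the question."""
    context = "\n".join(f"- {doc.content}" for doc in documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-in for the documents returned by the retriever.
retrieved = [
    RetrievedDoc(content="There are over 7,000 languages spoken around the world today."),
]
prompt = build_prompt("How many languages are there?", retrieved)
print(prompt)
```

In a Haystack pipeline, the same step would be handled by connecting the retriever's `documents` output to a `PromptBuilder` whose template iterates over the documents.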