Version: 3.0

InMemoryBM25Retriever

A keyword-based Retriever compatible with InMemoryDocumentStore.


Most common position in a pipeline	In query pipelines: in a RAG pipeline, before a `PromptBuilder`; in a semantic search pipeline, as the last component; in an extractive QA pipeline, before an `ExtractiveReader`
Mandatory init variables	`document_store`: An instance of InMemoryDocumentStore
Mandatory run variables	`query`: A query string
Output variables	`documents`: A list of documents (matching the query)
API reference	Retrievers
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py
Package name	`haystack-ai`

Overview

InMemoryBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
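To make "weighted word overlap" concrete, here is a minimal, dependency-free sketch of textbook Okapi BM25 scoring. It is an illustration only, not Haystack's implementation: the `tokenize` and `bm25_scores` helpers, the tokenizer regex, and the default `k1` and `b` values are assumptions chosen for this example, and the shortened documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed, simplified tokenizer: lowercase, keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with textbook Okapi BM25 (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue  # only overlapping words contribute
            # Rare terms weigh more (smoothed IDF variant that stays positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to recognize themselves in mirrors.",
    "Bioluminescent waves can be seen in the Maldives and Puerto Rico.",
]
scores = bm25_scores("How many languages are spoken today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares words with the query, so it is the only one with a non-zero score in this sketch. Variants such as BM25L and BM25Plus adjust the term-frequency part of the formula, so their absolute scores differ from the ones computed here.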

Since the InMemoryBM25Retriever matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple. Nevertheless, more complex embedding-based approaches often struggle to beat it on out-of-domain data.

In addition to the query, the InMemoryBM25Retriever accepts other optional parameters, including `top_k` (the maximum number of Documents to retrieve) and `filters` to narrow down the search space. Parameters that affect the BM25 scoring itself, such as the specific BM25 algorithm and its parameters, must be set when the corresponding InMemoryDocumentStore is initialized.

Usage

On its own

python

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)

retriever = InMemoryBM25Retriever(document_store=document_store)
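# run() returns a dictionary with the matching Documents under the "documents" key
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])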

In a Pipeline

In a RAG Pipeline

Here's an example of the Retriever in a retrieval-augmented generation pipeline:

python

import os
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a RAG query pipeline
prompt_template = [
    ChatMessage.from_user(
        """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """,
    ),
]

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=prompt_template, required_variables="*"),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)

# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    },
)
print(result["answer_builder"]["answers"][0])

In a Document Search Pipeline

Here's how you can use this Retriever in a document search pipeline:

python

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline

# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents)

# Run the pipeline
result = pipeline.run(data={"retriever": {"query": "How many languages are there?"}})

print(result["retriever"]["documents"][0])
