
InMemoryBM25Retriever

A keyword-based Retriever compatible with InMemoryDocumentStore.

Most common position in a pipeline: In query pipelines:
  In a RAG pipeline, before a PromptBuilder
  In a semantic search pipeline, as the last component
  In an extractive QA pipeline, before an ExtractiveReader
Mandatory init variables: "document_store": An instance of InMemoryDocumentStore
Mandatory run variables: "query": A query string
Output variables: "documents": A list of documents (matching the query)
API reference: Retrievers
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py

Overview

InMemoryBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
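
To make "weighted word overlap" concrete, here is a minimal, dependency-free sketch of textbook Okapi BM25 scoring. It is an illustration only, not Haystack's implementation: the `tokenize` and `bm25_scores` helpers, the regex tokenizer, the smoothed IDF formula, and the `k1`/`b` defaults are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with textbook Okapi BM25.

    Assumes `docs` is non-empty and contains at least one token overall.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # only overlapping words contribute
            # Rare terms get a higher weight (smoothed, non-negative IDF).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b normalizes by document length.
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    "In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best_index = max(range(len(docs)), key=scores.__getitem__)
print(docs[best_index])
```

The document sharing the most (and rarest) query words ranks first, while a document with no shared words scores zero.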

Since the InMemoryBM25Retriever matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple, yet more complex embedding-based approaches often struggle to beat it on out-of-domain data.

In addition to the query, the InMemoryBM25Retriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
The settings that control BM25 scoring itself, namely which BM25 variant to use and its parameters, are not set on the Retriever. They must be defined when the corresponding InMemoryDocumentStore is initialized.

Usage

On its own

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = InMemoryBM25Retriever(document_store=document_store)
retriever.run(query="How many languages are spoken around the world today?")

In a Pipeline

In a RAG Pipeline

Here's an example of the Retriever in a retrieval-augmented generation pipeline:

import os

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a RAG query pipeline
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""

# Placeholder value: replace with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
# In stable Haystack 2.x releases the generator's metadata socket is named "meta"
# (early previews used "metadata"); adjust if your version differs.
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)

# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result['answer_builder']['answers'][0])

In a Document Search Pipeline

Here's how you can use this Retriever in a document search pipeline:

from haystack import Document
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents)

# Run the pipeline
result = pipeline.run(data={"retriever": {"query": "How many languages are there?"}})
print(result['retriever']['documents'][0])
