InMemoryBM25Retriever
A keyword-based Retriever compatible with InMemoryDocumentStore.
Most common position in a pipeline | In query pipelines: in a RAG pipeline, before a PromptBuilder; in a semantic search pipeline, as the last component; in an extractive QA pipeline, before an ExtractiveReader |
Mandatory init variables | "document_store": An instance of InMemoryDocumentStore |
Mandatory run variables | "query": A query string |
Output variables | "documents": A list of documents (matching the query) |
API reference | Retrievers |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py |
Overview
InMemoryBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from a temporary in-memory database. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
Since the InMemoryBM25Retriever matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple. Nevertheless, more complex embedding-based approaches often struggle to beat it on out-of-domain data.
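To make "weighted word overlap" concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not Haystack's implementation: the helper names (tokenize, bm25_scores), the regex tokenizer, and the values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs (a simplification of real tokenizers).
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms weigh more; the "+ 1" keeps the IDF positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to recognize themselves in mirrors.",
    "Bioluminescent waves can be seen in the Maldives and Puerto Rico.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The first document shares many rare terms with the query, so it ranks highest. The second shares none and scores zero, and the third matches only on the common word "the".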
In addition to the query, the InMemoryBM25Retriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
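The roles of these two parameters can be sketched without the library: filters shrink the candidate set before scoring, and top_k truncates the ranked result. The retrieve function below is hypothetical, uses a plain overlap count in place of BM25, and represents filters as a dict of required metadata values, which is a simplification and not Haystack's filter syntax.

```python
def retrieve(docs, query, top_k=10, filters=None):
    """Filter candidates by metadata, rank by word overlap, keep the top_k best."""
    query_terms = set(query.lower().split())
    filters = filters or {}
    # 1. Filters narrow the search space before any scoring happens.
    candidates = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in filters.items())
    ]
    # 2. Score the remaining candidates (overlap count stands in for BM25).
    scored = [
        (len(query_terms & set(d["content"].lower().split())), d)
        for d in candidates
    ]
    # 3. Sort best first and cut the list off at top_k.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for _, d in ranked[:top_k]]


docs = [
    {"content": "error E42 disk full", "meta": {"product": "alpha"}},
    {"content": "error E42 network timeout on retry", "meta": {"product": "beta"}},
    {"content": "disk usage warning", "meta": {"product": "beta"}},
]

hits = retrieve(docs, "error E42 disk full", top_k=1, filters={"product": "beta"})
print(hits[0]["content"])
```

With the filter in place, the best overall match (the "alpha" document) is never scored, so the single result comes from the "beta" documents only.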
Some parameters that affect BM25 retrieval must be defined when the corresponding InMemoryDocumentStore is initialized: these include the specific BM25 algorithm and its parameters.
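As a rough illustration of why such parameters matter, the sketch below isolates the term-frequency part of the Okapi BM25 formula. The names k1 and b follow the usual BM25 convention. The function tf_weight is hypothetical, and this page does not state which parameter names the document store accepts, so treat the mapping as an assumption.

```python
def tf_weight(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """Term-frequency part of the Okapi BM25 score for a single term."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


# Repeating a term helps less and less: the weight approaches k1 + 1.
for tf in (1, 2, 5, 20):
    print(tf, round(tf_weight(tf, doc_len=100, avg_len=100), 3))

# With b = 0, document length is ignored; with b = 1, long documents are penalized fully.
no_length_norm = tf_weight(2, doc_len=300, avg_len=100, b=0.0)
full_length_norm = tf_weight(2, doc_len=300, avg_len=100, b=1.0)
print(no_length_norm, full_length_norm)
```

A larger k1 lets repeated terms keep adding to the score for longer, while b controls how strongly long documents are discounted.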
Usage
On its own
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)
retriever = InMemoryBM25Retriever(document_store=document_store)
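# The result is a dictionary with the matching Documents under "documents"
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])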
In a Pipeline
In a RAG Pipeline
Here's an example of the Retriever in a retrieval-augmented generation pipeline:
import os
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Create a RAG query pipeline
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
\nQuestion: {{question}}
\nAnswer:
"""
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")
# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)
# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result['answer_builder']['answers'][0])
In a Document Search Pipeline
Here's how you can use this Retriever in a document search pipeline:
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline
# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents)
# Run the pipeline
result = pipeline.run(data={"retriever": {"query": "How many languages are there?"}})
print(result['retriever']['documents'][0])