DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord🎨 Studio (Waitlist)
Documentation

OpenSearchBM25Retriever

This is a keyword-based Retriever that fetches Documents matching a query from an OpenSearch Document Store.

NameOpenSearchBM25Retriever
Sourcehttps://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/opensearch
Most common position in a pipeline1. Before a PromptBuilder in a RAG pipeline 2. The last component in the semantic search pipeline 3. Before an ExtractiveReader in an extractive QA pipeline
Mandatory input variables"query": A query string
Output variables"documents": A list of documents matching the query

Overview

OpenSearchBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from an OpenSearchDocumentStore. It determines the similarity between Documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.

Since the OpenSearchBM25Retriever matches strings based on word overlap, it’s often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is very lightweight and simple. Nevertheless, it can be hard to beat with more complex embedding-based approaches on out-of-domain data.

In addition to the query, the OpenSearchBM25Retriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
You can adjust how inexact fuzzy matching is performed, using the fuzziness parameter.
It is also possible to specify if all terms in the query must match using the all_terms_must_match parameter, which defaults to False.

If you want more flexible matching of a query to Documents, you can use the OpenSearchEmbeddingRetriever, which uses vectors created by LLMs to retrieve relevant information.

Setup and installation

Install and run an OpenSearch instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull opensearchproject/opensearch:2.11.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" opensearchproject/opensearch:2.11.0

As an alternative, you can go to OpenSearch integration GitHub and start a Docker container running OpenSearch using the provided docker-compose.yml:

docker compose up

Once you have a running OpenSearch instance, install the opensearch-haystack integration:

pip install opensearch-haystack

Usage

On its own

This Retriever needs the OpensearchDocumentStore and indexed Documents to run. You can’t use it on its own.

In a RAG pipeline

Set your OPENAI_API_KEY as an environment variable and then run the following code:

from haystack_integrations.components.retrievers.opensearch  import OpenSearchBM25Retriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

import os
api_key = os.environ['OPENAI_API_KEY']

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = OpenSearchDocumentStore(hosts="http://localhost:9200", use_ssl=True,
verify_certs=False, http_auth=("admin", "admin"))

# Add Documents
documents = [Document(content="There are over 7,000 languages spoken around the world today."),
			       Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
			       Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")]

# DuplicatePolicy.SKIP param is optional, but useful to run the script multiple times without throwing errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = OpenSearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(api_key=api_key), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.metadata", "answer_builder.metadata")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
print(result['answer_builder']['answers'][0])

Here’s an example output:

GeneratedAnswer(
  data='Over 7,000 languages are spoken around the world today.',
  query='How many languages are spoken around the world today?',
  documents=[
    Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 7.179112),
    Document(id=7f225626ad1019b273326fbaf11308edfca6d663308a4a3533ec7787367d59a2, content: 'In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the ph...', score: 1.1426818)],
  meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 86, 'completion_tokens': 13, 'total_tokens': 99}})

Related Links

Check out the API reference in the GitHub repo or in our docs: