
Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query.

Module in_memory/bm25_retriever

InMemoryBM25Retriever

Retrieves documents using the BM25 algorithm.

Usage example:

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")

print(result["documents"])
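To give an intuition for how BM25 ranks the two documents above, here is a minimal pure-Python sketch of plain Okapi BM25. It is not the code used by InMemoryDocumentStore, which may use a different BM25 variant and tokenizer. The function name `bm25_scores`, the whitespace tokenization, and the `k1` and `b` defaults are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every text in `corpus` against `query` with plain Okapi BM25 (illustrative only)."""
    if not corpus:
        return []
    # Assumption: lowercase whitespace tokenization.
    tokenized = [text.lower().split() for text in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency; always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", corpus)
print(scores)
```

Only the German document contains the query term, so it is the only one with a non-zero score in this sketch.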

InMemoryBM25Retriever.__init__

def __init__(document_store: InMemoryDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)

Create the InMemoryBM25Retriever component.

Arguments:

  • document_store: An instance of InMemoryDocumentStore.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The maximum number of documents to retrieve.
  • scale_score: Scales the BM25 score to a unit interval in the range of 0 to 1, where 1 means extremely relevant. If set to False, uses raw similarity scores.
  • filter_policy: The filter policy to apply during retrieval.

Raises:

  • ValueError: If the specified top_k is not > 0.
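The effect of scale_score and the top_k check can be sketched as follows. This is an illustration, not the library's implementation: it assumes that raw scores are squashed with a logistic function, and the names `scale_bm25_score`, `validate_top_k`, and the constant `SCALING_FACTOR` are hypothetical.

```python
import math

# Hypothetical constant for this sketch; the real scaling factor is an implementation detail.
SCALING_FACTOR = 8.0


def scale_bm25_score(raw_score: float, scaling_factor: float = SCALING_FACTOR) -> float:
    """Squash a raw BM25 score into the open interval (0, 1) with a logistic function."""
    return 1.0 / (1.0 + math.exp(-raw_score / scaling_factor))


def validate_top_k(top_k: int) -> int:
    """Mirror the documented constraint: top_k must be greater than 0."""
    if top_k <= 0:
        raise ValueError(f"top_k must be greater than 0, got {top_k}")
    return top_k


print(scale_bm25_score(0.0))   # lowest value for a non-negative raw score under this mapping
print(scale_bm25_score(20.0))  # larger raw scores approach 1
```

Scaled scores are easier to compare against a fixed threshold, while raw scores preserve the original BM25 magnitudes.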

InMemoryBM25Retriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

InMemoryBM25Retriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryBM25Retriever"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.
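The to_dict and from_dict pair is meant to round-trip a component's configuration. The sketch below shows that pattern on a hypothetical stand-in class, `SketchRetriever`; the dictionary layout with `type` and `init_parameters` keys is an assumption for illustration and may differ from what the real component emits.

```python
from typing import Any, Dict, Optional


class SketchRetriever:
    """Hypothetical stand-in for a retriever; not the Haystack class."""

    def __init__(self, filters: Optional[Dict[str, Any]] = None, top_k: int = 10, scale_score: bool = False):
        self.filters = filters
        self.top_k = top_k
        self.scale_score = scale_score

    def to_dict(self) -> Dict[str, Any]:
        # Store only what is needed to call __init__ again.
        return {
            "type": type(self).__name__,
            "init_parameters": {
                "filters": self.filters,
                "top_k": self.top_k,
                "scale_score": self.scale_score,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchRetriever":
        # Rebuild the component from the stored init parameters.
        return cls(**data["init_parameters"])


original = SketchRetriever(top_k=3, scale_score=True)
restored = SketchRetriever.from_dict(original.to_dict())
print(restored.to_dict())
```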

InMemoryBM25Retriever.run

@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None)

Run the InMemoryBM25Retriever on the given input data.

Arguments:

  • query: The query string for the Retriever.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The maximum number of documents to return.
  • scale_score: Scales the BM25 score to a unit interval in the range of 0 to 1, where 1 means extremely relevant. If set to False, uses raw similarity scores. If not specified, the value provided at initialization is used.

Raises:

  • ValueError: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

Returns:

The retrieved documents.
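How run-time arguments interact with the values given at initialization can be sketched as below. The fallback to the initialization value follows the description above. The filter policy handling is an assumption: the sketch treats REPLACE as "run-time filters win" and MERGE as "combine both with a logical AND", and the names `SketchFilterPolicy`, `resolve_filters`, and `resolve_top_k` are hypothetical.

```python
from enum import Enum
from typing import Any, Dict, Optional


class SketchFilterPolicy(Enum):
    """Hypothetical stand-in for FilterPolicy."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: SketchFilterPolicy,
    init_filters: Optional[Dict[str, Any]],
    runtime_filters: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Pick the filters to apply for one run() call."""
    if not runtime_filters:
        # Nothing passed at run time: fall back to the initialization value.
        return init_filters
    if policy is SketchFilterPolicy.MERGE and init_filters:
        # Assumption: merging means both conditions must hold.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    return runtime_filters


def resolve_top_k(init_top_k: int, runtime_top_k: Optional[int]) -> int:
    """Use the run-time top_k if given, otherwise the initialization value."""
    return init_top_k if runtime_top_k is None else runtime_top_k


init = {"field": "lang", "operator": "==", "value": "en"}
runtime = {"field": "year", "operator": ">=", "value": 2020}
print(resolve_filters(SketchFilterPolicy.REPLACE, init, runtime))
print(resolve_filters(SketchFilterPolicy.MERGE, init, runtime))
```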

Module in_memory/embedding_retriever

InMemoryEmbeddingRetriever

Retrieves documents using vector similarity.

Usage example:

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)

query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]

result = retriever.run(query_embedding=query_embedding)

print(result["documents"])
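Conceptually, retrieval by vector similarity scores every stored embedding against the query embedding and keeps the best top_k. The following is a minimal sketch using cosine similarity on toy two-dimensional vectors. The helper names are hypothetical, and the real store may use a different similarity function or a vectorized implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_embedding: List[float],
    doc_embeddings: Dict[str, List[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to top_k (doc_id, score) pairs, best match first."""
    scored = [(doc_id, cosine_similarity(query_embedding, emb)) for doc_id, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k_by_similarity([1.0, 0.0], toy_embeddings, top_k=2)
print(results)
```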

InMemoryEmbeddingRetriever.__init__

def __init__(document_store: InMemoryDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)

Create the InMemoryEmbeddingRetriever component.

Arguments:

  • document_store: An instance of InMemoryDocumentStore.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The maximum number of documents to retrieve.
  • scale_score: Scales the similarity score to a unit interval in the range of 0 to 1, where 1 means extremely relevant. If set to False, uses raw similarity scores.
  • return_embedding: Whether to return the embedding of the retrieved Documents.
  • filter_policy: The filter policy to apply during retrieval.

Raises:

  • ValueError: If the specified top_k is not > 0.

InMemoryEmbeddingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

InMemoryEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryEmbeddingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

InMemoryEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
        return_embedding: Optional[bool] = None)

Run the InMemoryEmbeddingRetriever on the given input data.

Arguments:

  • query_embedding: Embedding of the query.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The maximum number of documents to return.
  • scale_score: Scales the similarity score to a unit interval in the range of 0 to 1, where 1 means extremely relevant. If set to False, uses raw similarity scores. If not specified, the value provided at initialization is used.
  • return_embedding: Whether to return the embedding of the retrieved Documents.

Raises:

  • ValueError: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.

Returns:

The retrieved documents.
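What scaling a similarity score to the unit interval can look like depends on the similarity function. The sketch below assumes two common mappings: a linear shift for cosine similarity and a logistic squash for dot product. The function name `scale_similarity` and the `dot_product_factor` default are hypothetical and not taken from the library.

```python
import math


def scale_similarity(raw_score: float, similarity: str = "cosine", dot_product_factor: float = 100.0) -> float:
    """Map a raw similarity score into the range 0 to 1 (illustrative only)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (raw_score + 1.0) / 2.0
    if similarity == "dot_product":
        # Dot products are unbounded; squash them with a logistic function.
        return 1.0 / (1.0 + math.exp(-raw_score / dot_product_factor))
    raise ValueError(f"Unknown similarity function: {similarity}")


print(scale_similarity(0.5))
print(scale_similarity(42.0, similarity="dot_product"))
```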

Module filter_retriever

FilterRetriever

Retrieves documents that match the provided filters.

Usage example:

from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})

# if passed in the run method, filters will override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})

print(result["documents"])

FilterRetriever.__init__

def __init__(document_store: DocumentStore,
             filters: Optional[Dict[str, Any]] = None)

Create the FilterRetriever component.

Arguments:

  • document_store: An instance of a DocumentStore.
  • filters: A dictionary with filters to narrow down the search space.

FilterRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

FilterRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "FilterRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

FilterRetriever.run

@component.output_types(documents=List[Document])
def run(filters: Optional[Dict[str, Any]] = None)

Run the FilterRetriever on the given input data.

Arguments:

  • filters: A dictionary with filters to narrow down the search space. If not specified, the FilterRetriever uses the value provided at initialization.

Returns:

The retrieved documents.
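The behavior described above, a single comparison filter with run-time filters taking precedence over those set at initialization, can be sketched in plain Python. Documents are modeled as dictionaries here, and `COMPARATORS`, `matches`, and `filter_documents` are hypothetical helpers. The real filter syntax also supports nested logical conditions, which this sketch leaves out.

```python
import operator
from typing import Any, Dict, List, Optional

# Comparison operators supported by this sketch.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Check one {"field", "operator", "value"} condition against a document's metadata."""
    field = condition["field"]
    if field not in meta:
        # Sketch choice: a document without the field never matches.
        return False
    return COMPARATORS[condition["operator"]](meta[field], condition["value"])


def filter_documents(
    docs: List[Dict[str, Any]],
    init_filters: Optional[Dict[str, Any]],
    run_filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Apply the run-time filters if given, otherwise the initialization filters."""
    active = run_filters if run_filters is not None else init_filters
    if active is None:
        return list(docs)
    return [doc for doc in docs if matches(doc["meta"], active)]


docs = [
    {"content": "Python is a popular programming language", "meta": {"lang": "en"}},
    {"content": "python ist eine beliebte Programmiersprache", "meta": {"lang": "de"}},
]
init_filters = {"field": "lang", "operator": "==", "value": "en"}
run_filters = {"field": "lang", "operator": "==", "value": "de"}
print(filter_documents(docs, init_filters, run_filters))
```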

Module sentence_window_retrieval

SentenceWindowRetrieval

A component that retrieves surrounding documents of a given document from the document store.

This component is designed to work together with one of the existing retrievers, such as a BM25Retriever or an EmbeddingRetriever. First use one of these retrievers to fetch documents for a query, then pass the results to this component to get the documents that surround each retrieved document.

SentenceWindowRetrieval.__init__

def __init__(document_store: DocumentStore, window_size: int = 3)

Creates a new SentenceWindowRetrieval component.

Arguments:

  • document_store: The document store to use for retrieving the surrounding documents.
  • window_size: The number of surrounding documents to retrieve.

SentenceWindowRetrieval.merge_documents_text

@staticmethod
def merge_documents_text(documents: List[Document]) -> str

Merges the text of a list of documents into a single string.

This function concatenates the textual content of a list of documents into a single string, eliminating any overlapping content.

Arguments:

  • documents: List of Documents to merge.
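One way to concatenate chunks while dropping overlapping text is sketched below. It assumes each chunk records the character offset at which it starts in the source text. That offset, the tuple representation, and the name `merge_chunks` are assumptions for illustration, since the description above does not say how overlaps are detected.

```python
from typing import List, Tuple


def merge_chunks(chunks: List[Tuple[int, str]]) -> str:
    """Merge (start_offset, text) chunks into one string, skipping text already covered."""
    merged = ""
    covered_until = 0  # Source offset up to which text has already been emitted.
    for start, text in sorted(chunks, key=lambda chunk: chunk[0]):
        # Number of leading characters of this chunk that overlap earlier chunks.
        skip = max(covered_until - start, 0)
        merged += text[skip:]
        covered_until = max(covered_until, start + len(text))
    return merged


# Two chunks of "The quick brown fox jumps" that overlap on the word "brown".
chunks = [(10, "brown fox jumps"), (0, "The quick brown")]
print(merge_chunks(chunks))
```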

SentenceWindowRetrieval.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

SentenceWindowRetrieval.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SentenceWindowRetrieval"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

SentenceWindowRetrieval.run

@component.output_types(context_windows=List[str])
def run(retrieved_documents: List[Document])

Gets the surrounding documents from the document store, based on the source_id and the doc.meta['split_id'] of each retrieved document.

Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given document from the document store.

Arguments:

  • retrieved_documents (List[Document]): List of retrieved documents from the previous retriever.

Returns:

A dictionary with the following keys:

  • context_windows: List of strings representing the context windows of the retrieved documents.
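The sentence-window lookup can be sketched in plain Python as follows. Documents are modeled as dictionaries, and the sketch assumes that source_id is stored in the metadata next to split_id and that the window extends window_size splits on each side of the retrieved document. The function name `context_window` and the joining with a single space are also assumptions.

```python
from typing import Any, Dict, List


def context_window(retrieved: Dict[str, Any], store: List[Dict[str, Any]], window_size: int = 3) -> str:
    """Collect the splits around `retrieved` from the same source and join their text."""
    source_id = retrieved["meta"]["source_id"]
    split_id = retrieved["meta"]["split_id"]
    neighbours = [
        doc
        for doc in store
        if doc["meta"]["source_id"] == source_id
        and abs(doc["meta"]["split_id"] - split_id) <= window_size
    ]
    # Restore the original reading order before joining.
    neighbours.sort(key=lambda doc: doc["meta"]["split_id"])
    return " ".join(doc["content"] for doc in neighbours)


store = [
    {"content": f"Sentence {i}.", "meta": {"source_id": "a", "split_id": i}} for i in range(5)
]
store.append({"content": "Other source.", "meta": {"source_id": "b", "split_id": 2}})

hit = store[2]  # Pretend a previous retriever returned the middle sentence.
context_windows = [context_window(hit, store, window_size=1)]
print(context_windows)
```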