Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query.
Module in_memory/bm25_retriever
InMemoryBM25Retriever
Retrieves documents that are most similar to the query using a keyword-based algorithm.
Use this retriever with the InMemoryDocumentStore.
Usage example
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)
result = retriever.run(query="Programmiersprache")
print(result["documents"])
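To make the "keyword-based algorithm" concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration under stated assumptions, not the code the retriever runs: the whitespace tokenizer, the function name bm25_scores, and the k1 and b defaults are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each text in `docs` against `query` with Okapi BM25 (illustrative only)."""
    # Assumption: lowercase whitespace tokenization; real tokenizers differ.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via `b`.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", docs)
```

Only the German sentence contains the query term, so it receives a positive score while the English sentence scores zero.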
InMemoryBM25Retriever.__init__
def __init__(document_store: InMemoryDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             scale_score: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
Create the InMemoryBM25Retriever component.
Arguments:
document_store
: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
filters
: A dictionary with filters to narrow down the retriever's search space in the document store.
top_k
: The maximum number of documents to retrieve.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
filter_policy
: The filter policy to apply during retrieval. Filter policy determines how filters are applied when retrieving documents. You can choose:
REPLACE (default): Overrides the initialization filters with the filters specified at runtime. Use this policy to dynamically change filtering for specific queries.
MERGE: Combines runtime filters with initialization filters to narrow down the search.
Raises:
ValueError
: If the specified top_k is not > 0.
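The difference between the two filter policies can be sketched in plain Python. This is a conceptual illustration, not the library's implementation: the helper resolve_filters, the string policy names standing in for the FilterPolicy enum, the fields meta.lang and meta.year, and the choice to combine filters with a logical AND under MERGE are all assumptions made for this example.

```python
from typing import Any, Dict, Optional

Filters = Optional[Dict[str, Any]]


def resolve_filters(policy: str, init_filters: Filters, runtime_filters: Filters) -> Filters:
    """Pick the filters to apply, following the REPLACE / MERGE descriptions above."""
    if runtime_filters is None:
        # Nothing passed at runtime: fall back to the initialization filters.
        return init_filters
    if policy == "replace" or init_filters is None:
        # REPLACE: runtime filters override the initialization filters.
        return runtime_filters
    if policy == "merge":
        # MERGE (assumed here to be a logical AND): both sets must match.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


# Hypothetical metadata fields, used only for illustration.
init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "meta.year", "operator": ">=", "value": 2020}

replaced = resolve_filters("replace", init_filters, runtime_filters)
merged = resolve_filters("merge", init_filters, runtime_filters)
```

With REPLACE only the year condition survives, while MERGE keeps both conditions, which narrows the search further.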
InMemoryBM25Retriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
InMemoryBM25Retriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryBM25Retriever"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
InMemoryBM25Retriever.run
@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None)
Run the InMemoryBM25Retriever on the given input data.
Arguments:
query
: The query string for the Retriever.
filters
: A dictionary with filters to narrow down the search space when retrieving documents.
top_k
: The maximum number of documents to return.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
Raises:
ValueError
: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.
Returns:
The retrieved documents.
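This reference does not state how scale_score maps raw scores into the 0 to 1 range. As a rough illustration of the idea only, the sketch below squashes unbounded scores with a logistic function; the function name scale_to_unit_interval and the scaling_factor value are assumptions, not the library's actual formula.

```python
import math


def scale_to_unit_interval(raw_score: float, scaling_factor: float = 8.0) -> float:
    """Squash an unbounded raw score into (0, 1) with a logistic function.

    `scaling_factor` is a hypothetical constant controlling how quickly
    scores saturate towards 1.
    """
    return 1.0 / (1.0 + math.exp(-raw_score / scaling_factor))


raw_scores = [0.0, 2.5, 12.0]
scaled = [scale_to_unit_interval(score) for score in raw_scores]
```

A monotonic mapping like this keeps the ranking of documents unchanged; it only changes the scale on which scores are reported.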
Module in_memory/embedding_retriever
InMemoryEmbeddingRetriever
Retrieves documents that are most semantically similar to the query.
Use this retriever with the InMemoryDocumentStore.
When using this retriever, make sure it has query and document embeddings available. In indexing pipelines, use a DocumentEmbedder to embed documents. In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.
Usage example
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)
query = "Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]
result = retriever.run(query_embedding=query_embedding)
print(result["documents"])
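Conceptually, embedding retrieval ranks documents by how close their vectors are to the query vector. The following plain-Python sketch shows that idea with cosine similarity. It is an assumption-laden illustration: the helper names and the toy three-dimensional vectors are hypothetical, and the document store may use a different similarity function.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_embedding: List[float], doc_embeddings: List[List[float]], top_k: int = 10
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the `top_k` closest documents."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(doc_embeddings)
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real model embeddings.
doc_embeddings = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_embedding = [0.0, 1.0, 0.0]
ranked = top_k_by_similarity(query_embedding, doc_embeddings, top_k=2)
```

The second document points most nearly in the direction of the query vector, so it is ranked first.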
InMemoryEmbeddingRetriever.__init__
def __init__(document_store: InMemoryDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             scale_score: bool = False,
             return_embedding: bool = False,
             filter_policy: FilterPolicy = FilterPolicy.REPLACE)
Create the InMemoryEmbeddingRetriever component.
Arguments:
document_store
: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
filters
: A dictionary with filters to narrow down the retriever's search space in the document store.
top_k
: The maximum number of documents to retrieve.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
return_embedding
: When True, returns the embedding of the retrieved documents. When False, returns just the documents, without their embeddings.
filter_policy
: The filter policy to apply during retrieval. Filter policy determines how filters are applied when retrieving documents. You can choose:
REPLACE (default): Overrides the initialization filters with the filters specified at runtime. Use this policy to dynamically change filtering for specific queries.
MERGE: Combines runtime filters with initialization filters to narrow down the search.
Raises:
ValueError
: If the specified top_k is not > 0.
InMemoryEmbeddingRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
InMemoryEmbeddingRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryEmbeddingRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
InMemoryEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
        return_embedding: Optional[bool] = None)
Run the InMemoryEmbeddingRetriever on the given input data.
Arguments:
query_embedding
: Embedding of the query.
filters
: A dictionary with filters to narrow down the search space when retrieving documents.
top_k
: The maximum number of documents to return.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
return_embedding
: When True, returns the embedding of the retrieved documents. When False, returns just the documents, without their embeddings.
Raises:
ValueError
: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.
Returns:
The retrieved documents.
Module filter_retriever
FilterRetriever
Retrieves documents that match the provided filters.
Usage example
from haystack import Document
from haystack.components.retrievers import FilterRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language", meta={"lang": "en"}),
    Document(content="python ist eine beliebte Programmiersprache", meta={"lang": "de"}),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = FilterRetriever(doc_store, filters={"field": "lang", "operator": "==", "value": "en"})
# if passed in the run method, filters override those provided at initialization
result = retriever.run(filters={"field": "lang", "operator": "==", "value": "de"})
print(result["documents"])
FilterRetriever.__init__
def __init__(document_store: DocumentStore,
             filters: Optional[Dict[str, Any]] = None)
Create the FilterRetriever component.
Arguments:
document_store
: An instance of a Document Store to use with the Retriever.
filters
: A dictionary with filters to narrow down the search space.
FilterRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
FilterRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "FilterRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
FilterRetriever.run
@component.output_types(documents=List[Document])
def run(filters: Optional[Dict[str, Any]] = None)
Run the FilterRetriever on the given input data.
Arguments:
filters
: A dictionary with filters to narrow down the search space. If not specified, the FilterRetriever uses the values provided at initialization.
Returns:
A list of retrieved documents.
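The way a single comparison filter selects documents, and how runtime filters take precedence over initialization filters, can be sketched without the library. The helpers matches and filter_documents, the COMPARATORS table, and the use of plain dictionaries in place of Document objects are all assumptions made for this illustration.

```python
import operator
from typing import Any, Dict, List, Optional

# Maps the operator strings used in filter dictionaries to Python comparisons.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Check one {"field", "operator", "value"} condition against metadata."""
    field = condition["field"]
    if field not in meta:
        return False
    compare = COMPARATORS[condition["operator"]]
    return compare(meta[field], condition["value"])


def filter_documents(
    docs: List[Dict[str, Any]],
    init_filters: Optional[Dict[str, Any]],
    runtime_filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Apply runtime filters if given, otherwise the initialization filters."""
    active = runtime_filters if runtime_filters is not None else init_filters
    if active is None:
        return list(docs)
    return [doc for doc in docs if matches(doc["meta"], active)]


docs = [
    {"content": "Python is a popular programming language", "meta": {"lang": "en"}},
    {"content": "python ist eine beliebte Programmiersprache", "meta": {"lang": "de"}},
]
init_filters = {"field": "lang", "operator": "==", "value": "en"}

english = filter_documents(docs, init_filters)
german = filter_documents(docs, init_filters, {"field": "lang", "operator": "==", "value": "de"})
```

Calling the helper without runtime filters returns the English document, and passing the German filter at runtime returns the German one instead.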
Module sentence_window_retriever
SentenceWindowRetriever
Retrieves documents adjacent to a given document in the Document Store.
During indexing, documents are broken into smaller chunks, or sentences. When you submit a query, the Retriever fetches the most relevant sentence. To provide full context, SentenceWindowRetriever fetches a number of neighboring sentences before and after each relevant one. You can set this number with the window_size parameter.
It uses source_id and doc.meta['split_id'] to locate the surrounding documents.
This component works with existing Retrievers, like BM25Retriever or EmbeddingRetriever. First, use a Retriever to find documents based on a query and then use SentenceWindowRetriever to get the surrounding documents for context.
Usage example
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=2))
rag.connect("bm25_retriever", "sentence_window_retriever")
rag.run({'bm25_retriever': {"query":"third"}})
# >> {'sentence_window_retriever': {'context_windows': ['some words. There is a second sentence.
# >> And there is also a third sentence. It also contains a fourth sentence. And a fifth sentence. And a sixth
# >> sentence. And a']}}
SentenceWindowRetriever.__init__
def __init__(document_store: DocumentStore, window_size: int = 3)
Creates a new SentenceWindowRetriever component.
Arguments:
document_store
: The Document Store to retrieve the surrounding documents from.
window_size
: The number of documents to retrieve before and after the relevant one. For example, window_size: 2 fetches 2 preceding and 2 following documents.
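To illustrate how window_size translates into a set of neighboring chunks, here is a small plain-Python sketch. It is not the component's implementation: the helper names, the dictionary-based chunks, and the clamping of the window at split ID 0 are assumptions for this example.

```python
from typing import Any, Dict, List


def window_split_ids(split_id: int, window_size: int) -> List[int]:
    """Split IDs covered by a window centered on `split_id`, clamped at 0."""
    start = max(0, split_id - window_size)
    return list(range(start, split_id + window_size + 1))


def surrounding_chunks(
    chunks: List[Dict[str, Any]], source_id: str, split_id: int, window_size: int
) -> List[Dict[str, Any]]:
    """Select chunks from the same source whose split ID falls inside the window."""
    wanted = set(window_split_ids(split_id, window_size))
    selected = [
        chunk
        for chunk in chunks
        if chunk["source_id"] == source_id and chunk["split_id"] in wanted
    ]
    # Return the chunks in reading order.
    return sorted(selected, key=lambda chunk: chunk["split_id"])


# Six chunks from one hypothetical source document, plus one from another.
chunks = [{"source_id": "doc-a", "split_id": i, "content": f"chunk {i}"} for i in range(6)]
chunks.append({"source_id": "doc-b", "split_id": 3, "content": "other"})

window = surrounding_chunks(chunks, "doc-a", split_id=3, window_size=2)
```

With window_size set to 2 and the relevant chunk at split ID 3, the window covers split IDs 1 through 5 of the same source document and ignores chunks from other sources.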
SentenceWindowRetriever.merge_documents_text
@staticmethod
def merge_documents_text(documents: List[Document]) -> str
Merges the text of a list of documents into a single string.
This function concatenates the textual content of a list of documents into a single string, eliminating any overlapping content.
Arguments:
documents
: List of Documents to merge.
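One way to concatenate overlapping chunks without repeating text is to drop, from each chunk, the longest prefix that already ends the merged string. The sketch below shows that approach on plain strings. It is an assumption, not the library's method, which may instead rely on split offsets stored in document metadata; the function name merge_overlapping is hypothetical.

```python
from typing import List


def merge_overlapping(texts: List[str]) -> str:
    """Concatenate texts in order, removing text repeated at chunk boundaries."""
    merged = ""
    for text in texts:
        overlap = 0
        # Find the longest prefix of `text` that is also a suffix of `merged`.
        for size in range(min(len(merged), len(text)), 0, -1):
            if merged.endswith(text[:size]):
                overlap = size
                break
        merged += text[overlap:]
    return merged


chunks = [
    "There is a second sentence. And there",
    "sentence. And there is also a third",
    "also a third sentence.",
]
merged_text = merge_overlapping(chunks)
```

Purely text-based overlap detection like this can misfire when a chunk happens to start with words that also end the previous chunk by coincidence, which is one reason an offset-based approach may be preferable in practice.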
SentenceWindowRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
SentenceWindowRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SentenceWindowRetriever"
Deserializes the component from a dictionary.
Returns:
Deserialized component.
SentenceWindowRetriever.run
@component.output_types(context_windows=List[str])
def run(retrieved_documents: List[Document])
Gets the surrounding documents from the document store, based on source_id and doc.meta['split_id'].
Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given document from the document store.
Arguments:
retrieved_documents
: List of retrieved documents from the previous retriever.
Returns:
A dictionary with the following keys:
context_windows
: List of strings representing the context windows of the retrieved documents.
Module sentence_window_retrieval
SentenceWindowRetrieval
This class is deprecated. Please use SentenceWindowRetriever instead.
SentenceWindowRetrieval.merge_documents_text
@staticmethod
def merge_documents_text(documents: List[Document]) -> str
Merges the text of a list of documents into a single string.
This function concatenates the textual content of a list of documents into a single string, eliminating any overlapping content.
Arguments:
documents
: List of Documents to merge.
SentenceWindowRetrieval.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
SentenceWindowRetrieval.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SentenceWindowRetriever"
Deserializes the component from a dictionary.
Returns:
Deserialized component.
SentenceWindowRetrieval.run
@component.output_types(context_windows=List[str])
def run(retrieved_documents: List[Document])
Gets the surrounding documents from the document store, based on source_id and doc.meta['split_id'].
Implements the logic behind the sentence-window technique, retrieving the surrounding documents of a given document from the document store.
Arguments:
retrieved_documents
: List of retrieved documents from the previous retriever.
Returns:
A dictionary with the following keys:
context_windows
: List of strings representing the context windows of the retrieved documents.