Sweep through Document Stores and return a set of candidate documents that are relevant to the query.
Module haystack_experimental.components.retrievers.auto_merging_retriever
AutoMergingRetriever
A retriever that returns the parent documents of the matched leaf documents, based on a threshold setting.
The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create such a structure. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents.
The rationale is that a paragraph is split into multiple chunks represented as leaf documents; if multiple chunks match a given query, the whole paragraph is likely to be more informative than the individual chunks alone.
Currently the AutoMergingRetriever can only be used by the following DocumentStores:
from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack_experimental.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# create a hierarchical document structure with 2 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])
# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs["documents"]:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)
# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children and we retrieved 2 of them (2/3 ≈ 0.67, which is above threshold=0.5)
leaf_docs = [doc for doc in docs["documents"] if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
>> {'documents': [Document(id=538..),
>> content: 'warm glow over the trees. Birds began to sing.',
>> meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
>> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
AutoMergingRetriever.__init__
def __init__(document_store: DocumentStore, threshold: float = 0.5)
Initialize the AutoMergingRetriever.
Arguments:
document_store
: DocumentStore from which to retrieve the parent documents
threshold
: Threshold to decide whether the parent instead of the individual documents is returned
AutoMergingRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
AutoMergingRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AutoMergingRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary with serialized data.
Returns:
An instance of the component.
AutoMergingRetriever.run
@component.output_types(documents=List[Document])
def run(matched_leaf_documents: List[Document])
Run the AutoMergingRetriever.
Groups the matched leaf documents by their parent documents and returns the parent documents if the number of matched leaf documents below the same parent is higher than the defined threshold. Otherwise, returns the matched leaf documents.
Arguments:
matched_leaf_documents
: List of leaf documents that were matched by a retriever
Returns:
List of parent documents or matched leaf documents based on the threshold value
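The grouping rule described above can be pictured with a minimal sketch. This is not the library's implementation: `Node` and `auto_merge` are hypothetical stand-ins, and the sketch assumes the threshold is compared against the share of a parent's children that were matched, using a strict "greater than" as the description and the usage example suggest.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a Document in the hierarchy."""
    id: str
    parent_id: str = ""
    children_ids: List[str] = field(default_factory=list)


def auto_merge(matched_leaves: List[Node], parents: Dict[str, Node], threshold: float = 0.5) -> List[Node]:
    # Group the matched leaves by the parent they belong to.
    by_parent: Dict[str, List[Node]] = defaultdict(list)
    for leaf in matched_leaves:
        by_parent[leaf.parent_id].append(leaf)

    merged: List[Node] = []
    for parent_id, leaves in by_parent.items():
        parent = parents[parent_id]
        # Share of this parent's children that were matched.
        score = len(leaves) / len(parent.children_ids)
        if score > threshold:
            merged.append(parent)   # enough siblings matched: return the parent
        else:
            merged.extend(leaves)   # too few matched: keep the individual leaves
    return merged


parent = Node(id="p1", children_ids=["a", "b", "c"])
leaves = [Node(id=child_id, parent_id="p1") for child_id in parent.children_ids]

two_of_three = auto_merge(leaves[:2], {"p1": parent})  # 2/3 is above 0.5
one_of_three = auto_merge(leaves[:1], {"p1": parent})  # 1/3 is not above 0.5
print([n.id for n in two_of_three], [n.id for n in one_of_three])
```

With two of three children matched the sketch returns the parent; with one of three it returns the leaf unchanged.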
Module haystack_experimental.components.retrievers.chat_message_retriever
ChatMessageRetriever
Retrieves chat messages from the underlying ChatMessageStore.
Usage example:
from haystack.dataclasses import ChatMessage
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore
messages = [
    ChatMessage.from_assistant("Hello, how can I help you?"),
    ChatMessage.from_user("Hi, I have a question about Python. What is a Protocol?"),
]
message_store = InMemoryChatMessageStore()
message_store.write_messages(messages)
retriever = ChatMessageRetriever(message_store)
result = retriever.run()
print(result["messages"])
ChatMessageRetriever.__init__
def __init__(message_store: ChatMessageStore, last_k: int = 10)
Create the ChatMessageRetriever component.
Arguments:
message_store
: An instance of a ChatMessageStore.
last_k
: The number of last messages to retrieve. Defaults to 10 messages if not specified.
ChatMessageRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
ChatMessageRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChatMessageRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
ChatMessageRetriever.run
@component.output_types(messages=List[ChatMessage])
def run(last_k: Optional[int] = None)
Run the ChatMessageRetriever.
Arguments:
last_k
: The number of last messages to retrieve. This parameter takes precedence over the last_k parameter passed to the ChatMessageRetriever constructor. If unspecified, the last_k parameter passed to the constructor will be used.
Raises:
ValueError
: If last_k is not None and is less than 1
Returns:
messages
- The retrieved chat messages.
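The precedence and validation rules for last_k can be sketched in a few lines. `select_last_messages` is a hypothetical helper operating on plain strings, written only to illustrate the documented behavior; it is not the component's code.

```python
from typing import List, Optional


def select_last_messages(messages: List[str], init_last_k: int = 10,
                         run_last_k: Optional[int] = None) -> List[str]:
    # A run-time value, when given, must be at least 1.
    if run_last_k is not None and run_last_k < 1:
        raise ValueError("last_k must be at least 1")
    # The run-time value takes precedence over the constructor value.
    k = run_last_k if run_last_k is not None else init_last_k
    return messages[-k:]


history = ["m1", "m2", "m3", "m4"]
default_slice = select_last_messages(history, init_last_k=3)                 # constructor value used
override_slice = select_last_messages(history, init_last_k=3, run_last_k=1)  # run-time value wins
print(default_slice, override_slice)
```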
Module haystack_experimental.components.retrievers.in_memory.bm25_retriever
InMemoryBM25Retriever
Retrieves documents that are most similar to the query using a keyword-based algorithm.
Use this retriever with the InMemoryDocumentStore.
Usage example
from haystack import Document
from haystack_experimental.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack_experimental.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)
result = retriever.run(query="Programmiersprache")
print(result["documents"])
InMemoryBM25Retriever.__init__
def __init__(document_store: InMemoryDocumentStore,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False,
filter_policy: FilterPolicy = FilterPolicy.REPLACE)
Create the InMemoryBM25Retriever component.
Arguments:
document_store
: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
filters
: A dictionary with filters to narrow down the retriever's search space in the document store.
top_k
: The maximum number of documents to retrieve.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
filter_policy
: The filter policy to apply during retrieval. Filter policy determines how filters are applied when retrieving documents. You can choose:
- REPLACE (default): Overrides the initialization filters with the filters specified at runtime. Use this policy to dynamically change filtering for specific queries.
- MERGE: Combines runtime filters with initialization filters to narrow down the search.
Raises:
ValueError
: If the specified top_k is not > 0.
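The difference between the REPLACE and MERGE policies can be pictured with a small sketch. `Policy` and `resolve_filters` are hypothetical names, not the library's API, and the MERGE branch simply joins the two filter dictionaries with an AND; the actual merge logic may combine conditions differently.

```python
from enum import Enum
from typing import Any, Dict, Optional


class Policy(Enum):
    """Hypothetical stand-in for the FilterPolicy options."""
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(policy: Policy,
                    init_filters: Optional[Dict[str, Any]],
                    runtime_filters: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    # Without runtime filters, the initialization filters always apply.
    if runtime_filters is None:
        return init_filters
    # REPLACE: runtime filters override whatever was set at initialization.
    if policy is Policy.REPLACE or init_filters is None:
        return runtime_filters
    # MERGE (assumed behavior): both sets of conditions must hold.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.years", "operator": "==", "value": "2019"}
runtime = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = resolve_filters(Policy.REPLACE, init, runtime)
merged = resolve_filters(Policy.MERGE, init, runtime)
print(replaced)
print(merged)
```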
InMemoryBM25Retriever.run_async
@component.output_types(documents=List[Document])
async def run_async(query: str,
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None,
scale_score: Optional[bool] = None)
Run the InMemoryBM25Retriever on the given input data.
Arguments:
query
: The query string for the Retriever.
filters
: A dictionary with filters to narrow down the search space when retrieving documents.
top_k
: The maximum number of documents to return.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
Raises:
ValueError
: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.
Returns:
The retrieved documents.
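To illustrate what scale_score does to a raw, unbounded BM25 score, the sketch below squashes it into the range 0 to 1 with a logistic function. The function name and the scaling factor of 8 are assumptions made for illustration; this section does not state the exact formula the document store uses.

```python
import math


def scale_bm25_score(raw_score: float, scaling_factor: float = 8.0) -> float:
    # Logistic squashing: large positive scores approach 1, large negative scores approach 0.
    return 1.0 / (1.0 + math.exp(-raw_score / scaling_factor))


raw_scores = [-5.0, 0.0, 4.0, 12.0]
scaled_scores = [scale_bm25_score(score) for score in raw_scores]
print(scaled_scores)
```

The ordering of documents is unchanged by the scaling, since the logistic function is monotonic; only the value range differs.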
Module haystack_experimental.components.retrievers.in_memory.embedding_retriever
InMemoryEmbeddingRetriever
Retrieves documents that are most semantically similar to the query.
Use this retriever with the InMemoryDocumentStore.
When using this retriever, make sure it has query and document embeddings available. In indexing pipelines, use a DocumentEmbedder to embed documents. In query pipelines, use a TextEmbedder to embed queries and send them to the retriever.
Usage example
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_experimental.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack_experimental.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs_with_embeddings)
retriever = InMemoryEmbeddingRetriever(doc_store)
query="Programmiersprache"
text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()
query_embedding = text_embedder.run(query)["embedding"]
result = retriever.run(query_embedding=query_embedding)
print(result["documents"])
InMemoryEmbeddingRetriever.__init__
def __init__(document_store: InMemoryDocumentStore,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False,
filter_policy: FilterPolicy = FilterPolicy.REPLACE)
Create the InMemoryEmbeddingRetriever component.
Arguments:
document_store
: An instance of InMemoryDocumentStore where the retriever should search for relevant documents.
filters
: A dictionary with filters to narrow down the retriever's search space in the document store.
top_k
: The maximum number of documents to retrieve.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
return_embedding
: When True, returns the embedding of the retrieved documents. When False, returns just the documents, without their embeddings.
filter_policy
: The filter policy to apply during retrieval. Filter policy determines how filters are applied when retrieving documents. You can choose:
- REPLACE (default): Overrides the initialization filters with the filters specified at runtime. Use this policy to dynamically change filtering for specific queries.
- MERGE: Combines runtime filters with initialization filters to narrow down the search.
Raises:
ValueError
: If the specified top_k is not > 0.
InMemoryEmbeddingRetriever.run_async
@component.output_types(documents=List[Document])
async def run_async(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None,
scale_score: Optional[bool] = None,
return_embedding: Optional[bool] = None)
Run the InMemoryEmbeddingRetriever on the given input data.
Arguments:
query_embedding
: Embedding of the query.
filters
: A dictionary with filters to narrow down the search space when retrieving documents.
top_k
: The maximum number of documents to return.
scale_score
: When True, scales the score of retrieved documents to a range of 0 to 1, where 1 means extremely relevant. When False, uses raw similarity scores.
return_embedding
: When True, returns the embedding of the retrieved documents. When False, returns just the documents, without their embeddings.
Raises:
ValueError
: If the specified DocumentStore is not found or is not an InMemoryDocumentStore instance.
Returns:
The retrieved documents.
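As a rough picture of embedding retrieval, the sketch below ranks stored vectors by their similarity to a query embedding and keeps the top_k best. Cosine similarity is an assumption here (the similarity function used by the store is not stated in this section), and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat zero-length vectors as having no similarity to anything.
    return dot / norm if norm else 0.0


def rank_by_similarity(query_embedding: List[float],
                       doc_embeddings: Dict[str, List[float]],
                       top_k: int = 10) -> List[Tuple[str, float]]:
    scored = [(doc_id, cosine_similarity(query_embedding, emb))
              for doc_id, emb in doc_embeddings.items()]
    # Highest similarity first, then keep at most top_k results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


doc_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.1], doc_embeddings, top_k=2)
print([doc_id for doc_id, _ in ranked])
```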
Module haystack_experimental.components.retrievers.opensearch.bm25_retriever
OpenSearchBM25Retriever
OpenSearch BM25 retriever with async support.
OpenSearchBM25Retriever.__init__
def __init__(*,
document_store: OpenSearchDocumentStore,
filters: Optional[Dict[str, Any]] = None,
fuzziness: str = "AUTO",
top_k: int = 10,
scale_score: bool = False,
all_terms_must_match: bool = False,
filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
custom_query: Optional[Dict[str, Any]] = None,
raise_on_failure: bool = True)
Creates the OpenSearchBM25Retriever component.
Arguments:
document_store
: An instance of OpenSearchDocumentStore to use with the Retriever.
filters
: Filters to narrow down the search for documents in the Document Store.
fuzziness
: Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see OpenSearch fuzzy query.
top_k
: Maximum number of documents to return.
scale_score
: If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
all_terms_must_match
: If True, all terms in the query string must be present in the retrieved documents. This is useful when searching for short text where even one term can make a difference.
filter_policy
: Policy to determine how filters are applied. Possible options:
- replace: Runtime filters replace initialization filters. Use this policy to change the filtering scope for specific queries.
- merge: Runtime filters are merged with initialization filters.
custom_query
: The query containing a mandatory $query and an optional $filters placeholder. An example custom_query:
{
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",                 // mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters"                   // optional filter placeholder
        }
    }
}
An example run() method for this custom_query:
retriever.run(
    query="Why did the revenue increase?",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.years", "operator": "==", "value": "2019"},
            {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
        ],
    },
)
raise_on_failure
: Whether to raise an exception if the API call fails. Otherwise log a warning and return an empty list.
Raises:
ValueError
: If document_store is not an instance of OpenSearchDocumentStore.
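The role of the $query and $filters placeholders in custom_query can be illustrated with a small substitution sketch. `fill_placeholders` is a hypothetical helper, not the retriever's code. It assumes the filters are already expressed as OpenSearch clauses (the term clause below is made up for the example) and that an empty list stands for "no filter".

```python
from typing import Any, Optional


def fill_placeholders(node: Any, query: str, filters: Optional[Any] = None) -> Any:
    """Recursively replace "$query" and "$filters" string values in a query template."""
    if isinstance(node, dict):
        return {key: fill_placeholders(value, query, filters) for key, value in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, query, filters) for item in node]
    if node == "$query":
        return query
    if node == "$filters":
        # Assumption: an empty list means no filter clause is applied.
        return filters if filters is not None else []
    return node


template = {
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters",
        }
    }
}

# Hypothetical, already-normalized OpenSearch filter clause.
body = fill_placeholders(template, "Why did the revenue increase?", [{"term": {"years": "2019"}}])
print(body)
```

The template itself is left untouched, so the same custom_query can be reused for every call.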
OpenSearchBM25Retriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchBM25Retriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchBM25Retriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
OpenSearchBM25Retriever.run
@component.output_types(documents=List[Document])
def run(query: str,
filters: Optional[Dict[str, Any]] = None,
all_terms_must_match: Optional[bool] = None,
top_k: Optional[int] = None,
fuzziness: Optional[str] = None,
scale_score: Optional[bool] = None,
custom_query: Optional[Dict[str, Any]] = None)
Retrieve documents using BM25 retrieval.
Arguments:
query
: The query string.
filters
: Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy specified at Retriever's initialization.
all_terms_must_match
: If True, all terms in the query string must be present in the retrieved documents.
top_k
: Maximum number of documents to return.
fuzziness
: Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see OpenSearch fuzzy query.
scale_score
: If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
custom_query
: A custom OpenSearch query. It must include a $query placeholder and may optionally include a $filters placeholder.
Returns:
A dictionary containing the retrieved documents with the following structure:
- documents: List of retrieved Documents.
OpenSearchBM25Retriever.run_async
@component.output_types(documents=List[Document])
async def run_async(query: str,
filters: Optional[Dict[str, Any]] = None,
all_terms_must_match: Optional[bool] = None,
top_k: Optional[int] = None,
fuzziness: Optional[str] = None,
scale_score: Optional[bool] = None,
custom_query: Optional[Dict[str, Any]] = None)
Retrieve documents using BM25 retrieval.
Arguments:
query
: The query string.
filters
: Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy specified at Retriever's initialization.
all_terms_must_match
: If True, all terms in the query string must be present in the retrieved documents.
top_k
: Maximum number of documents to return.
fuzziness
: Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see OpenSearch fuzzy query.
scale_score
: If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
custom_query
: A custom OpenSearch query. It must include a $query placeholder and may optionally include a $filters placeholder.
Returns:
A dictionary containing the retrieved documents with the following structure:
- documents: List of retrieved Documents.
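The point of run_async is that several retrievals can be awaited concurrently. The sketch below shows the calling pattern with asyncio.gather. `StubRetriever` is a hypothetical stand-in that mimics only the query and top_k keyword arguments and does naive substring matching, so the example runs without an OpenSearch cluster; it says nothing about how the real retriever scores documents.

```python
import asyncio
from typing import Any, Dict, List, Optional


class StubRetriever:
    """Hypothetical stand-in exposing a run_async method with similar keyword arguments."""

    def __init__(self, corpus: List[str]) -> None:
        self.corpus = corpus

    async def run_async(self, query: str, filters: Optional[Dict[str, Any]] = None,
                        top_k: Optional[int] = None) -> Dict[str, List[str]]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        hits = [doc for doc in self.corpus if query.lower() in doc.lower()]
        return {"documents": hits[: top_k or 10]}


async def retrieve_many(retriever: StubRetriever, queries: List[str]) -> List[Dict[str, List[str]]]:
    # Schedule all queries at once; gather returns results in the order of the inputs.
    return await asyncio.gather(*(retriever.run_async(query=q, top_k=2) for q in queries))


retriever = StubRetriever(["Revenue increased in 2019", "Costs fell in Q2", "Revenue was flat"])
results = asyncio.run(retrieve_many(retriever, ["revenue", "costs"]))
print(results)
```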
Module haystack_experimental.components.retrievers.opensearch.embedding_retriever
OpenSearchEmbeddingRetriever
OpenSearch embedding retriever with async support.
OpenSearchEmbeddingRetriever.__init__
def __init__(*,
document_store: OpenSearchDocumentStore,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
custom_query: Optional[Dict[str, Any]] = None,
raise_on_failure: bool = True)
Create the OpenSearchEmbeddingRetriever component.
Arguments:
document_store
: An instance of OpenSearchDocumentStore to use with the Retriever.
filters
: Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents.
top_k
: Maximum number of documents to return.
filter_policy
: Policy to determine how filters are applied. Possible options:
- merge: Runtime filters are merged with initialization filters.
- replace: Runtime filters replace initialization filters. Use this policy to change the filtering scope.
custom_query
: The custom OpenSearch query containing a mandatory $query_embedding and an optional $filters placeholder.
raise_on_failure
: If True, raises an exception if the API call fails. If False, logs a warning and returns an empty list.
Raises:
ValueError
: If document_store is not an instance of OpenSearchDocumentStore.
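For orientation, a custom_query for this retriever might look like the dictionary below, built around an OpenSearch knn clause. The vector field name "embedding" and the k value are assumptions chosen for illustration; adjust them to the index mapping actually in use.

```python
import json

# Sketch of a custom query with the mandatory and the optional placeholder.
custom_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {                     # assumed name of the vector field
                            "vector": "$query_embedding",  # mandatory placeholder
                            "k": 10000,                    # assumed number of neighbors to consider
                        }
                    }
                }
            ],
            "filter": "$filters",                          # optional placeholder
        }
    }
}

serialized = json.dumps(custom_query, indent=2)
print(serialized)
```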
OpenSearchEmbeddingRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchEmbeddingRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchEmbeddingRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
OpenSearchEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None,
custom_query: Optional[Dict[str, Any]] = None)
Retrieve documents using a vector similarity metric.
Arguments:
query_embedding
: Embedding of the query.
filters
: Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. The way runtime filters are applied depends on the filter_policy selected when initializing the Retriever.
top_k
: Maximum number of documents to return.
custom_query
: A custom OpenSearch query containing a mandatory $query_embedding and an optional $filters placeholder.
Returns:
Dictionary with key "documents" containing the retrieved Documents.
- documents: List of Documents similar to query_embedding.
OpenSearchEmbeddingRetriever.run_async
@component.output_types(documents=List[Document])
async def run_async(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None,
custom_query: Optional[Dict[str, Any]] = None)
Retrieve documents using a vector similarity metric.
Arguments:
query_embedding
: Embedding of the query.
filters
: Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. The way runtime filters are applied depends on the filter_policy selected when initializing the Retriever.
top_k
: Maximum number of documents to return.
custom_query
: A custom OpenSearch query containing a mandatory $query_embedding and an optional $filters placeholder.
Returns:
Dictionary with key "documents" containing the retrieved Documents.
- documents: List of Documents similar to query_embedding.