Sweep through Document Stores and return a set of candidate documents that are relevant to the query.
Module haystack_experimental.components.retrievers.auto_merging_retriever
AutoMergingRetriever
A retriever that returns the parent documents of matched leaf documents, based on a threshold setting.
The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create such a structure. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents.
The rationale: when a paragraph is split into multiple chunks represented as leaf documents and several of those chunks match a given query, the whole paragraph is likely to be more informative than the individual chunks alone.
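The merge decision described above comes down to a ratio check. The following is a minimal sketch, assuming the ratio is the number of matched children divided by the parent's total number of children and that the comparison is strict ("higher than"). `should_merge` is a hypothetical helper for illustration, not part of the library.

```python
def should_merge(matched_children: int, total_children: int, threshold: float = 0.5) -> bool:
    """Return True when the share of matched children exceeds the threshold."""
    if total_children <= 0:
        # A parent without children cannot be merged into.
        return False
    return matched_children / total_children > threshold


print(should_merge(2, 3))  # 2/3 is about 0.67, which exceeds 0.5
print(should_merge(1, 3))  # 1/3 is about 0.33, which does not
```

Under these assumptions, a ratio exactly equal to the threshold (for example 1 of 2 children with threshold=0.5) would not trigger a merge.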
Currently the AutoMergingRetriever can only be used by the following DocumentStores:
from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack_experimental.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]
# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)
# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children and we retrieved 2 of them (2/3 ≈ 0.67, which exceeds threshold=0.5)
leaf_docs = [doc for doc in docs if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
>> {'documents': [Document(id=538..),
>> content: 'warm glow over the trees. Birds began to sing.',
>> meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
>> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
AutoMergingRetriever.__init__
def __init__(document_store: DocumentStore, threshold: float = 0.5)
Initialize the AutoMergingRetriever.
Arguments:
document_store
: DocumentStore from which to retrieve the parent documents
threshold
: Threshold to decide whether the parent instead of the individual documents is returned
AutoMergingRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
AutoMergingRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AutoMergingRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary with serialized data.
Returns:
An instance of the component.
AutoMergingRetriever.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Run the AutoMergingRetriever.
Recursively groups documents by their parents and merges them if they meet the threshold, continuing up the hierarchy until no more merges are possible.
Arguments:
documents
: List of leaf documents that were matched by a retriever
Returns:
List of documents (could be a mix of different hierarchy levels)
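The recursive grouping described for `run` can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `Node` is a hypothetical stand-in for a document carrying `parent_id` and `children_ids`, `parents` is a hypothetical lookup that plays the role of the document store, and the hierarchy is assumed to be a consistent tree.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a document in the hierarchy."""
    id: str
    parent_id: Optional[str] = None
    children_ids: tuple = ()


def merge_up(matched: List[Node], parents: Dict[str, Node], threshold: float = 0.5) -> List[Node]:
    """Replace sibling groups with their parent while any group exceeds the threshold."""
    by_parent = defaultdict(list)
    result = []
    for node in matched:
        if node.parent_id in parents:
            by_parent[node.parent_id].append(node)
        else:
            result.append(node)  # no known parent: keep the node as is

    merged_any = False
    for parent_id, children in by_parent.items():
        parent = parents[parent_id]
        total = max(len(parent.children_ids), 1)  # guard against inconsistent data
        if len(children) / total > threshold:
            result.append(parent)  # enough siblings matched: return the parent
            merged_any = True
        else:
            result.extend(children)  # too few siblings matched: keep them

    # Continue up the hierarchy only if this pass merged something.
    return merge_up(result, parents, threshold) if merged_any else result


root = Node("root", None, ("p1", "p2"))
p1 = Node("p1", "root", ("a", "b", "c"))
p2 = Node("p2", "root", ("d", "e", "f"))
leaves = {name: Node(name, "p1" if name in "abc" else "p2") for name in "abcdef"}
parents = {n.id: n for n in (root, p1, p2)}

mixed = merge_up([leaves["b"], leaves["c"], leaves["d"]], parents)
print([n.id for n in mixed])  # ['p1', 'd']

full = merge_up([leaves["a"], leaves["b"], leaves["d"], leaves["e"]], parents)
print([n.id for n in full])  # ['root']
```

The first call shows how the result can mix hierarchy levels: two of three children of `p1` are merged into `p1`, while the single match under `p2` stays a leaf. In the second call both level-1 parents are merged, which in turn triggers a merge into `root`.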
Module haystack_experimental.components.retrievers.chat_message_retriever
ChatMessageRetriever
Retrieves chat messages from the underlying ChatMessageStore.
Usage example:
from haystack.dataclasses import ChatMessage
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore
messages = [
    ChatMessage.from_assistant("Hello, how can I help you?"),
    ChatMessage.from_user("Hi, I have a question about Python. What is a Protocol?"),
]
message_store = InMemoryChatMessageStore()
message_store.write_messages(messages)
retriever = ChatMessageRetriever(message_store)
result = retriever.run()
print(result["messages"])
ChatMessageRetriever.__init__
def __init__(message_store: ChatMessageStore, last_k: int = 10)
Create the ChatMessageRetriever component.
Arguments:
message_store
: An instance of a ChatMessageStore.
last_k
: The number of last messages to retrieve. Defaults to 10 messages if not specified.
ChatMessageRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
ChatMessageRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChatMessageRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
ChatMessageRetriever.run
@component.output_types(messages=List[ChatMessage])
def run(last_k: Optional[int] = None)
Run the ChatMessageRetriever.
Arguments:
last_k
: The number of last messages to retrieve. This parameter takes precedence over the last_k parameter passed to the ChatMessageRetriever constructor. If unspecified, the last_k parameter passed to the constructor will be used.
Raises:
ValueError
: If last_k is not None and is less than 1.
Returns:
messages
- The retrieved chat messages.
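The precedence and validation rules for `last_k` can be illustrated with a small sketch. It assumes messages are kept in chronological order and uses plain strings in place of ChatMessage objects. `select_last_messages` and its parameter names `init_last_k` and `run_last_k` are hypothetical, chosen only to distinguish the constructor value from the run-time value.

```python
from typing import List, Optional


def select_last_messages(
    messages: List[str],
    init_last_k: int = 10,
    run_last_k: Optional[int] = None,
) -> List[str]:
    """Return the last k messages, preferring the run-time value over the constructor value."""
    if run_last_k is not None and run_last_k < 1:
        raise ValueError("last_k must be at least 1")
    # The run-time value wins when given; otherwise fall back to the constructor value.
    k = run_last_k if run_last_k is not None else init_last_k
    return messages[-k:]


history = ["m1", "m2", "m3"]
print(select_last_messages(history, init_last_k=2))                # ['m2', 'm3']
print(select_last_messages(history, init_last_k=2, run_last_k=1))  # ['m3']
```

If k exceeds the number of stored messages, the slice simply returns all of them.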