DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord🎨 Studio
API Reference

Sweep through Document Stores and return a set of candidate documents that are relevant to the query.

Module haystack_experimental.components.retrievers.auto_merging_retriever

AutoMergingRetriever

A retriever which returns parent documents of the matched leaf nodes documents, based on a threshold setting.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create such a structure. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents.

The rational is, given that a paragraph is split into multiple chunks represented as leaf documents, and if for a given query, multiple chunks are matched, the whole paragraph might be more informative than the individual chunks alone.

Currently the AutoMergingRetriever can only be used by the following DocumentStores:

from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack_experimental.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs["documents"]:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent, the parent document should be returned,
# since it has 3 children and the threshold=0.5, and we retrieved 2 children (2/3 > 0.66(6))
leaf_docs = [doc for doc in docs["documents"] if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
>> {'documents': [Document(id=538..),
>> content: 'warm glow over the trees. Birds began to sing.',
>> meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
>> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}

AutoMergingRetriever.__init__

def __init__(document_store: DocumentStore, threshold: float = 0.5)

Initialize the AutoMergingRetriever.

Arguments:

  • document_store: DocumentStore from which to retrieve the parent documents
  • threshold: Threshold to decide whether the parent instead of the individual documents is returned

AutoMergingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

AutoMergingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AutoMergingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary with serialized data.

Returns:

An instance of the component.

AutoMergingRetriever.run

@component.output_types(documents=List[Document])
def run(documents: List[Document])

Run the AutoMergingRetriever.

Recursively groups documents by their parents and merges them if they meet the threshold, continuing up the hierarchy until no more merges are possible.

Arguments:

  • documents: List of leaf documents that were matched by a retriever

Returns:

List of documents (could be a mix of different hierarchy levels)

Module haystack_experimental.components.retrievers.chat_message_retriever

ChatMessageRetriever

Retrieves chat messages from the underlying ChatMessageStore.

Usage example:

from haystack.dataclasses import ChatMessage
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore

messages = [
    ChatMessage.from_assistant("Hello, how can I help you?"),
    ChatMessage.from_user("Hi, I have a question about Python. What is a Protocol?"),
]

message_store = InMemoryChatMessageStore()
message_store.write_messages(messages)
retriever = ChatMessageRetriever(message_store)

result = retriever.run()

print(result["messages"])

ChatMessageRetriever.__init__

def __init__(message_store: ChatMessageStore, last_k: int = 10)

Create the ChatMessageRetriever component.

Arguments:

  • message_store: An instance of a ChatMessageStore.
  • last_k: The number of last messages to retrieve. Defaults to 10 messages if not specified.

ChatMessageRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

ChatMessageRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChatMessageRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

ChatMessageRetriever.run

@component.output_types(messages=List[ChatMessage])
def run(last_k: Optional[int] = None)

Run the ChatMessageRetriever

Arguments:

  • last_k: The number of last messages to retrieve. This parameter takes precedence over the last_k parameter passed to the ChatMessageRetriever constructor. If unspecified, the last_k parameter passed to the constructor will be used.

Raises:

  • ValueError: If last_k is not None and is less than 1

Returns:

  • messages - The retrieved chat messages.