DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord🎨 Studio
Documentation

AutoMergingRetriever

Use AutoMergingRetriever to improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query.

Most common position in a pipelineUsed after the main Retriever component that returns hierarchical documents.
Mandatory init variables“document_store”: Document Store from which to retrieve the parent documents
Mandatory run variables“documents”: A list of leaf documents that were matched by a Retriever
Output variables“documents”: A list resulting documents
API referenceRetrievers
GitHub linkhttps://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py

Overview

The AutoMergingRetriever is a component that works with a hierarchical document structure. It returns the parent documents instead of individual leaf documents when a certain threshold is met.

This can be particularly useful when working with paragraphs split into multiple chunks. When several chunks from the same paragraph match your query, the complete paragraph often provides more context and value than the individual pieces alone.

Here is how this Retriever works:

  1. It requires documents to be organized in a tree structure, with leaf nodes stored in a document index - see HierarchicalDocumentSplitter documentation.
  2. When searching, it counts how many leaf documents under the same parent match your query.
  3. If this count exceeds your defined threshold, it returns the parent document instead of the individual leaves.

The AutoMergingRetriever can currently be used by the following Document Stores:

Usage

On its own

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs["documents"]:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent, the parent document should be returned,
# since it has 3 children and the threshold=0.5, and we retrieved 2 children (2/3 > 0.66(6))
leaf_docs = [doc for doc in docs["documents"] if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
>> {'documents': [Document(id=538..),
>> content: 'warm glow over the trees. Birds began to sing.',
>> meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
>> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}

In a pipeline

This is an example of a RAG Haystack pipeline. It first retrieves leaf-level document chunks using BM25, merges them into higher-level parent documents with AutoMergingRetriever, constructs a prompt, and generates an answer using OpenAI's chat model.

from typing import List, Tuple
from haystack import Document, Pipeline
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.dataclasses import ChatMessage

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)

    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store

# Add documents
docs = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")
]

leaf_docs, parent_docs = indexing(docs)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given these documents, answer the question.\nDocuments:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}\n"
        "Question: {{question}}\nAnswer:"
    )
]

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=leaf_docs), name="bm25_retriever")
rag_pipeline.add_component(instance=AutoMergingRetriever(parent_docs, threshold=0.6), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=prompt_template, required_variables={"question", "documents"}), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("bm25_retriever.documents", "retriever.documents")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.messages", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are there?"
result = rag_pipeline.run({
    "bm25_retriever": {"query": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question}
})