DocumentJoiner
Use this component in hybrid retrieval pipelines or indexing pipelines with multiple file converters to join lists of Documents.
Name | DocumentJoiner |
Folder Path | /joiners/ |
Position in a Pipeline | In indexing and query Pipelines, after components that return a list of Documents such as multiple Retrievers or multiple File Converters. |
Inputs | “documents”: List of Document objects. This input is variadic, meaning you can connect a variable number of components to it. |
Outputs | “documents”: List of Document objects |
Overview
DocumentJoiner joins input lists of Documents from multiple connections and outputs them as one list. You can choose how the lists are joined by specifying the join_mode. There are three options available:
- concatenate: Combines Documents from multiple components, discarding any duplicates. Documents get their scores from the last component in the pipeline that assigns scores. This mode doesn’t influence Document scores.
- merge: Merges the scores of duplicate Documents coming from multiple components. You can also assign weights to the scores to influence how they’re merged and set the top_k limit to specify how many Documents you want DocumentJoiner to return.
- reciprocal_rank_fusion: Combines Documents into a single list based on their ranking received from multiple components. It then calculates a new score based on the ranks of Documents in the input lists. If the same Document appears in more than one list (was returned by multiple components), it gets a higher score.
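To make the rank-based mode more concrete, here is a minimal sketch of the general reciprocal rank fusion idea in plain Python. It is an illustration under stated assumptions, not DocumentJoiner's actual implementation: the function name, the use of bare string IDs instead of Document objects, and the constant k = 60 are all choices made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs into (doc_id, score) pairs, best first.

    Assumption: k = 60 is a commonly used smoothing constant; the value a
    given library uses may differ.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each appearance adds 1 / (k + rank), so earlier positions
            # contribute more and repeated appearances accumulate.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from two retrievers, identified by short IDs.
bm25_ranking = ["paris", "berlin"]
embedding_ranking = ["paris", "rome"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Because "paris" appears in both rankings, its contributions add up and it ends up first, which matches the behavior described for Documents returned by multiple components.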
Usage
On its own
The example below uses DocumentJoiner in merge mode to combine two lists of Documents. Running the component with both lists returns a single list of Documents ranked by their combined scores. By default, each input list is given equal weight. To use custom weights, set the weights parameter to a list of floats with one weight per input component.
from haystack import Document
from haystack.components.joiners.document_joiner import DocumentJoiner
docs_1 = [Document(content="Paris is the capital of France.", score=0.5), Document(content="Berlin is the capital of Germany.", score=0.4)]
docs_2 = [Document(content="Paris is the capital of France.", score=0.6), Document(content="Rome is the capital of Italy.", score=0.5)]
joiner = DocumentJoiner(join_mode="merge")
joiner.run(documents=[docs_1, docs_2])
# {'documents': [Document(id=0f5beda04153dbfc462c8b31f8536749e43654709ecf0cfe22c6d009c9912214, content: 'Paris is the capital of France.', score: 0.55), Document(id=424beed8b549a359239ab000f33ca3b1ddb0f30a988bbef2a46597b9c27e42f2, content: 'Rome is the capital of Italy.', score: 0.25), Document(id=312b465e77e25c11512ee76ae699ce2eb201f34c8c51384003bb367e24fb6cf8, content: 'Berlin is the capital of Germany.', score: 0.2)]}
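The scores in the output above are consistent with a weighted sum in which the weights are normalized to sum to 1 (for example, 0.5 × 0.5 + 0.5 × 0.6 = 0.55 for the Paris Document). The sketch below reproduces that arithmetic in plain Python. It is inferred from the example output and is not a statement about DocumentJoiner's internals; the function name and the use of (content, score) tuples in place of Document objects are assumptions made for illustration.

```python
from collections import defaultdict


def merge_scores(scored_lists, weights=None):
    """Combine per-list scores into one weighted score per document.

    Assumption: weights are normalized to sum to 1, and a document missing
    from a list contributes 0 for that list.
    """
    if weights is None:
        weights = [1.0] * len(scored_lists)
    total = sum(weights)
    normalized = [w / total for w in weights]

    merged = defaultdict(float)
    for weight, scored in zip(normalized, scored_lists):
        for content, score in scored:
            merged[content] += weight * score
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


docs_1 = [("Paris is the capital of France.", 0.5), ("Berlin is the capital of Germany.", 0.4)]
docs_2 = [("Paris is the capital of France.", 0.6), ("Rome is the capital of Italy.", 0.5)]

# Equal weights: same ordering and scores as the example output above.
equal = merge_scores([docs_1, docs_2])

# Custom weights: the second list counts three times as much as the first.
weighted = merge_scores([docs_1, docs_2], weights=[0.25, 0.75])

for content, score in equal:
    print(round(score, 2), content)
```

With the custom weights, the Rome Document (found only in the second list) gains score relative to the Berlin Document (found only in the first), which is the kind of effect the weights parameter is meant to give you.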
In a Pipeline
Below is an example of a hybrid retrieval pipeline that retrieves Documents from an InMemoryDocumentStore based on keyword search (using InMemoryBM25Retriever) and embedding search (using InMemoryEmbeddingRetriever). It then uses the DocumentJoiner with its default join mode to concatenate the retrieved Documents into one list. The DocumentStore must contain Documents with embeddings; otherwise, the InMemoryEmbeddingRetriever will not return any Documents.
from haystack.components.joiners.document_joiner import DocumentJoiner
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="bm25_retriever")
p.add_component(
    instance=SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
    name="text_embedder",
)
p.add_component(instance=InMemoryEmbeddingRetriever(document_store=document_store), name="embedding_retriever")
p.add_component(instance=DocumentJoiner(), name="joiner")
p.connect("bm25_retriever", "joiner")
p.connect("embedding_retriever", "joiner")
p.connect("text_embedder", "embedding_retriever")
query = "What is the capital of France?"
p.run(data={"bm25_retriever": {"query": query},
            "text_embedder": {"text": query}})
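The pipeline above relies on the default concatenate mode, which joins the retrievers' results and discards duplicates. As a rough illustration of what that means for two result lists, here is a plain-Python sketch. It is not DocumentJoiner's code: the function name, the (doc_id, score) tuples, the sample scores, and the rule for which duplicate survives are all assumptions made for this example.

```python
def concatenate(document_lists):
    """Flatten several lists of (doc_id, score) pairs, one entry per doc_id."""
    best = {}
    for documents in document_lists:
        for doc_id, score in documents:
            # Assumption: when a duplicate appears, keep the higher-scored copy.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Scores are passed through unchanged; nothing is rescaled or combined.
    return list(best.items())


# Hypothetical results: BM25 scores and embedding similarities use different scales.
bm25_hits = [("doc-paris", 7.2), ("doc-berlin", 3.1)]
embedding_hits = [("doc-paris", 0.83), ("doc-rome", 0.61)]

joined = concatenate([bm25_hits, embedding_hits])
print(joined)
```

Since this mode leaves scores untouched, scores that come from different retrievers may not be directly comparable. If you need a single meaningful ranking, the merge or reciprocal_rank_fusion modes described in the overview are likely a better fit.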
See the parameter details in our API reference.