PyversityRanker
Use this component to rerank documents by balancing relevance and diversity using pyversity's diversification algorithms.
|  |  |
| --- | --- |
| Most common position in a pipeline | In a query pipeline, after a dense Retriever with `return_embedding=True` |
| Mandatory init variables | None |
| Mandatory run variables | `documents`: A list of `Document` objects, each with `score` and `embedding` set |
| Output variables | `documents`: A list of `Document` objects |
| API reference | Pyversity |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pyversity |
Overview
PyversityRanker reranks Documents using pyversity's diversification algorithms. Unlike similarity-based rankers, it balances relevance and diversity: the output is not just the most relevant documents, but a varied selection that avoids redundancy.
Documents must have both `score` and `embedding` populated. This makes it a natural fit after a dense retriever such as `InMemoryEmbeddingRetriever` configured with `return_embedding=True`. Documents missing either field are skipped with a warning.
The key parameters are:
- `strategy`: The diversification algorithm to use. Defaults to `Strategy.DPP` (Determinantal Point Process). `Strategy.MMR` (Maximal Marginal Relevance) is another popular option.
- `diversity`: A float in `[0, 1]` controlling the relevance–diversity trade-off. `0.0` keeps the most relevant documents; `1.0` maximises diversity regardless of relevance. Defaults to `0.5`.
- `top_k`: The number of documents to return. If `None`, all documents are returned in diversified order.
Installation
To start using this integration with Haystack, install the package with:
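Assuming the package follows the standard `<integration>-haystack` naming scheme used across the haystack-core-integrations repository:

```shell
pip install pyversity-haystack
```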
Usage
On its own
This example uses PyversityRanker to rerank five documents. Each document must have a score and embedding set. The ranker returns the top 3 documents using the MMR strategy with a diversity of 0.7.
```python
from haystack import Document
from pyversity import Strategy

from haystack_integrations.components.rankers.pyversity import PyversityRanker

documents = [
    Document(
        content="Paris is the capital of France.",
        score=0.95,
        embedding=[0.9, 0.1, 0.0, 0.0],
    ),
    Document(
        content="The Eiffel Tower is located in Paris.",
        score=0.90,
        embedding=[0.8, 0.2, 0.0, 0.0],
    ),
    Document(
        content="Berlin is the capital of Germany.",
        score=0.85,
        embedding=[0.0, 0.0, 0.9, 0.1],
    ),
    Document(
        content="The Brandenburg Gate is in Berlin.",
        score=0.80,
        embedding=[0.0, 0.0, 0.8, 0.2],
    ),
    Document(
        content="France borders Spain to the south.",
        score=0.75,
        embedding=[0.5, 0.5, 0.0, 0.0],
    ),
]

ranker = PyversityRanker(top_k=3, strategy=Strategy.MMR, diversity=0.7)
result = ranker.run(documents=documents)

for doc in result["documents"]:
    print(f"{doc.score:.2f} {doc.content}")
```
In a pipeline
Below is an example of a pipeline that embeds documents and stores them in an InMemoryDocumentStore. It then retrieves the top 6 documents using InMemoryEmbeddingRetriever and reranks them with PyversityRanker to return 3 diverse results.
Note that the retriever must be configured with return_embedding=True so that documents have embeddings available for the ranker.
```python
from haystack import Document, Pipeline
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from pyversity import Strategy

from haystack_integrations.components.rankers.pyversity import PyversityRanker

# Index documents
document_store = InMemoryDocumentStore()
raw_documents = [
    Document(content="Paris is the capital of France."),
    Document(content="The Eiffel Tower is located in Paris."),
    Document(content="Berlin is the capital of Germany."),
    Document(content="The Brandenburg Gate is in Berlin."),
    Document(content="France borders Spain to the south."),
    Document(content="The Louvre is the world's largest art museum and is in Paris."),
    Document(content="Munich is the capital of Bavaria."),
    Document(content="The Rhine river flows through Germany and France."),
]

doc_embedder = SentenceTransformersDocumentEmbedder()
documents_with_embeddings = doc_embedder.run(raw_documents)["documents"]
document_store.write_documents(documents_with_embeddings)

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=6,
        return_embedding=True,
    ),
)
pipeline.add_component(
    "ranker",
    PyversityRanker(top_k=3, strategy=Strategy.MMR, diversity=0.7),
)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "ranker.documents")

# Run
result = pipeline.run(
    {"text_embedder": {"text": "What are the famous landmarks in France?"}},
)

for doc in result["ranker"]["documents"]:
    print(f"{doc.score:.4f} {doc.content}")
```
PyversityRanker requires documents to have both `score` and `embedding` set. When using a dense retriever, make sure to pass `return_embedding=True`. Documents missing either field are skipped with a warning.