Version: 2.26-unstable

Samplers

top_p

TopPSampler

Implements top-p (nucleus) sampling for document filtering based on cumulative probability scores.

This component filters a list of documents, keeping the highest-scoring ones whose cumulative probability falls within the top_p threshold (a fraction between 0 and 1, not a percentage). It is useful for keeping high-probability documents and discarding less relevant ones based on their assigned scores.

Usage example:

python

from haystack import Document
from haystack.components.samplers import TopPSampler

# Read each score from meta["similarity_score"]; top_p=0.95 sets the cumulative probability threshold.
sampler = TopPSampler(top_p=0.95, score_field="similarity_score")
docs = [
    Document(content="Berlin", meta={"similarity_score": -10.6}),
    Document(content="Belgrade", meta={"similarity_score": -8.9}),
    Document(content="Sarajevo", meta={"similarity_score": -4.6}),
]
output = sampler.run(documents=docs)
docs = output["documents"]
# Only the highest-scoring document is kept.
assert len(docs) == 1
assert docs[0].content == "Sarajevo"
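
To see why the example above returns a single document, it can help to work through the arithmetic by hand. The sketch below assumes the sampler turns raw scores into probabilities with a softmax and then accumulates them in descending order. That normalization step is an assumption made for illustration, not something stated on this page, and the variable names are hypothetical.

```python
import math
from itertools import accumulate

# Scores copied from the usage example above.
scores = {"Berlin": -10.6, "Belgrade": -8.9, "Sarajevo": -4.6}

# Softmax (assumed normalization); subtracting the max keeps exp() numerically stable.
max_score = max(scores.values())
exps = {name: math.exp(s - max_score) for name, s in scores.items()}
total = sum(exps.values())
probs = {name: e / total for name, e in exps.items()}

# Rank by descending probability and build the running (cumulative) sum.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
cumulative = list(accumulate(p for _, p in ranked))

for (name, p), c in zip(ranked, cumulative):
    print(f"{name:9s} prob={p:.4f} cumulative={c:.4f}")
```

Under that assumption, Sarajevo alone carries roughly 0.98 of the probability mass, which is already above top_p=0.95, so it is the only document kept.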

init

python

__init__(
    top_p: float = 1.0,
    score_field: str | None = None,
    min_top_k: int | None = None,
)

Creates an instance of TopPSampler.

Parameters:

top_p (float) – Float between 0 and 1 representing the cumulative probability threshold for document selection. A value of 1.0 indicates no filtering (all documents are retained).
score_field (str | None) – Name of the field in each document's metadata that contains the score. If None, the default document score field is used.
min_top_k (int | None) – If specified, the minimum number of documents to return. If top_p selects fewer than min_top_k documents, the documents with the next-highest scores are added until min_top_k is reached.
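
The effect of min_top_k can be sketched as a padding step applied after the top-p cut. The helper below is hypothetical (pad_to_min_top_k is not part of Haystack) and assumes the documents are already ranked by descending score.

```python
from typing import Optional, Sequence


def pad_to_min_top_k(
    ranked: Sequence[str],
    selected_count: int,
    min_top_k: Optional[int] = None,
) -> list:
    """Return the leading documents, padded up to min_top_k if needed.

    ranked: documents sorted by descending score (hypothetical input).
    selected_count: how many documents the top-p cut kept.
    """
    if min_top_k is None:
        return list(ranked[:selected_count])
    # Take whichever is larger; slicing past the end simply returns everything.
    return list(ranked[: max(selected_count, min_top_k)])


ranked_docs = ["Sarajevo", "Belgrade", "Berlin"]
print(pad_to_min_top_k(ranked_docs, selected_count=1, min_top_k=2))
# ['Sarajevo', 'Belgrade']
```

If min_top_k exceeds the number of available documents, this sketch returns all of them rather than raising an error. Whether the library behaves the same way is not stated here.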

run

python

run(documents: list[Document], top_p: float | None = None)

Filters documents using top-p sampling based on their scores.

If the specified top_p selects no documents, which can happen when top_p is low, the method returns the document with the highest score.
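
A minimal sketch of the selection rule with its fallback, assuming the probabilities are already sorted in descending order and sum to 1. The function name count_kept, the "cumulative probability <= top_p" comparison, and the small tolerance are assumptions for illustration rather than the library's confirmed implementation.

```python
from itertools import accumulate

TOLERANCE = 1e-6  # assumed slack for floating-point rounding in the running sum


def count_kept(sorted_probs: list, top_p: float) -> int:
    """Return how many leading documents survive the top-p cut.

    sorted_probs must be non-negative and in descending order (hypothetical helper).
    """
    if not sorted_probs:
        return 0
    # Running sums never decrease, so the entries within the threshold form a prefix.
    within = sum(1 for c in accumulate(sorted_probs) if c <= top_p + TOLERANCE)
    # Fallback: never return an empty selection; keep the highest-scoring document.
    return max(within, 1)


print(count_kept([0.5, 0.3, 0.2], top_p=0.8))        # 2
print(count_kept([0.98, 0.015, 0.005], top_p=0.95))  # 1 (fallback)
```

In the second call no cumulative value fits under 0.95, so the fallback keeps only the top entry.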

Parameters:

documents (list[Document]) – List of Document objects to be filtered.
top_p (float | None) – If specified, a float to override the cumulative probability threshold set during initialization.

Returns:

A dictionary with the following key:
documents: List of Document objects selected by the top-p sampling.

Raises:

ValueError – If the top_p value is not within the range [0, 1].
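
The interplay between the per-call override and the range check can be sketched as follows. resolve_top_p is a hypothetical helper, not a Haystack function, and the exact error message is an assumption.

```python
from typing import Optional


def resolve_top_p(init_top_p: float, run_top_p: Optional[float] = None) -> float:
    """Pick the effective threshold and validate it (hypothetical helper)."""
    # Compare against None explicitly so that an override of 0.0 is respected.
    top_p = init_top_p if run_top_p is None else run_top_p
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be between 0 and 1, got {top_p}")
    return top_p


print(resolve_top_p(0.95))        # 0.95 (value from initialization)
print(resolve_top_p(0.95, 0.5))   # 0.5 (per-call override wins)
```

Passing a value such as 1.5 raises ValueError in this sketch, mirroring the documented behavior of run.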
