Reorders a set of Documents based on their relevance to the query.
Module lost_in_the_middle
LostInTheMiddleRanker
A LostInTheMiddle Ranker.
Ranks documents based on the 'lost in the middle' order so that the most relevant documents are either at the beginning or end, while the least relevant are in the middle.
LostInTheMiddleRanker assumes that some prior component in the pipeline has already ranked documents by relevance and requires no query as input but only documents. It is typically used as the last component before building a prompt for an LLM to prepare the input context for the LLM.
Lost in the Middle ranking lays out document contents in the LLM context so that the most relevant contents are at the beginning or end of the input context, while the least relevant are in the middle. See the paper "Lost in the Middle: How Language Models Use Long Contexts" for more details.
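The ordering described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the helper name `lost_in_the_middle_order` is hypothetical, and alternating items between the front and the back of the list is just one simple way to satisfy the description, assuming the input is already sorted from most to least relevant.

```python
from typing import List, TypeVar

T = TypeVar("T")


def lost_in_the_middle_order(ranked: List[T]) -> List[T]:
    """Reorder items so the least relevant ones land in the middle.

    Assumes `ranked` is sorted from most to least relevant.
    """
    # The 1st, 3rd, 5th, ... most relevant items fill the front.
    front = ranked[0::2]
    # The 2nd, 4th, ... most relevant items fill the back in reverse order,
    # so the 2nd most relevant item ends up in the last position.
    back = ranked[1::2][::-1]
    return front + back


print(lost_in_the_middle_order(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

In this sketch the two most relevant items (`d1`, `d2`) sit at the two ends and the least relevant item (`d5`) sits in the middle.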
Usage example:
from haystack.components.rankers import LostInTheMiddleRanker
from haystack import Document
ranker = LostInTheMiddleRanker()
docs = [Document(content="Paris"), Document(content="Berlin"), Document(content="Madrid")]
result = ranker.run(documents=docs)
for doc in result["documents"]:
    print(doc.content)
LostInTheMiddleRanker.__init__
def __init__(word_count_threshold: Optional[int] = None,
             top_k: Optional[int] = None)
Initialize the LostInTheMiddleRanker.
If 'word_count_threshold' is specified, this ranker includes all documents up until the point where adding another document would exceed the 'word_count_threshold'. The last document that causes the threshold to be breached will be included in the resulting list of documents, but all subsequent documents will be discarded.
Arguments:
word_count_threshold
: The maximum total number of words across all documents selected by the ranker.
top_k
: The maximum number of documents to return.
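The `word_count_threshold` behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the component's code: the function name `select_by_word_count` is hypothetical, words are counted by splitting on whitespace, and the threshold counts as breached only when the running total is strictly greater than it.

```python
from typing import List


def select_by_word_count(texts: List[str], word_count_threshold: int) -> List[str]:
    """Keep texts until the running word count breaches the threshold.

    The text that causes the breach is still included; later texts are dropped.
    """
    selected: List[str] = []
    total_words = 0
    for text in texts:
        selected.append(text)
        # Assumption: a "word" is a whitespace-separated token.
        total_words += len(text.split())
        # Assumption: "breached" means strictly greater than the threshold.
        if total_words > word_count_threshold:
            break
    return selected


texts = ["one two three", "four five six", "seven eight", "nine"]
print(select_by_word_count(texts, word_count_threshold=5))
# ['one two three', 'four five six']
```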
LostInTheMiddleRanker.run
@component.output_types(documents=List[Document])
def run(documents: List[Document],
        top_k: Optional[int] = None,
        word_count_threshold: Optional[int] = None
        ) -> Dict[str, List[Document]]
Reranks documents based on the "lost in the middle" order.
Arguments:
documents
: List of Documents to reorder.
top_k
: The maximum number of documents to return.
word_count_threshold
: The maximum total number of words across all documents selected by the ranker.
Raises:
ValueError
: If any of the documents is not textual.
Returns:
A dictionary with the following keys:
documents
: Reranked list of Documents
Module meta_field
MetaFieldRanker
Ranks Documents based on the value of their specific meta field.
The ranking can be performed in descending order or ascending order.
Usage example:
from haystack import Document
from haystack.components.rankers import MetaFieldRanker
ranker = MetaFieldRanker(meta_field="rating")
docs = [
    Document(content="Paris", meta={"rating": 1.3}),
    Document(content="Berlin", meta={"rating": 0.7}),
    Document(content="Barcelona", meta={"rating": 2.1}),
]
output = ranker.run(documents=docs)
docs = output["documents"]
assert docs[0].content == "Barcelona"
MetaFieldRanker.__init__
def __init__(meta_field: str,
             weight: float = 1.0,
             top_k: Optional[int] = None,
             ranking_mode: Literal["reciprocal_rank_fusion",
                                   "linear_score"] = "reciprocal_rank_fusion",
             sort_order: Literal["ascending", "descending"] = "descending",
             missing_meta: Literal["drop", "top", "bottom"] = "bottom",
             meta_value_type: Optional[Literal["float", "int",
                                               "date"]] = None)
Creates an instance of MetaFieldRanker.
Arguments:
meta_field
: The name of the meta field to rank by.
weight
: In range [0,1]. 0 disables ranking by the meta field. 0.5 gives the ranking from the previous component and the ranking based on the meta field equal weight. 1 ranks by the meta field only.
top_k
: The maximum number of Documents to return per query. If not provided, the Ranker returns all documents it receives in the new ranking order.
ranking_mode
: The mode used to combine the Retriever's and Ranker's scores. Possible values are 'reciprocal_rank_fusion' (default) and 'linear_score'. Use the 'linear_score' mode only with Retrievers or Rankers that return a score in range [0,1].
sort_order
: Whether to sort the meta field by ascending or descending order. Possible values are 'descending' (default) and 'ascending'.
missing_meta
: What to do with documents that are missing the sorting metadata field. Possible values are:
- 'drop' will drop the documents entirely.
- 'top' will place the documents at the top of the metadata-sorted list (regardless of 'ascending' or 'descending').
- 'bottom' will place the documents at the bottom of the metadata-sorted list (regardless of 'ascending' or 'descending').
meta_value_type
: Parse the meta value into the specified data type before sorting. This only works if all meta values stored under meta_field in the provided documents are strings. For example, if meta_value_type="date" is specified, the meta value "date": "2015-02-01" is parsed into a datetime object and the documents are then sorted by date. The available options are:
- 'float' will parse the meta values into floats.
- 'int' will parse the meta values into integers.
- 'date' will parse the meta values into datetime objects.
- None (default) will do no parsing.
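How `sort_order`, `missing_meta` and `meta_value_type` interact can be sketched in plain Python. This is a simplified illustration, not the component's code: documents are represented by their meta dictionaries only, the function name `sort_by_meta` is hypothetical, and `datetime.fromisoformat` stands in for whatever date parser the library actually uses.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional

# Assumption: ISO-formatted date strings; the real parser may accept more formats.
PARSERS: Dict[str, Callable[[str], Any]] = {
    "float": float,
    "int": int,
    "date": datetime.fromisoformat,
}


def sort_by_meta(
    metas: List[Dict[str, Any]],
    meta_field: str,
    sort_order: str = "descending",
    missing_meta: str = "bottom",
    meta_value_type: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Sort meta dictionaries by one field and place the incomplete ones."""
    with_field = [m for m in metas if meta_field in m]
    missing = [m for m in metas if meta_field not in m]
    # With no meta_value_type the values are compared as they are.
    parse = PARSERS.get(meta_value_type or "", lambda value: value)
    ranked = sorted(
        with_field,
        key=lambda m: parse(m[meta_field]),
        reverse=(sort_order == "descending"),
    )
    if missing_meta == "top":
        return missing + ranked
    if missing_meta == "bottom":
        return ranked + missing
    return ranked  # 'drop'


metas = [{"date": "2015-02-01"}, {"title": "no date"}, {"date": "2020-01-01"}]
print(sort_by_meta(metas, "date", meta_value_type="date"))
# [{'date': '2020-01-01'}, {'date': '2015-02-01'}, {'title': 'no date'}]
```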
MetaFieldRanker.run
@component.output_types(documents=List[Document])
def run(documents: List[Document],
        top_k: Optional[int] = None,
        weight: Optional[float] = None,
        ranking_mode: Optional[Literal["reciprocal_rank_fusion",
                                       "linear_score"]] = None,
        sort_order: Optional[Literal["ascending", "descending"]] = None,
        missing_meta: Optional[Literal["drop", "top", "bottom"]] = None,
        meta_value_type: Optional[Literal["float", "int", "date"]] = None)
Ranks a list of Documents based on the selected meta field by:
- Sorting the Documents by the meta field in descending or ascending order.
- Merging the rankings from the previous component and based on the meta field according to ranking mode and weight.
- Returning the top-k documents.
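The merging step listed above can be sketched as a weighted reciprocal rank fusion. This is an illustration only: the function name `weighted_rrf`, the smoothing constant `k=60` and the exact way `weight` is applied are assumptions, not the component's actual formula.

```python
from typing import Dict, List


def weighted_rrf(
    previous_order: List[str],
    meta_order: List[str],
    weight: float,
    k: int = 60,  # assumption: a commonly used RRF smoothing constant
) -> List[str]:
    """Merge two rankings of the same ids; `weight` favors the meta ranking."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in range [0,1]")
    scores: Dict[str, float] = {}
    # Contribution of the ranking produced by the previous component.
    for rank, doc_id in enumerate(previous_order, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - weight) / (k + rank)
    # Contribution of the ranking based on the meta field.
    for rank, doc_id in enumerate(meta_order, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


previous = ["paris", "berlin", "barcelona"]   # order from the previous component
by_rating = ["barcelona", "paris", "berlin"]  # order by the meta field
print(weighted_rrf(previous, by_rating, weight=1.0))
# ['barcelona', 'paris', 'berlin']
print(weighted_rrf(previous, by_rating, weight=0.0))
# ['paris', 'berlin', 'barcelona']
```

With `weight=1.0` only the meta ranking matters, and with `weight=0.0` the previous ranking is returned unchanged, which matches the description of the `weight` parameter.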
Arguments:
documents
: Documents to be ranked.
top_k
: The maximum number of Documents to return per query. If not provided, the top_k provided at initialization time is used.
weight
: In range [0,1]. 0 disables ranking by the meta field. 0.5 gives the ranking from the previous component and the ranking based on the meta field equal weight. 1 ranks by the meta field only. If not provided, the weight provided at initialization time is used.
ranking_mode
: (optional) The mode used to combine the Retriever's and Ranker's scores. Possible values are 'reciprocal_rank_fusion' (default) and 'linear_score'. Use the 'linear_score' mode only with Retrievers or Rankers that return a score in range [0,1]. If not provided, the ranking_mode provided at initialization time is used.
sort_order
: Whether to sort the meta field by ascending or descending order. Possible values are 'descending' (default) and 'ascending'. If not provided, the sort_order provided at initialization time is used.
missing_meta
: What to do with documents that are missing the sorting metadata field. If not provided, the missing_meta provided at initialization time is used. Possible values are:
- 'drop' will drop the documents entirely.
- 'top' will place the documents at the top of the metadata-sorted list (regardless of 'ascending' or 'descending').
- 'bottom' will place the documents at the bottom of the metadata-sorted list (regardless of 'ascending' or 'descending').
meta_value_type
: Parse the meta value into the specified data type before sorting. This only works if all meta values stored under meta_field in the provided documents are strings. For example, if meta_value_type="date" is specified, the meta value "date": "2015-02-01" is parsed into a datetime object and the documents are then sorted by date. The available options are:
- 'float' will parse the meta values into floats.
- 'int' will parse the meta values into integers.
- 'date' will parse the meta values into datetime objects.
- None (default) will do no parsing.
Raises:
ValueError
: If top_k is not > 0. If weight is not in range [0,1]. If ranking_mode is not 'reciprocal_rank_fusion' or 'linear_score'. If sort_order is not 'ascending' or 'descending'. If meta_value_type is not 'float', 'int', 'date' or None.
Returns:
A dictionary with the following keys:
documents
: List of Documents sorted by the specified meta field.
Module meta_field_grouping_ranker
MetaFieldGroupingRanker
Reorders the documents by grouping them based on metadata keys.
The MetaFieldGroupingRanker can group documents by a primary metadata key, group_by, and subgroup them with an optional secondary key, subgroup_by.
Within each group or subgroup, it can also sort documents by a metadata key, sort_docs_by.
The output is a flat list of documents ordered by group_by and subgroup_by values.
Any documents without a group are placed at the end of the list.
The proper organization of documents helps improve the efficiency and performance of subsequent processing by an LLM.
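The grouping behavior described above can be sketched in plain Python. This is a simplified illustration, not the component's code: documents are represented by their meta dictionaries only, the function name `group_metas` is hypothetical, and keeping groups in order of first appearance is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def group_metas(
    metas: List[Dict[str, Any]],
    group_by: str,
    subgroup_by: Optional[str] = None,
    sort_docs_by: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return a flat list ordered by group, then subgroup, then sort key."""
    groups: Dict[Any, Dict[Any, List[Dict[str, Any]]]] = defaultdict(lambda: defaultdict(list))
    no_group: List[Dict[str, Any]] = []
    for meta in metas:
        if group_by not in meta:
            # Documents without a group go to the end of the output.
            no_group.append(meta)
            continue
        subgroup = meta.get(subgroup_by, "no_subgroup") if subgroup_by else "no_subgroup"
        groups[meta[group_by]][subgroup].append(meta)

    ordered: List[Dict[str, Any]] = []
    # Assumption: groups and subgroups keep their order of first appearance.
    for subgroups in groups.values():
        for members in subgroups.values():
            if sort_docs_by:
                # Members without the sort key are placed last in their subgroup.
                members = sorted(members, key=lambda m: m.get(sort_docs_by, float("inf")))
            ordered.extend(members)
    return ordered + no_group


metas = [
    {"group": "42", "split_id": 7, "subgroup": "subB"},
    {"group": "42", "split_id": 4, "subgroup": "subB"},
    {"group": "314", "split_id": 2, "subgroup": "subC"},
    {"split_id": 1},
    {"group": "11", "split_id": 2, "subgroup": "subD"},
    {"group": "42", "split_id": 3, "subgroup": "subB"},
]
result = group_metas(metas, "group", subgroup_by="subgroup", sort_docs_by="split_id")
print([(m.get("group"), m["split_id"]) for m in result])
# [('42', 3), ('42', 4), ('42', 7), ('314', 2), ('11', 2), (None, 1)]
```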
Usage example
from haystack.components.rankers import MetaFieldGroupingRanker
from haystack.dataclasses import Document
docs = [
    Document(content="Javascript is a popular programming language", meta={"group": "42", "split_id": 7, "subgroup": "subB"}),
    Document(content="Python is a popular programming language", meta={"group": "42", "split_id": 4, "subgroup": "subB"}),
    Document(content="A chromosome is a package of DNA", meta={"group": "314", "split_id": 2, "subgroup": "subC"}),
    Document(content="An octopus has three hearts", meta={"group": "11", "split_id": 2, "subgroup": "subD"}),
    Document(content="Java is a popular programming language", meta={"group": "42", "split_id": 3, "subgroup": "subB"})
]
ranker = MetaFieldGroupingRanker(group_by="group", subgroup_by="subgroup", sort_docs_by="split_id")
result = ranker.run(documents=docs)
print(result["documents"])
# [
# Document(id=d665bbc83e52c08c3d8275bccf4f22bf2bfee21c6e77d78794627637355b8ebc,
# content: 'Java is a popular programming language', meta: {'group': '42', 'split_id': 3, 'subgroup': 'subB'}),
# Document(id=a20b326f07382b3cbf2ce156092f7c93e8788df5d48f2986957dce2adb5fe3c2,
# content: 'Python is a popular programming language', meta: {'group': '42', 'split_id': 4, 'subgroup': 'subB'}),
# Document(id=ce12919795d22f6ca214d0f161cf870993889dcb146f3bb1b3e1ffdc95be960f,
# content: 'Javascript is a popular programming language', meta: {'group': '42', 'split_id': 7, 'subgroup': 'subB'}),
# Document(id=d9fc857046c904e5cf790b3969b971b1bbdb1b3037d50a20728fdbf82991aa94,
# content: 'A chromosome is a package of DNA', meta: {'group': '314', 'split_id': 2, 'subgroup': 'subC'}),
# Document(id=6d3b7bdc13d09aa01216471eb5fb0bfdc53c5f2f3e98ad125ff6b85d3106c9a3,
# content: 'An octopus has three hearts', meta: {'group': '11', 'split_id': 2, 'subgroup': 'subD'})
# ]
MetaFieldGroupingRanker.__init__
def __init__(group_by: str,
             subgroup_by: Optional[str] = None,
             sort_docs_by: Optional[str] = None)
Creates an instance of MetaFieldGroupingRanker.
Arguments:
group_by
: The metadata key to aggregate the documents by.
subgroup_by
: The metadata key to aggregate the documents within a group that was created by the group_by key.
sort_docs_by
: Determines which metadata key is used to sort the documents. If not provided, the documents within the groups or subgroups are not sorted and are kept in the same order as they were inserted in the subgroups.
MetaFieldGroupingRanker.run
@component.output_types(documents=List[Document])
def run(documents: List[Document]) -> Dict[str, Any]
Groups the provided list of documents based on the group_by parameter and optionally the subgroup_by parameter.
The output is a list of documents reordered based on how they were grouped.
Arguments:
documents
: The list of documents to group.
Returns:
A dictionary with the following keys:
- documents: The list of documents ordered by the group_by and subgroup_by metadata values.
Module transformers_similarity
TransformersSimilarityRanker
Ranks documents based on their semantic similarity to the query.
It uses a pre-trained cross-encoder model from Hugging Face to score each query and document pair.
Usage example
from haystack import Document
from haystack.components.rankers import TransformersSimilarityRanker
ranker = TransformersSimilarityRanker()
docs = [Document(content="Paris"), Document(content="Berlin")]
query = "City in Germany"
ranker.warm_up()
result = ranker.run(query=query, documents=docs)
docs = result["documents"]
print(docs[0].content)
TransformersSimilarityRanker.__init__
def __init__(model: Union[str, Path] = "cross-encoder/ms-marco-MiniLM-L-6-v2",
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             top_k: int = 10,
             query_prefix: str = "",
             document_prefix: str = "",
             meta_fields_to_embed: Optional[List[str]] = None,
             embedding_separator: str = "\n",
             scale_score: bool = True,
             calibration_factor: Optional[float] = 1.0,
             score_threshold: Optional[float] = None,
             model_kwargs: Optional[Dict[str, Any]] = None,
             tokenizer_kwargs: Optional[Dict[str, Any]] = None,
             batch_size: int = 16)
Creates an instance of TransformersSimilarityRanker.
Arguments:
model
: The ranking model. Pass a local path or the Hugging Face model name of a cross-encoder model.
device
: The device on which the model is loaded. If None, the default device is automatically selected.
token
: The API token to download private models from Hugging Face.
top_k
: The maximum number of documents to return per query.
query_prefix
: A string to add at the beginning of the query text before ranking. Use it to prepend the text with an instruction, as required by reranking models like bge.
document_prefix
: A string to add at the beginning of each document before ranking. You can use it to prepend the document with an instruction, as required by embedding models like bge.
meta_fields_to_embed
: List of metadata fields to embed with the document.
embedding_separator
: Separator to concatenate metadata fields to the document.
scale_score
: If True, scales the raw logit predictions using a Sigmoid activation function. If False, disables scaling of the raw logit predictions.
calibration_factor
: Use this factor to calibrate probabilities with sigmoid(logits * calibration_factor). Used only if scale_score is True.
score_threshold
: Use it to return documents with a score above this threshold only.
model_kwargs
: Additional keyword arguments for AutoModelForSequenceClassification.from_pretrained when loading the model. Refer to specific model documentation for available kwargs.
tokenizer_kwargs
: Additional keyword arguments for AutoTokenizer.from_pretrained when loading the tokenizer. Refer to specific model documentation for available kwargs.
batch_size
: The batch size to use for inference. The higher the batch size, the more memory is required. If you run into memory issues, reduce the batch size.
Raises:
ValueError
: If top_k is not > 0. If scale_score is True and calibration_factor is not provided.
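How the prefix, metadata and separator parameters might combine into the text pair passed to the model can be sketched as follows. This is an assumption-laden illustration, not the component's code: the function name `build_pair` is hypothetical, and placing the metadata values before the document content is a guess about the concatenation order.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_pair(
    query: str,
    content: str,
    meta: Dict[str, Any],
    query_prefix: str = "",
    document_prefix: str = "",
    meta_fields_to_embed: Optional[List[str]] = None,
    embedding_separator: str = "\n",
) -> Tuple[str, str]:
    """Build the (query, document) text pair a cross-encoder would score."""
    # Collect the requested meta values, skipping missing or empty ones.
    meta_values = [
        str(meta[field])
        for field in (meta_fields_to_embed or [])
        if meta.get(field) is not None
    ]
    # Assumption: metadata comes first, followed by the document content.
    document_text = embedding_separator.join(meta_values + [content])
    return query_prefix + query, document_prefix + document_text


pair = build_pair(
    "City in Germany",
    "Berlin",
    {"country": "Germany"},
    query_prefix="query: ",
    document_prefix="passage: ",
    meta_fields_to_embed=["country"],
)
print(pair)
# ('query: City in Germany', 'passage: Germany\nBerlin')
```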
TransformersSimilarityRanker.warm_up
def warm_up()
Initializes the component.
TransformersSimilarityRanker.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
TransformersSimilarityRanker.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "TransformersSimilarityRanker"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
TransformersSimilarityRanker.run
@component.output_types(documents=List[Document])
def run(query: str,
        documents: List[Document],
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
        calibration_factor: Optional[float] = None,
        score_threshold: Optional[float] = None)
Returns a list of documents ranked by their similarity to the given query.
Arguments:
query
: The input query to compare the documents to.
documents
: A list of documents to be ranked.
top_k
: The maximum number of documents to return.
scale_score
: If True, scales the raw logit predictions using a Sigmoid activation function. If False, disables scaling of the raw logit predictions.
calibration_factor
: Use this factor to calibrate probabilities with sigmoid(logits * calibration_factor). Used only if scale_score is True.
score_threshold
: Use it to return documents only with a score above this threshold.
Raises:
ValueError
: If top_k is not > 0. If scale_score is True and calibration_factor is not provided.
RuntimeError
: If the model is not loaded because warm_up() was not called before.
Returns:
A dictionary with the following keys:
documents
: A list of documents closest to the query, sorted from most similar to least similar.
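The effect of `scale_score`, `calibration_factor`, `score_threshold` and `top_k` on raw model outputs can be sketched in plain Python. This is an illustration, not the component's code: the function name `scale_and_filter` is hypothetical, the logits are made-up numbers, and treating a score exactly equal to the threshold as not "above" it is an assumption.

```python
import math
from typing import List, Optional, Tuple


def scale_and_filter(
    logits: List[float],
    calibration_factor: float = 1.0,
    score_threshold: Optional[float] = None,
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best first."""
    # sigmoid(logit * calibration_factor) maps each raw logit into (0, 1).
    scores = [1.0 / (1.0 + math.exp(-logit * calibration_factor)) for logit in logits]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        # Assumption: only scores strictly above the threshold are kept.
        ranked = [(index, score) for index, score in ranked if score > score_threshold]
    return ranked[:top_k]


kept = scale_and_filter([0.0, 2.0, -2.0], score_threshold=0.4)
print([index for index, _ in kept])
# [1, 0]
```

A logit of 0.0 maps to a score of 0.5, so the first document survives the 0.4 threshold while the third (logit -2.0) is dropped.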
Module sentence_transformers_diversity
DiversityRankingStrategy
The strategy to use for diversity ranking.
DiversityRankingStrategy.__str__
def __str__() -> str
Convert a Strategy enum to a string.
DiversityRankingStrategy.from_str
@staticmethod
def from_str(string: str) -> "DiversityRankingStrategy"
Convert a string to a Strategy enum.
DiversityRankingSimilarity
The similarity metric to use for comparing embeddings.
DiversityRankingSimilarity.__str__
def __str__() -> str
Convert a Similarity enum to a string.
DiversityRankingSimilarity.from_str
@staticmethod
def from_str(string: str) -> "DiversityRankingSimilarity"
Convert a string to a Similarity enum.
SentenceTransformersDiversityRanker
A Diversity Ranker based on Sentence Transformers.
Applies a document ranking algorithm based on one of the two strategies:
- Greedy Diversity Order: Implements a document ranking algorithm that orders documents in a way that maximizes the overall diversity of the documents based on their similarity to the query. It uses a pre-trained Sentence Transformers model to embed the query and the documents.
- Maximum Margin Relevance: Implements a document ranking algorithm that orders documents based on their Maximum Margin Relevance (MMR) scores. MMR scores are calculated for each document based on their relevance to the query and diversity from already selected documents. The algorithm iteratively selects documents based on their MMR scores, balancing between relevance to the query and diversity from already selected documents. The 'lambda_threshold' controls the trade-off between relevance and diversity.
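The greedy diversity order strategy is described only briefly, so the following plain-Python sketch shows one plausible reading of it rather than the component's code: start with the document most similar to the query, then repeatedly append the remaining document that is, on average, least similar to those already chosen. The function names and the toy two-dimensional vectors are hypothetical; a real ranker would use embeddings from a Sentence Transformers model.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_diversity_order(query_vec: Sequence[float], doc_vecs: List[Sequence[float]]) -> List[int]:
    """Return document indices in a diversity-favoring order."""
    if not doc_vecs:
        return []
    remaining = list(range(len(doc_vecs)))
    # Start with the document most similar to the query.
    first = max(remaining, key=lambda i: cosine(query_vec, doc_vecs[i]))
    order = [first]
    remaining.remove(first)
    while remaining:
        # Pick the document with the lowest mean similarity to the chosen ones.
        nxt = min(
            remaining,
            key=lambda i: sum(cosine(doc_vecs[i], doc_vecs[j]) for j in order) / len(order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(greedy_diversity_order(query, docs))
# [0, 2, 1]
```

The near-duplicate of the first pick is pushed to the end, which is the behavior a diversity ranker aims for.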
Usage example
from haystack import Document
from haystack.components.rankers import SentenceTransformersDiversityRanker
ranker = SentenceTransformersDiversityRanker(model="sentence-transformers/all-MiniLM-L6-v2", similarity="cosine", strategy="greedy_diversity_order")
ranker.warm_up()
docs = [Document(content="Paris"), Document(content="Berlin")]
query = "What is the capital of Germany?"
output = ranker.run(query=query, documents=docs)
docs = output["documents"]
SentenceTransformersDiversityRanker.__init__
def __init__(
        model: str = "sentence-transformers/all-MiniLM-L6-v2",
        top_k: int = 10,
        device: Optional[ComponentDevice] = None,
        token: Optional[Secret] = Secret.from_env_var(
            ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
        similarity: Union[str, DiversityRankingSimilarity] = "cosine",
        query_prefix: str = "",
        query_suffix: str = "",
        document_prefix: str = "",
        document_suffix: str = "",
        meta_fields_to_embed: Optional[List[str]] = None,
        embedding_separator: str = "\n",
        strategy: Union[str,
                        DiversityRankingStrategy] = "greedy_diversity_order",
        lambda_threshold: float = 0.5)
Initialize a SentenceTransformersDiversityRanker.
Arguments:
model
: Local path or name of the model in Hugging Face's model hub, such as 'sentence-transformers/all-MiniLM-L6-v2'.
top_k
: The maximum number of Documents to return per query.
device
: The device on which the model is loaded. If None, the default device is automatically selected.
token
: The API token used to download private models from Hugging Face.
similarity
: Similarity metric for comparing embeddings. Can be set to "dot_product" or "cosine" (default).
query_prefix
: A string to add to the beginning of the query text before ranking. Can be used to prepend the text with an instruction, as required by some embedding models, such as E5 and BGE.
query_suffix
: A string to add to the end of the query text before ranking.
document_prefix
: A string to add to the beginning of each Document text before ranking. Can be used to prepend the text with an instruction, as required by some embedding models, such as E5 and BGE.
document_suffix
: A string to add to the end of each Document text before ranking.
meta_fields_to_embed
: List of meta fields that should be embedded along with the Document content.
embedding_separator
: Separator used to concatenate the meta fields to the Document content.
strategy
: The strategy to use for diversity ranking. Can be either "greedy_diversity_order" or "maximum_margin_relevance".
lambda_threshold
: The trade-off parameter between relevance and diversity. Only used when strategy is "maximum_margin_relevance".
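The role of `lambda_threshold` can be illustrated with a minimal MMR sketch in plain Python. It uses the common formulation `lambda * relevance - (1 - lambda) * redundancy`, which is assumed here rather than taken from the component's code; the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_order(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    lambda_threshold: float = 0.5,
    top_k: Optional[int] = None,
) -> List[int]:
    """Iteratively select document indices by their MMR score."""
    remaining = list(range(len(doc_vecs)))
    selected: List[int] = []
    limit = len(doc_vecs) if top_k is None else min(top_k, len(doc_vecs))

    def mmr_score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Redundancy: highest similarity to any already selected document.
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_threshold * relevance - (1.0 - lambda_threshold) * redundancy

    while len(selected) < limit:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_order(query, docs, lambda_threshold=1.0))  # relevance only
# [0, 1, 2]
print(mmr_order(query, docs, lambda_threshold=0.2))  # diversity favored
# [0, 2, 1]
```

In this formulation a value close to 1 ranks almost purely by relevance, while a value close to 0 pushes near-duplicates of already selected documents down the list.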
SentenceTransformersDiversityRanker.warm_up
def warm_up()
Initializes the component.
SentenceTransformersDiversityRanker.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
SentenceTransformersDiversityRanker.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SentenceTransformersDiversityRanker"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
SentenceTransformersDiversityRanker.run
@component.output_types(documents=List[Document])
def run(query: str,
        documents: List[Document],
        top_k: Optional[int] = None,
        lambda_threshold: Optional[float] = None) -> Dict[str, List[Document]]
Rank the documents based on their diversity.
Arguments:
query
: The search query.
documents
: List of Document objects to be ranked.
top_k
: Optional. An integer to override the top_k set during initialization.
lambda_threshold
: Override the trade-off parameter between relevance and diversity. Only used when strategy is "maximum_margin_relevance".
Raises:
ValueError
: If the top_k value is less than or equal to 0.
RuntimeError
: If the component has not been warmed up.
Returns:
A dictionary with the following key:
documents
: List of Document objects that have been selected based on the diversity ranking.