Reorders a set of Documents based on their relevance to the query.

<a id="lost_in_the_middle"></a>

# Module lost\_in\_the\_middle

<a id="lost_in_the_middle.LostInTheMiddleRanker"></a>

## LostInTheMiddleRanker

```python
@component
class LostInTheMiddleRanker()
```
Ranks documents based on the "lost in the middle" order so that the most relevant documents are either at the beginning or end, while the least relevant are in the middle.

LostInTheMiddleRanker assumes that some prior component in the pipeline has already ranked the documents by relevance, and it requires only documents as input, not a query. It is typically used as the last component before building a prompt for an LLM, to prepare the input context for the LLM.

Lost in the Middle ranking lays out document contents in the LLM context so that the most relevant contents are at the beginning or end of the input context, while the least relevant are in the middle. See the paper "Lost in the Middle: How Language Models Use Long Contexts" for more details.
Usage example:

```python
from haystack.components.rankers import LostInTheMiddleRanker
from haystack import Document

ranker = LostInTheMiddleRanker()
docs = [Document(content="Paris"), Document(content="Berlin"), Document(content="Madrid")]
result = ranker.run(documents=docs)
for doc in result["documents"]:
    print(doc.content)
```
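The reordering is simple to picture. Here is a minimal pure-Python sketch of the "lost in the middle" layout (an illustration, not the component's actual implementation): items ranked most-relevant-first are alternately assigned to the front and the back of the result, so the least relevant items end up in the middle.

```python
def lost_in_the_middle_order(ranked):
    """Reorder items (most relevant first) so that the most relevant
    sit at the edges and the least relevant land in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        # Alternate: 1st, 3rd, 5th ... fill the front;
        # 2nd, 4th, 6th ... fill the back (reversed at the end).
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

print(lost_in_the_middle_order(["doc1", "doc2", "doc3", "doc4", "doc5"]))
# -> ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```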
<a id="lost_in_the_middle.LostInTheMiddleRanker.__init__"></a>

#### LostInTheMiddleRanker.\_\_init\_\_

```python
def __init__(word_count_threshold: Optional[int] = None,
             top_k: Optional[int] = None)
```

If `word_count_threshold` is specified, this ranker includes all documents up until the point where adding another document would exceed `word_count_threshold`. The last document, the one that causes the threshold to be breached, is still included in the resulting list, but all subsequent documents are discarded.

**Arguments**:

- `word_count_threshold`: The maximum total number of words across all documents selected by the ranker.
- `top_k`: The maximum number of documents to return.
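The `word_count_threshold` behavior can be illustrated with a short sketch (a hypothetical helper, not the component's code): documents are accumulated in order, and the document that first pushes the running word count over the threshold is kept, while everything after it is dropped.

```python
def select_by_word_count(contents, word_count_threshold):
    """Keep document texts in order until the running word count
    exceeds the threshold; the breaching document is included."""
    selected, total_words = [], 0
    for text in contents:
        selected.append(text)
        total_words += len(text.split())
        if total_words > word_count_threshold:
            break  # the breaching document stays in `selected`
    return selected

print(select_by_word_count(["one two three", "four five", "six"], 4))
# -> ['one two three', 'four five']
```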
<a id="lost_in_the_middle.LostInTheMiddleRanker.run"></a>

#### LostInTheMiddleRanker.run

```python
@component.output_types(documents=List[Document])
def run(documents: List[Document],
        top_k: Optional[int] = None,
        word_count_threshold: Optional[int] = None) -> Dict[str, List[Document]]
```

Reranks documents based on the "lost in the middle" order.

**Arguments**:

- `documents`: List of Documents to reorder.
- `top_k`: The maximum number of documents to return.
- `word_count_threshold`: The maximum total number of words across all documents selected by the ranker.

**Raises**:

- `ValueError`: If any of the documents is not textual.

**Returns**:

A dictionary with the following keys:

- `documents`: Reranked list of Documents.
<a id="meta_field"></a>

# Module meta\_field

<a id="meta_field.MetaFieldRanker"></a>

## MetaFieldRanker

```python
@component
class MetaFieldRanker()
```

Ranks Documents based on the value of a specific meta field.

The ranking can be performed in descending or ascending order.
Usage example:

```python
from haystack import Document
from haystack.components.rankers import MetaFieldRanker

ranker = MetaFieldRanker(meta_field="rating")
docs = [
    Document(content="Paris", meta={"rating": 1.3}),
    Document(content="Berlin", meta={"rating": 0.7}),
    Document(content="Barcelona", meta={"rating": 2.1}),
]
output = ranker.run(documents=docs)
docs = output["documents"]
assert docs[0].content == "Barcelona"
```
<a id="meta_field.MetaFieldRanker.__init__"></a>
#### MetaFieldRanker.\_\_init\_\_
```python
def __init__(meta_field: str,
             weight: float = 1.0,
             top_k: Optional[int] = None,
             ranking_mode: Literal["reciprocal_rank_fusion", "linear_score"] = "reciprocal_rank_fusion",
             sort_order: Literal["ascending", "descending"] = "descending",
             meta_value_type: Optional[Literal["float", "int", "date"]] = None)
```

Creates an instance of MetaFieldRanker.
**Arguments**:

- `meta_field`: The name of the meta field to rank by.
- `weight`: In range [0,1]. 0 disables ranking by the meta field; 0.5 gives the ranking from the previous component and the meta-field ranking equal weight; 1 ranks by the meta field only.
- `top_k`: The maximum number of Documents to return per query. If not provided, the Ranker returns all documents it receives in the new ranking order.
- `ranking_mode`: The mode used to combine the Retriever's and Ranker's scores. Possible values are 'reciprocal_rank_fusion' (default) and 'linear_score'. Use the 'linear_score' mode only with Retrievers or Rankers that return a score in range [0,1].
- `sort_order`: Whether to sort the meta field in ascending or descending order. Possible values are `descending` (default) and `ascending`.
- `meta_value_type`: Parse the meta value into the specified data type before sorting. This works only if all meta values stored under `meta_field` in the provided documents are strings. For example, with `meta_value_type="date"`, the meta value `"date": "2015-02-01"` is parsed into a datetime object and the documents are then sorted by date. The available options are:
    - 'float' will parse the meta values into floats.
    - 'int' will parse the meta values into integers.
    - 'date' will parse the meta values into datetime objects.
    - 'None' (default) will do no parsing.
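As an illustration of what `meta_value_type` does conceptually, here is a standard-library-only sketch (hypothetical documents, not the component's internals): string meta values are parsed first, so the sort compares real dates rather than raw strings.

```python
from datetime import datetime

# Hypothetical documents carrying string meta values.
docs = [
    {"content": "a", "meta": {"date": "2024-11-05"}},
    {"content": "b", "meta": {"date": "2023-01-15"}},
    {"content": "c", "meta": {"date": "2024-02-29"}},
]

# meta_value_type="date": parse the strings into datetimes before
# sorting (descending is the default sort_order).
docs.sort(key=lambda d: datetime.fromisoformat(d["meta"]["date"]), reverse=True)
print([d["content"] for d in docs])
# -> ['a', 'c', 'b']
```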
<a id="meta_field.MetaFieldRanker.run"></a>

#### MetaFieldRanker.run

```python
@component.output_types(documents=List[Document])
def run(documents: List[Document],
        top_k: Optional[int] = None,
        weight: Optional[float] = None,
        ranking_mode: Optional[Literal["reciprocal_rank_fusion", "linear_score"]] = None,
        sort_order: Optional[Literal["ascending", "descending"]] = None,
        meta_value_type: Optional[Literal["float", "int", "date"]] = None)
```

Ranks a list of Documents based on the selected meta field by:

- Sorting the Documents by the meta field in descending or ascending order.
- Merging the ranking from the previous component with the meta-field ranking, according to the ranking mode and weight.
- Returning the top-k documents.
**Arguments**:

- `documents`: Documents to be ranked.
- `top_k`: The maximum number of Documents to return per query. If not provided, the top_k provided at initialization time is used.
- `weight`: In range [0,1]. 0 disables ranking by the meta field; 0.5 gives the ranking from the previous component and the meta-field ranking equal weight; 1 ranks by the meta field only. If not provided, the weight provided at initialization time is used.
- `ranking_mode`: The mode used to combine the Retriever's and Ranker's scores. Possible values are 'reciprocal_rank_fusion' (default) and 'linear_score'. Use the 'linear_score' mode only with Retrievers or Rankers that return a score in range [0,1]. If not provided, the ranking_mode provided at initialization time is used.
- `sort_order`: Whether to sort the meta field in ascending or descending order. Possible values are `descending` (default) and `ascending`. If not provided, the sort_order provided at initialization time is used.
- `meta_value_type`: Parse the meta value into the specified data type before sorting. This works only if all meta values stored under `meta_field` in the provided documents are strings. For example, with `meta_value_type="date"`, the meta value `"date": "2015-02-01"` is parsed into a datetime object and the documents are then sorted by date. The available options are:
    - 'float' will parse the meta values into floats.
    - 'int' will parse the meta values into integers.
    - 'date' will parse the meta values into datetime objects.
    - 'None' (default) will do no parsing.
**Raises**:

- `ValueError`: If `top_k` is not > 0, if `weight` is not in range [0,1], if `ranking_mode` is not 'reciprocal_rank_fusion' or 'linear_score', if `sort_order` is not 'ascending' or 'descending', or if `meta_value_type` is not 'float', 'int', 'date', or `None`.

**Returns**:

A dictionary with the following keys:

- `documents`: List of Documents sorted by the specified meta field.
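To make the merging step more concrete, here is a hedged sketch of weighted reciprocal rank fusion. The formula and the classic RRF constant k=60 are assumptions for illustration; Haystack's exact implementation and constant may differ.

```python
def weighted_rrf(previous_order, meta_order, weight, k=60):
    """Fuse two rankings (lists of doc ids, best first) with
    weighted reciprocal rank fusion: weight=0 keeps the previous
    ranking, weight=1 ranks by the meta field only."""
    scores = {}
    for rank, doc in enumerate(previous_order, start=1):
        scores[doc] = (1 - weight) / (k + rank)
    for rank, doc in enumerate(meta_order, start=1):
        scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(weighted_rrf(["a", "b", "c"], ["c", "b", "a"], weight=1.0))
# -> ['c', 'b', 'a']
```

With `weight=0.5` the two rankings pull equally, so a document ranked high in both ends up ahead of one ranked high in only one of them.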
<a id="transformers_similarity"></a>

# Module transformers\_similarity

<a id="transformers_similarity.TransformersSimilarityRanker"></a>

## TransformersSimilarityRanker

```python
@component
class TransformersSimilarityRanker()
```

Ranks Documents based on their similarity to the query.

It uses a pre-trained cross-encoder model from the Hugging Face Hub to score each query-Document pair.
Usage example:

```python
from haystack import Document
from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker()
docs = [Document(content="Paris"), Document(content="Berlin")]
query = "City in Germany"
ranker.warm_up()
result = ranker.run(query=query, documents=docs)
docs = result["documents"]
print(docs[0].content)
```
<a id="transformers_similarity.TransformersSimilarityRanker.__init__"></a>

#### TransformersSimilarityRanker.\_\_init\_\_

```python
def __init__(model: Union[str, Path] = "cross-encoder/ms-marco-MiniLM-L-6-v2",
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var("HF_API_TOKEN", strict=False),
             top_k: int = 10,
             query_prefix: str = "",
             document_prefix: str = "",
             meta_fields_to_embed: Optional[List[str]] = None,
             embedding_separator: str = "\n",
             scale_score: bool = True,
             calibration_factor: Optional[float] = 1.0,
             score_threshold: Optional[float] = None,
             model_kwargs: Optional[Dict[str, Any]] = None)
```

Creates an instance of TransformersSimilarityRanker.
**Arguments**:

- `model`: The name or path of a pre-trained cross-encoder model from the Hugging Face Hub.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically selected.
- `token`: The API token used to download private models from Hugging Face.
- `top_k`: The maximum number of Documents to return per query.
- `query_prefix`: A string to add to the beginning of the query text before ranking. Can be used to prepend the text with an instruction, as required by some reranking models, such as bge.
- `document_prefix`: A string to add to the beginning of each Document text before ranking. Can be used to prepend the text with an instruction, as required by some embedding models, such as bge.
- `meta_fields_to_embed`: List of meta fields that should be embedded along with the Document content.
- `embedding_separator`: Separator used to concatenate the meta fields to the Document content.
- `scale_score`: Whether the raw logit predictions will be scaled using a Sigmoid activation function. Set this to False if you do not want any scaling of the raw logit predictions.
- `calibration_factor`: Factor used for calibrating probabilities calculated by `sigmoid(logits * calibration_factor)`. This is only used if `scale_score` is set to True.
- `score_threshold`: If provided, only returns documents with a score above this threshold.
- `model_kwargs`: Additional keyword arguments passed to `AutoModelForSequenceClassification.from_pretrained` when loading the model specified in `model`. For details on what kwargs you can pass, see the model's documentation.

**Raises**:

- `ValueError`: If `top_k` is not > 0, or if `scale_score` is True and `calibration_factor` is not provided.
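The score scaling described above can be sketched as follows. This is an illustrative reimplementation of the stated formula `sigmoid(logits * calibration_factor)`, not the component's code.

```python
import math

def scale(logit: float, calibration_factor: float = 1.0) -> float:
    """Squash a raw cross-encoder logit into (0, 1) via a
    calibrated sigmoid: sigmoid(logit * calibration_factor)."""
    return 1.0 / (1.0 + math.exp(-logit * calibration_factor))

print(scale(0.0))                # -> 0.5
print(scale(4.0) > scale(-4.0))  # -> True: higher logits map to higher scores
```

A larger `calibration_factor` steepens the sigmoid, pushing scores toward 0 and 1; a smaller one flattens it toward 0.5.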
<a id="transformers_similarity.TransformersSimilarityRanker.warm_up"></a>

#### TransformersSimilarityRanker.warm\_up

```python
def warm_up()
```

Initializes the component.

<a id="transformers_similarity.TransformersSimilarityRanker.to_dict"></a>

#### TransformersSimilarityRanker.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="transformers_similarity.TransformersSimilarityRanker.from_dict"></a>

#### TransformersSimilarityRanker.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "TransformersSimilarityRanker"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
<a id="transformers_similarity.TransformersSimilarityRanker.run"></a>

#### TransformersSimilarityRanker.run

```python
@component.output_types(documents=List[Document])
def run(query: str,
        documents: List[Document],
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
        calibration_factor: Optional[float] = None,
        score_threshold: Optional[float] = None)
```

Returns a list of Documents ranked by their similarity to the given query.

**Arguments**:

- `query`: Query string.
- `documents`: List of Documents.
- `top_k`: The maximum number of Documents you want the Ranker to return.
- `scale_score`: Whether the raw logit predictions will be scaled using a Sigmoid activation function. Set this to False if you do not want any scaling of the raw logit predictions.
- `calibration_factor`: Factor used for calibrating probabilities calculated by `sigmoid(logits * calibration_factor)`. This is only used if `scale_score` is set to True.
- `score_threshold`: If provided, only returns documents with a score above this threshold.

**Raises**:

- `ValueError`: If `top_k` is not > 0, or if `scale_score` is True and `calibration_factor` is not provided.
- `ComponentError`: If the model is not loaded because `warm_up()` was not called before.

**Returns**:

A dictionary with the following keys:

- `documents`: List of Documents most similar to the given query, in descending order of similarity.