SASEvaluator
The SASEvaluator evaluates answers predicted by Haystack pipelines using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model. This metric is called semantic answer similarity (SAS).
Most common position in a pipeline | On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator. |
Mandatory init variables | "token": An HF API token. Can be set with the HF_API_TOKEN or HF_TOKEN env var. |
Mandatory run variables | "ground_truth_answers": A list of strings containing the ground truth answers; "predicted_answers": A list of strings containing the predicted answers to be evaluated |
Output variables | A dictionary containing: - score: A number from 0.0 to 1.0 representing the mean SAS score for all pairs of predicted answers and ground truth answers - individual_scores: A list of the SAS scores, each ranging from 0.0 to 1.0, for all pairs of predicted answers and ground truth answers |
API reference | Evaluators |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/sas_evaluator.py |
Overview
You can use the SASEvaluator component to evaluate answers predicted by a Haystack pipeline, such as a RAG pipeline, against ground truth labels.
You can provide a bi-encoder or cross-encoder model to initialize a SASEvaluator. By default, the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model is used.
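To make the bi-encoder case more concrete: a bi-encoder embeds each answer separately and then compares the two vectors, typically with cosine similarity, whereas a cross-encoder scores the answer pair jointly. The following is a minimal sketch of the cosine-similarity step only. It assumes toy three-dimensional vectors in place of real model embeddings; the vectors and the `cosine_similarity` helper are illustrative and not part of Haystack.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real bi-encoder would produce these vectors.
truth_vec = [1.0, 0.0, 1.0]
same_direction_vec = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal_vec = [0.0, 1.0, 0.0]      # no overlap with truth_vec

# Close to 1.0: vectors pointing the same way count as highly similar.
print(cosine_similarity(truth_vec, same_direction_vec))
# Exactly 0.0: orthogonal vectors count as unrelated.
print(cosine_similarity(truth_vec, orthogonal_vec))
```

Because cosine similarity ignores vector length, two answers are scored by the direction of their embeddings rather than by their magnitude.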
Note that only one predicted answer is compared to one ground truth answer at a time. The component does not support multiple ground truth answers for the same question or multiple answers predicted for the same question.
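If your dataset has several acceptable ground truth answers per question, one possible workaround is to expand them into one-to-one pairs, score each pair, and keep the best score per question. The sketch below shows only the pairing and aggregation logic. `expand_pairs`, `best_score_per_question`, and the placeholder scores are hypothetical and are not part of Haystack; in practice the two flat lists would be passed to run() and the returned individual_scores would replace the placeholder scores.

```python
def expand_pairs(predictions, ground_truth_sets):
    """Flatten (prediction, [truth, ...]) into aligned one-to-one lists.

    Returns the flat truths, the flat predictions, and, for each pair,
    the index of the question it came from.
    """
    if len(predictions) != len(ground_truth_sets):
        raise ValueError("need one set of ground truths per prediction")
    flat_truths, flat_predictions, question_ids = [], [], []
    for question_id, (prediction, truths) in enumerate(
        zip(predictions, ground_truth_sets)
    ):
        for truth in truths:
            flat_truths.append(truth)
            flat_predictions.append(prediction)
            question_ids.append(question_id)
    return flat_truths, flat_predictions, question_ids


def best_score_per_question(scores, question_ids, num_questions):
    """Keep the highest pair score for every question (0.0 if it has no pairs)."""
    best = [0.0] * num_questions
    for score, question_id in zip(scores, question_ids):
        best[question_id] = max(best[question_id], score)
    return best


predictions = ["Berlin", "Lyon"]
ground_truth_sets = [["Berlin", "Berlin, Germany"], ["Paris"]]

truths, preds, ids = expand_pairs(predictions, ground_truth_sets)
print(truths)  # ['Berlin', 'Berlin, Germany', 'Paris']
print(preds)   # ['Berlin', 'Berlin', 'Lyon']
print(ids)     # [0, 0, 1]

# Placeholder values standing in for the evaluator's individual_scores.
placeholder_scores = [1.0, 0.8, 0.5]
print(best_score_per_question(placeholder_scores, ids, len(predictions)))
# [1.0, 0.5]
```

Taking the maximum per question is one design choice among several; averaging over the ground truths would penalize predictions that match only one of the accepted phrasings.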
Usage
On its own
Below is an example of using a SASEvaluator component to evaluate two answers and compare them to ground truth answers. We need to call warm_up() before run() to load the model.
from haystack.components.evaluators import SASEvaluator

sas_evaluator = SASEvaluator()
# Load the model before the first call to run().
sas_evaluator.warm_up()

result = sas_evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"]
)

print(result["individual_scores"])
# [array([[0.99999994]], dtype=float32), array([[0.51747656]], dtype=float32)]
print(result["score"])
# 0.7587383
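As a quick sanity check on the output above, score is the mean of individual_scores. The sketch below recomputes that mean from the two printed values, written here as plain Python floats, and flags the pairs that fall below a cutoff. The `threshold` of 0.6 and the `low_scoring_pairs` helper are assumptions made for illustration, not Haystack defaults.

```python
from statistics import mean


def low_scoring_pairs(ground_truths, predictions, scores, threshold):
    """Return (truth, prediction, score) triples whose score is below threshold."""
    return [
        (truth, prediction, score)
        for truth, prediction, score in zip(ground_truths, predictions, scores)
        if score < threshold
    ]


ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]
# The individual scores from the example output, as plain Python floats.
individual_scores = [0.99999994, 0.51747656]

# Agrees with the reported score up to float32 rounding.
print(round(mean(individual_scores), 4))  # 0.7587

# threshold=0.6 is an arbitrary cutoff chosen for this example.
print(low_scoring_pairs(ground_truth_answers, predicted_answers,
                        individual_scores, threshold=0.6))
# [('Paris', 'Lyon', 0.51747656)]
```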
In a pipeline
Below is an example where we use an AnswerExactMatchEvaluator and a SASEvaluator in a pipeline to evaluate two answers and compare them to ground truth answers. Running a pipeline instead of the individual components simplifies calculating more than one metric.
from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator, SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()
pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)

ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

# Both evaluators receive the same inputs, keyed by component name.
result = pipeline.run(
    {
        "em_evaluator": {"ground_truth_answers": ground_truth_answers,
                         "predicted_answers": predicted_answers},
        "sas_evaluator": {"ground_truth_answers": ground_truth_answers,
                          "predicted_answers": predicted_answers}
    }
)

for evaluator in result:
    print(result[evaluator]["individual_scores"])
# [1, 0]
# [array([[0.99999994]], dtype=float32), array([[0.51747656]], dtype=float32)]

for evaluator in result:
    print(result[evaluator]["score"])
# 0.5
# 0.7587383
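Because the pipeline result holds one entry per evaluator, keyed by the name given to add_component, the two metrics can be lined up per answer pair. The following sketch works on a hand-written dictionary that mirrors the output printed above, with the SAS values written as plain floats; `per_pair_rows` is a hypothetical helper, not a Haystack function.

```python
def per_pair_rows(ground_truths, predictions, result):
    """Build one row per answer pair with every evaluator's individual score."""
    rows = []
    for index, (truth, prediction) in enumerate(zip(ground_truths, predictions)):
        row = {"ground_truth": truth, "predicted": prediction}
        for name, output in result.items():
            # Round for display; the values here are plain numbers.
            row[name] = round(float(output["individual_scores"][index]), 4)
        rows.append(row)
    return rows


# Hand-written stand-in for the pipeline result shown above.
result = {
    "em_evaluator": {"individual_scores": [1, 0], "score": 0.5},
    "sas_evaluator": {"individual_scores": [0.99999994, 0.51747656],
                      "score": 0.7587383},
}

for row in per_pair_rows(["Berlin", "Paris"], ["Berlin", "Lyon"], result):
    print(row)
# {'ground_truth': 'Berlin', 'predicted': 'Berlin', 'em_evaluator': 1.0, 'sas_evaluator': 1.0}
# {'ground_truth': 'Paris', 'predicted': 'Lyon', 'em_evaluator': 0.0, 'sas_evaluator': 0.5175}
```

Laid out this way, the second pair shows how the metrics differ: exact match gives no credit to "Lyon", while SAS assigns partial similarity.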
Additional References
🧑🍳 Cookbook: Prompt Optimization with DSPy