
SASEvaluator

The SASEvaluator evaluates answers predicted by Haystack pipelines using ground truth labels. It checks the semantic similarity of each predicted answer and its ground truth answer using a fine-tuned language model. This metric is called semantic answer similarity (SAS).

Most common position in a pipeline: On its own or in an evaluation pipeline, after a separate pipeline that has generated the inputs for the Evaluator.

Mandatory init variables: "token": A Hugging Face API token. Can be set with the HF_API_TOKEN or HF_TOKEN environment variable.

Mandatory run variables: "ground_truth_answers": A list of strings containing the ground truth answers

"predicted_answers": A list of strings containing the predicted answers to be evaluated

Output variables: A dictionary containing:

- "score": A number from 0.0 to 1.0 representing the mean SAS score over all pairs of predicted and ground truth answers

- "individual_scores": A list of SAS scores, each ranging from 0.0 to 1.0, one for each pair of predicted and ground truth answers

API reference: Evaluators

GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/sas_evaluator.py

Overview

You can use the SASEvaluator component to evaluate answers predicted by a Haystack pipeline, such as a RAG pipeline, against ground truth labels.

You can provide a bi-encoder or cross-encoder model when initializing a SASEvaluator. By default, the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model is used.

Note that only one predicted answer is compared to one ground truth answer at a time. The component does not support multiple ground truth answers for the same question or multiple answers predicted for the same question.
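If your dataset does contain several ground truth answers per question, a common workaround (a sketch, not a built-in feature of the component) is to score the prediction against each ground truth separately and keep the best score per question. The `score_pair` function below is a stand-in for a real SAS computation:

```python
def score_pair(ground_truth: str, predicted: str) -> float:
    # Stand-in for a real semantic similarity model; here, exact match only.
    return 1.0 if ground_truth == predicted else 0.0

def best_score(ground_truths: list[str], predicted: str) -> float:
    # Score the prediction against every ground truth and keep the maximum.
    return max(score_pair(gt, predicted) for gt in ground_truths)

print(best_score(["Berlin", "Berlin, Germany"], "Berlin"))  # 1.0
```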

Usage

On its own

Below is an example of using a SASEvaluator component to evaluate two predicted answers against their ground truth answers. Note that warm_up() must be called before run() to load the model.

from haystack.components.evaluators import SASEvaluator

sas_evaluator = SASEvaluator()
sas_evaluator.warm_up()
result = sas_evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"]
)
print(result["individual_scores"])
# [array([[0.99999994]], dtype=float32), array([[0.51747656]], dtype=float32)]
print(result["score"])
# 0.7587383
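The aggregate score is simply the arithmetic mean of the individual pair scores. A quick sanity check of the numbers above, with the values copied from the printed output:

```python
# Individual SAS scores for the two (ground truth, prediction) pairs above.
individual_scores = [0.99999994, 0.51747656]

# "score" is the mean over all pairs.
mean_score = sum(individual_scores) / len(individual_scores)
print(mean_score)
```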

In a pipeline

Below is an example where we use an AnswerExactMatchEvaluator and a SASEvaluator in a pipeline to evaluate two answers and compare them to ground truth answers. Running a pipeline instead of the individual components simplifies calculating more than one metric.

from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator, SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()
pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)

ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

result = pipeline.run(
    {
        "em_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
        "sas_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
    }
)

for evaluator in result:
    print(result[evaluator]["individual_scores"])
# [1, 0]
# [array([[0.99999994]], dtype=float32), array([[0.51747656]], dtype=float32)]

for evaluator in result:
    print(result[evaluator]["score"])
# 0.5
# 0.7587383
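Since pipeline.run returns one output dictionary per evaluator, the aggregate scores can be collected into a single summary in one pass. A sketch, with the result dictionary hard-coded to match the outputs printed above:

```python
# Hypothetical result dict, hard-coded to match the outputs printed above.
result = {
    "em_evaluator": {"score": 0.5, "individual_scores": [1, 0]},
    "sas_evaluator": {"score": 0.7587383, "individual_scores": [1.0, 0.5174766]},
}

# Collect the aggregate score of each evaluator into one summary dict.
summary = {name: outputs["score"] for name, outputs in result.items()}
print(summary)  # {'em_evaluator': 0.5, 'sas_evaluator': 0.7587383}
```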

Additional References

🧑‍🍳 Cookbook: Prompt Optimization with DSPy