Most common position in a pipeline	On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator.
Mandatory run variables	- `ground_truth_answers`: A list of strings containing the ground truth answers - `predicted_answers`: A list of strings containing the predicted answers to be evaluated
Output variables	A dictionary containing: - `score`: A number from 0.0 to 1.0 representing the proportion of questions for which the predicted answer matched the ground truth answer - `individual_scores`: A list of 0s and 1s, where 1 means that the predicted answer matched the ground truth answer
API reference	Evaluators
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/answer_exact_match.py

Overview

You can use the AnswerExactMatchEvaluator component to evaluate answers predicted by a Haystack pipeline, such as an extractive question answering pipeline, against ground truth labels. Because the AnswerExactMatchEvaluator checks whether a predicted answer exactly matches the ground truth answer, it is not suited to evaluating answers generated by LLMs, for example, in a RAG pipeline. Use FaithfulnessEvaluator or SASEvaluator instead.

No parameters are required to initialize an AnswerExactMatchEvaluator.

Note that only one predicted answer is compared to one ground truth answer at a time. The component does not support multiple ground truth answers for the same question or multiple answers predicted for the same question.
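
The pairwise comparison described above can be sketched in plain Python. This is a minimal illustration that assumes plain string equality, not Haystack's actual implementation: `exact_match_scores` is a hypothetical helper, and the length check and the 0.0 score for empty input are assumptions made for the sketch.

```python
from typing import Dict, List, Union


def exact_match_scores(
    ground_truth_answers: List[str], predicted_answers: List[str]
) -> Dict[str, Union[float, List[int]]]:
    """Hypothetical helper, not part of Haystack: pairwise exact-match scoring."""
    # Assumption: both lists are aligned by question, so they must have equal length.
    if len(ground_truth_answers) != len(predicted_answers):
        raise ValueError("Both lists must contain the same number of answers.")

    # One comparison per question: 1 for an identical string, 0 otherwise.
    individual_scores = [
        1 if truth == predicted else 0
        for truth, predicted in zip(ground_truth_answers, predicted_answers)
    ]

    # The aggregate score is the proportion of matches (assumed 0.0 for empty input).
    score = sum(individual_scores) / len(individual_scores) if individual_scores else 0.0
    return {"individual_scores": individual_scores, "score": score}


result = exact_match_scores(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "The capital of France is Paris."],
)
print(result["individual_scores"])
# [1, 0]
print(result["score"])
# 0.5
```

The second prediction contains the correct city but is phrased as a full sentence, so it scores 0 under strict equality. This is the kind of free-form output for which a semantic metric is the better fit.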

Usage

On its own

Below is an example of using an AnswerExactMatchEvaluator component to evaluate two predicted answers against their ground truth answers.

from haystack.components.evaluators import AnswerExactMatchEvaluator

evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"],
)

print(result["individual_scores"])
# [1, 0]
print(result["score"])
# 0.5

In a pipeline

Below is an example where we use an AnswerExactMatchEvaluator and a SASEvaluator in a pipeline to evaluate two predicted answers against their ground truth answers. Running a pipeline instead of the individual components makes it simpler to calculate more than one metric.

from haystack import Pipeline
from haystack.components.evaluators import AnswerExactMatchEvaluator
from haystack.components.evaluators import SASEvaluator

pipeline = Pipeline()
em_evaluator = AnswerExactMatchEvaluator()
sas_evaluator = SASEvaluator()
pipeline.add_component("em_evaluator", em_evaluator)
pipeline.add_component("sas_evaluator", sas_evaluator)

ground_truth_answers = ["Berlin", "Paris"]
predicted_answers = ["Berlin", "Lyon"]

result = pipeline.run(
    {
        "em_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
        "sas_evaluator": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
    }
)

for evaluator in result:
    print(result[evaluator]["individual_scores"])
# [1, 0]
# [array([[0.99999994]], dtype=float32), array([[0.51747656]], dtype=float32)]

for evaluator in result:
    print(result[evaluator]["score"])
# 0.5
# 0.7587383