DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord

Statistical Evaluation

Haystack supports various statistical evaluation metrics. This page explains what statistical evaluation is and discusses the various options available within Haystack.

Introduction

Statistical evaluation in Haystack compares ground truth labels with pipeline predictions, typically using metrics such as precision or recall. It's often used to evaluate the Retriever component within Retrieval-Augmented Generative (RAG) pipelines, but this methodology can be adapted for any pipeline if ground truth labels of relevant documents are available.

When evaluating answers, such as those predicted by an extractive question answering pipeline, the ground truth labels of expected answers are compared to the pipeline's predictions.

For assessing answers generated by LLMs with one of Haystack’s Generator components, we recommend model-based evaluation instead. It can incorporate measures of semantic similarity or coherence and is better suited to evaluate predictions that might differ in wording from the ground truth labels.

Statistical Evaluation Pipelines in Haystack

There are two ways of performing model-based evaluation in Haystack, both of which leverage pipelines and Evaluator components:

  • You can create and run an evaluation pipeline independently. This means you’ll have to provide the required inputs to the evaluation pipeline manually. We recommend this way because the separation of your RAG pipeline and your evaluation pipeline allows you to store the results of your RAG pipeline and try out different evaluation metrics afterward without needing to re-run your pipeline every time.
  • As another option, you can add an Evaluator to the end of a RAG pipeline. This means you run both a RAG pipeline and evaluation on top of it in a single pipeline.run() call.

Statistical Evaluation of Retrieved Documents

Recall measures how often the correct document was among the retrieved documents over a set of queries. For a single query, the output is binary: either the correct document is contained in the selection, or it is not. Over the entire dataset, the recall score amounts to a number between zero (no query retrieved the right document) and one (all queries retrieved the right documents).

In some scenarios, there can be multiple correct documents for one query. The metric recall_single_hit considers whether at least one of the correct documents is retrieved, whereas recall_multi_hit takes into account how many of the multiple correct documents for one query are retrieved.

Note that recall is affected by the number of documents that the Retriever returns. If the Retriever returns few documents, it means that it is difficult to retrieve the correct documents. Make sure to set the Retriever's top_k to an appropriate value in the pipeline that you're evaluating.

DocumentMRREvaluator (Mean Reciprocal Rank)

In contrast to the recall metric, mean reciprocal rank takes the position of the top correctly retrieved document (the “rank”) into account. It does this to account for the fact that a query elicits multiple responses of varying relevance. Like recall, MRR can be a value between zero (no matches) and one (the system retrieved a correct document for all queries as the top result). For more details, check out Mean Reciprocal Rank wiki page.

DocumentMAPEvaluator (Mean Average Precision)

Mean average precision is similar to mean reciprocal rank but takes into account the position of every correctly retrieved document. Like MRR, mAP can be a value between zero (no matches) and one (the system retrieved correct documents for all top results). mAP is particularly useful in cases where there is more than one correct answer to be retrieved. For more details, check out Mean Average Precision wiki page.

Statistical Evaluation of Extracted or Generated Answers

Exact match measures the proportion of cases where the predicted Answer is identical to the correct Answer. For example, for the annotated question-answer pair “What is Haystack?" + "A question answering library in Python”, even a predicted answer like “A Python question answering library” would yield a zero score because it does not match the expected answer 100%.