Haystack has all the tools needed to evaluate entire pipelines or individual components like Retrievers, Readers, or Generators. This guide explains how to evaluate your pipeline in different scenarios and how to understand the metrics.

Use evaluation and its results to:

Judge how well your system is performing on a given domain,
Compare the performance of different models,
Identify underperforming components in your pipeline.

Evaluation Options

Evaluating individual components or end-to-end pipelines.

Evaluating individual components can help you understand performance bottlenecks and optimize one component at a time, for example, a Retriever or a prompt used with a Generator.

End-to-end evaluation runs the full pipeline and evaluates only its final outputs. The pipeline is treated as a black box.

Using ground-truth labels or no labels at all.

Most statistical evaluators require ground-truth labels, such as the documents relevant to the query or the expected answer. In contrast, most model-based evaluators work without any labels, simply by following the prompt instructions. However, including a few labeled examples in the prompt (few-shot prompting) can improve the evaluator.

Model-based evaluation using a language model or statistical evaluation.

Model-based evaluation uses LLMs with prompt instructions or smaller fine-tuned models to score aspects of a pipeline’s outputs. Statistical evaluation requires no models and is thus a more lightweight way to score pipeline outputs. For more information, see our docs on model-based evaluation and statistical evaluation.
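The options above can be made concrete with a small toy sketch in plain Python. It contrasts a component-level check (recall of a retriever against labeled relevant documents) with an end-to-end check (exact match on the final answer), both of which are statistical and need ground-truth labels. The `retrieve` and `generate` functions, the `DOCS` dictionary, and the labels are hypothetical stand-ins invented for illustration; they are not Haystack components.

```python
# Hypothetical toy components: stand-ins for a Retriever and a Generator,
# not Haystack classes.
DOCS = {
    "d1": "Berlin is the capital of Germany.",
    "d2": "Paris is the capital of France.",
    "d3": "The Rhine flows through Germany.",
}


def tokens(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation removed."""
    return {word.strip(".?,!").lower() for word in text.split()}


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return IDs of the top_k documents sharing the most words with the query."""
    def overlap(doc_id: str) -> int:
        return len(tokens(query) & tokens(DOCS[doc_id]))

    return sorted(DOCS, key=overlap, reverse=True)[:top_k]


def generate(query: str, doc_ids: list[str]) -> str:
    """Toy 'generator': answer with the first word of the top-ranked document."""
    return DOCS[doc_ids[0]].split()[0]


query = "What is the capital of France?"
relevant_ids = {"d2"}      # ground-truth label for the retriever
expected_answer = "Paris"  # ground-truth label for the final output

# Component-level: score the retriever alone with recall.
retrieved_ids = retrieve(query)
retriever_recall = len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)

# End-to-end: treat the pipeline as a black box and score only the final answer.
answer = generate(query, retrieve(query))
exact_match = 1.0 if answer == expected_answer else 0.0

print(retriever_recall, exact_match)
```

If the end-to-end score were low while the retriever recall stayed high, that would point at the generation step rather than retrieval, which is the kind of bottleneck analysis component-level evaluation is meant to support.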

Evaluator Components

Evaluator	Evaluates Answers or Documents	Model-based or Statistical	Requires Labels
AnswerExactMatchEvaluator	Answers	Statistical	Yes
ContextRelevanceEvaluator	Documents	Model-based	No
DocumentMRREvaluator	Documents	Statistical	Yes
DocumentMAPEvaluator	Documents	Statistical	Yes
DocumentRecallEvaluator	Documents	Statistical	Yes
FaithfulnessEvaluator	Answers	Model-based	No
LLMEvaluator	User-defined	Model-based	No
SASEvaluator	Answers	Model-based	Yes
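To give a feel for what the rank-based statistical document metrics in the table measure, here is a plain-Python sketch of the textbook definitions of mean reciprocal rank (MRR) and mean average precision (MAP). It is an illustration only, not Haystack's implementation: it assumes documents are identified by unique string IDs, and the function names and sample data are made up for this example. The actual evaluators may match documents and normalize scores differently.

```python
def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevant: set[str], retrieved: list[str]) -> float:
    """Sum of precision@k at every rank k holding a relevant document,
    divided by the total number of relevant documents (textbook variant)."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Hypothetical labels (relevant IDs) and retrieved rankings for two queries.
labels = [{"d2"}, {"d1", "d3"}]
rankings = [["d1", "d2", "d3"], ["d1", "d2", "d3"]]

mrr = mean([reciprocal_rank(rel, ret) for rel, ret in zip(labels, rankings)])
map_score = mean([average_precision(rel, ret) for rel, ret in zip(labels, rankings)])

print(f"MRR={mrr:.3f} MAP={map_score:.3f}")
```

Both metrics need only the labeled relevant documents and the retrieved ranking, which is why the statistical evaluators in the table are marked as requiring labels but no model.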

Evaluator Integrations

To learn more about our integration with the Ragas and DeepEval evaluation frameworks, head over to the RagasEvaluator and DeepEvalEvaluator component docs.

To get started using practical examples, check out our evaluation tutorial or the respective cookbooks below.

Additional References

📓 Tutorial: Evaluating RAG Pipelines

🧑‍🍳 Cookbooks: