
UpTrainEvaluator

The UpTrainEvaluator evaluates Haystack Pipelines using LLM-based metrics. It supports metrics like context relevance, factual accuracy, response relevance, and more.

Name: UpTrainEvaluator
Path: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/uptrain
Most common position in a Pipeline: On its own or in an evaluation Pipeline, to be used after a separate Pipeline has generated the inputs for the evaluator.
Mandatory input variables: “inputs”: a keyword-arguments dictionary containing the expected inputs. The expected inputs change based on the metric you are evaluating; see below for more details.
Output variables: “results”: a nested list of metric results. There can be one or more results, depending on the metric. Each result is a dictionary containing:
- name - The name of the metric.
- score - The score of the metric.
- explanation - An optional explanation of the score.
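
For illustration, evaluating a single metric on two inputs might produce a “results” structure shaped like the sketch below; the metric name, scores, and explanations are placeholders, not real output:

# Illustrative shape of the "results" output for one metric evaluated on two inputs.
# Each outer entry corresponds to one evaluated input; each inner dictionary is one metric result.
results = [
    [{"name": "context_relevance", "score": 0.5, "explanation": "..."}],
    [{"name": "context_relevance", "score": 1.0, "explanation": "..."}],
]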

UpTrain is an evaluation framework that provides a number of LLM-based evaluation metrics. You can use the UpTrainEvaluator component to evaluate a Haystack Pipeline, such as a retrieval-augmented generation (RAG) Pipeline, against one of the metrics provided by UpTrain.

Supported Metrics

UpTrain supports a number of metrics, which we expose through the UpTrainMetric enumeration. Below is the list of metrics supported by the UpTrainEvaluator in Haystack, along with the metric_params expected when initializing the evaluator and the inputs each metric requires; a short usage sketch follows the list.

For a complete guide on these metrics, visit the UpTrain documentation.

CONTEXT_RELEVANCE
- Metric parameters: None
- Expected inputs: questions: List[str], contexts: List[List[str]]
- Description: Grades how relevant the context was to the question specified.

FACTUAL_ACCURACY
- Metric parameters: None
- Expected inputs: questions: List[str], contexts: List[List[str]], responses: List[str]
- Description: Grades how factual the generated response was.

RESPONSE_RELEVANCE
- Metric parameters: None
- Expected inputs: questions: List[str], responses: List[str]
- Description: Grades how relevant the generated response is or whether it contains additional irrelevant information for the question asked.

RESPONSE_COMPLETENESS
- Metric parameters: None
- Expected inputs: questions: List[str], responses: List[str]
- Description: Grades how complete the generated response was for the question specified.

RESPONSE_COMPLETENESS_WRT_CONTEXT
- Metric parameters: None
- Expected inputs: questions: List[str], contexts: List[List[str]], responses: List[str]
- Description: Grades how complete the generated response was for the question specified, given the information provided in the context.

RESPONSE_CONSISTENCY
- Metric parameters: None
- Expected inputs: questions: List[str], contexts: List[List[str]], responses: List[str]
- Description: Grades how consistent the response is with the question asked as well as with the context provided.

RESPONSE_CONCISENESS
- Metric parameters: None
- Expected inputs: questions: List[str], responses: List[str]
- Description: Grades how concise the generated response is or whether it contains additional irrelevant information for the question asked.

CRITIQUE_LANGUAGE
- Metric parameters: None
- Expected inputs: responses: List[str]
- Description: Evaluates the response on multiple aspects (fluency, politeness, grammar, and coherence), providing a score for each aspect on a scale of 0 to 1, along with an explanation for the score.

CRITIQUE_TONE
- Metric parameters: llm_persona
- Expected inputs: responses: List[str]
- Description: Assesses the tone of machine-generated responses.

GUIDELINE_ADHERENCE
- Metric parameters: guideline, guideline_name, guideline_schema
- Expected inputs: questions: List[str], responses: List[str]
- Description: Grades how well the LLM adheres to a provided guideline when giving a response.

RESPONSE_MATCHING
- Metric parameters: method
- Expected inputs: responses: List[str], ground_truths: List[str]
- Description: Compares the LLM-generated text with the gold (ideal) response using the defined score metric.
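
As an illustration of how the expected inputs above map to what you pass at runtime, a factual-accuracy evaluation (which takes no metric_params) could be set up as in the following sketch, which mirrors the examples further down; the questions, contexts, and responses are placeholders:

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# FACTUAL_ACCURACY takes no metric_params and expects questions, contexts, and responses.
evaluator = UpTrainEvaluator(metric=UpTrainMetric.FACTUAL_ACCURACY)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)

results = evaluator_pipeline.run({"evaluator": {
    "questions": ["When was the Rhodes Statue built?"],
    "contexts": [["The Colossus of Rhodes was constructed around 280 BC."]],
    "responses": ["The Rhodes Statue was built in 280 BC."],
}})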

Parameters Overview

To initialize an UpTrainEvaluator, you need to provide the following parameters (a short initialization sketch follows the list):

  • metric: An UpTrainMetric.
  • metric_params: Optionally, if the metric calls for any additional parameters, you should provide them here.
  • api: The API you want to use with your evaluator, set to openai by default. Another supported API is uptrain. Check out the UpTrain docs for any changes to supported APIs.
  • api_key: By default, this component looks for an environment variable called OPENAI_API_KEY. To change this, pass Secret.from_env_var("YOUR_ENV_VAR") to this parameter.
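
For instance, an initialization that reads the key from a custom environment variable might look like the sketch below; the metric and the variable name MY_OPENAI_KEY are arbitrary examples:

from haystack.utils import Secret
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

# Read the OpenAI key from a custom environment variable instead of OPENAI_API_KEY.
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_RELEVANCE,
    api="openai",
    api_key=Secret.from_env_var("MY_OPENAI_KEY"),
)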

Usage

To use the UpTrainEvaluator, you first need to install the integration:

pip install uptrain-haystack

To use the UpTrainEvaluator you need to follow these steps:

  1. Initialize the UpTrainEvaluator while providing the correct metric_params for the metric you are using.
  2. Run the UpTrainEvaluator, either on its own or in a Pipeline, by providing the expected input for the metric you are using.

Examples

Evaluate Context Relevance

To create a context relevance evaluation Pipeline:

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_KEY'

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CONTEXT_RELEVANCE,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)

To run the evaluation Pipeline, you should have the expected inputs for the metric ready at hand. This metric expects a list of questions and a list of contexts; these should come from the results of the Pipeline you want to evaluate.

results = evaluator_pipeline.run({"evaluator": {"questions": ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?"],
                                                "contexts": [["Context for question 1", "Context 1"], ["Context for question 2"]]}})

Critique Tone

To create an evaluation Pipeline that critiques whether the tone of the response is “informative”:

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_KEY'

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_TONE,
    api="openai",
    metric_params={"llm_persona": "informative"}
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)

To run this evaluation Pipeline, you should have the expected inputs for the metric ready at hand. This metric expects a list of responses, which should come from the results of the Pipeline you want to evaluate.

evaluation_results = evaluator_pipeline.run({"evaluator": {"responses": ["The Rhodes Statue was built in 280 BC."]}})
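
Because the component can also be run on its own (step 2 above), here is a minimal standalone sketch, assuming the metric's expected inputs are passed directly as keyword arguments, as in the Pipeline examples:

from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

# Assumes OPENAI_API_KEY is set in the environment (the default api_key behaviour described above).
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_TONE,
    metric_params={"llm_persona": "informative"},
)

# Pass the metric's expected inputs directly as keyword arguments.
standalone_results = evaluator.run(responses=["The Rhodes Statue was built in 280 BC."])
print(standalone_results["results"])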

Related Links

Check out the API reference: