UpTrainEvaluator
The UpTrainEvaluator evaluates Haystack Pipelines using LLM-based metrics. It supports metrics like context relevance, factual accuracy, response relevance, and more.
Name | UpTrainEvaluator |
Path | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/uptrain |
Most common Position in a Pipeline | On its own or in an Evaluation Pipeline. To be used after a separate Pipeline has generated the inputs for the evaluator. |
Mandatory Input variables | “inputs”: A keyword arguments dictionary containing the expected inputs. The expected inputs change based on the metric you are evaluating. See below for more details. |
Output variables | “results”: A nested list of metric results. There can be one or more results, depending on the metric. Each result is a dictionary containing: name (the name of the metric), score (the score of the metric), and explanation (an optional explanation of the score). |
UpTrain is an evaluation framework that provides a number of LLM-based evaluation metrics. You can use the UpTrainEvaluator component to evaluate a Haystack Pipeline, such as a retrieval-augmented generation (RAG) Pipeline, against one of the metrics provided by UpTrain.
Supported Metrics
UpTrain supports a number of metrics, which we expose through the UpTrainMetric enumeration. The table below lists the metrics supported by the UpTrainEvaluator in Haystack, along with the metric_params each one expects when you initialize the evaluator.
For a complete guide on these metrics, visit the UpTrain documentation.
Metric | Metric Parameters | Expected inputs | Metric description |
---|---|---|---|
CONTEXT_RELEVANCE | None | questions:List[str] contexts:List[List[str]] | Grades how relevant the context was to the question specified. |
FACTUAL_ACCURACY | None | questions:List[str] contexts:List[List[str]] responses:List[str] | Grades how factual the generated response was. |
RESPONSE_RELEVANCE | None | questions:List[str] responses:List[str] | Grades how relevant the generated response is or if it has any additional irrelevant information for the question asked. |
RESPONSE_COMPLETENESS | None | questions:List[str] responses:List[str] | Grades how complete the generated response was for the question specified. |
RESPONSE_COMPLETENESS_WRT_CONTEXT | None | questions:List[str] contexts:List[List[str]] responses:List[str] | Grades how complete the generated response was for the question specified given the information provided in the context. |
RESPONSE_CONSISTENCY | None | questions:List[str] contexts:List[List[str]] responses:List[str] | Grades how consistent the response is with the question asked as well as with the context provided. |
RESPONSE_CONCISENESS | None | questions:List[str] responses:List[str] | Grades how concise the generated response is or if it has any additional irrelevant information for the question asked. |
CRITIQUE_LANGUAGE | None | responses:List[str] | Evaluate the response on multiple aspects - fluency, politeness, grammar, and coherence. It provides a score for each of the aspects on a scale of 0 to 1, along with an explanation for the score. |
CRITIQUE_TONE | llm_persona | responses:List[str] | Operator to assess the tone of machine generated responses. |
GUIDELINE_ADHERENCE | guideline guideline_name guideline_schema | questions:List[str] responses:List[str] | Grades how well the LLM adheres to a provided guideline when giving a response. |
RESPONSE_MATCHING | method | responses:List[str] ground_truths:List[str] | Operator to compare the LLM-generated text with the gold (ideal) response using the defined score metric. |
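The "Expected inputs" column can be checked before an evaluator is run. The following is a minimal sketch that copies the input names from the table into a plain dictionary. The `check_inputs` helper is hypothetical and not part of the integration, and the check that all lists have the same length is an assumption (one entry per evaluated example).

```python
from typing import Any, Dict, List

# Expected input names per metric, copied from the table above.
EXPECTED_INPUTS: Dict[str, List[str]] = {
    "CONTEXT_RELEVANCE": ["questions", "contexts"],
    "FACTUAL_ACCURACY": ["questions", "contexts", "responses"],
    "RESPONSE_RELEVANCE": ["questions", "responses"],
    "RESPONSE_COMPLETENESS": ["questions", "responses"],
    "RESPONSE_COMPLETENESS_WRT_CONTEXT": ["questions", "contexts", "responses"],
    "RESPONSE_CONSISTENCY": ["questions", "contexts", "responses"],
    "RESPONSE_CONCISENESS": ["questions", "responses"],
    "CRITIQUE_LANGUAGE": ["responses"],
    "CRITIQUE_TONE": ["responses"],
    "GUIDELINE_ADHERENCE": ["questions", "responses"],
    "RESPONSE_MATCHING": ["responses", "ground_truths"],
}


def check_inputs(metric_name: str, inputs: Dict[str, Any]) -> None:
    """Hypothetical helper: raise ValueError if inputs do not match the table."""
    expected = EXPECTED_INPUTS[metric_name]
    missing = [name for name in expected if name not in inputs]
    unexpected = [name for name in inputs if name not in expected]
    if missing or unexpected:
        raise ValueError(f"{metric_name}: missing={missing}, unexpected={unexpected}")
    # Assumption: every input list has one entry per evaluated example.
    lengths = {len(inputs[name]) for name in expected}
    if len(lengths) > 1:
        raise ValueError(f"{metric_name}: all input lists must have the same length")


# Passes silently: both expected inputs are present and have one entry each.
check_inputs(
    "CONTEXT_RELEVANCE",
    {
        "questions": ["When was the Rhodes Statue built?"],
        "contexts": [["Context for question 1"]],
    },
)
```

A check like this surfaces a missing `contexts` or `responses` list before any LLM call is made.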
Parameters Overview
To initialize an UpTrainEvaluator, you need to provide the following parameters:
- metric: An UpTrainMetric.
- metric_params: Optionally, if the metric calls for any additional parameters, you should provide them here.
- api: The API you want to use with your evaluator, set to openai by default. Another supported API is uptrain. Check out the UpTrain docs for any changes to supported APIs.
- api_key: By default, this component looks for an environment variable called OPENAI_API_KEY. To change this, pass Secret.from_env_var("YOUR_ENV_VAR") to this parameter.
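As a rough sketch of how the api and api_key parameters above fit together, the function below builds an evaluator that uses the uptrain API and reads its key from a custom environment variable. The variable name `UPTRAIN_API_KEY` and the `build_evaluator` wrapper are assumptions made for illustration. `Secret` is imported from `haystack.utils`.

```python
def build_evaluator(env_var: str = "UPTRAIN_API_KEY"):
    """Return an UpTrainEvaluator that uses the UpTrain API.

    The env_var default is a hypothetical name; use whichever variable holds your key.
    """
    # Imported lazily so this sketch loads even where uptrain-haystack is not installed.
    from haystack.utils import Secret
    from haystack_integrations.components.evaluators.uptrain import (
        UpTrainEvaluator,
        UpTrainMetric,
    )

    return UpTrainEvaluator(
        metric=UpTrainMetric.RESPONSE_COMPLETENESS,  # takes no metric_params
        api="uptrain",
        api_key=Secret.from_env_var(env_var),
    )
```

Calling `build_evaluator()` requires the integration to be installed and the chosen environment variable to be set.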
Usage
To use the UpTrainEvaluator, you first need to install the integration:
pip install uptrain-haystack
Once the integration is installed, follow these steps:
- Initialize the UpTrainEvaluator, providing the correct metric_params for the metric you are using.
- Run the UpTrainEvaluator, either on its own or in a Pipeline, by providing the expected input for the metric you are using.
Examples
Evaluate Context Relevance
To create a context relevance evaluation Pipeline:
import os

from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

# By default, the evaluator reads the key from the OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# CONTEXT_RELEVANCE takes no metric_params.
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CONTEXT_RELEVANCE,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
To run the evaluation Pipeline, you should have the expected inputs for the metric ready at hand. This metric expects a list of questions and, for each question, a list of contexts. Both should come from the results of the Pipeline you want to evaluate.
results = evaluator_pipeline.run({"evaluator": {
    "questions": ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?"],
    "contexts": [["Context for question 1"], ["Context for question 2"]],
}})
Critique Tone
To create an evaluation Pipeline that critiques whether the tone of the response is “informative”:
import os

from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# CRITIQUE_TONE requires the llm_persona metric parameter.
evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_TONE,
    api="openai",
    metric_params={"llm_persona": "informative"},
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
To run this evaluation Pipeline, you should have the expected inputs for the metric ready at hand. This metric expects a list of responses, which should come from the results of the Pipeline you want to evaluate.
evaluation_results = evaluator_pipeline.run({"evaluator": {"responses": ["The Rhodes Statue was built in 280 BC."]}})
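Once a run finishes, the output can be read using the structure described under "Output variables". The sketch below uses a hand-written stand-in for the Pipeline output, so the metric name, scores, and explanation text are illustrative and not real UpTrain results. It assumes the outer list holds one entry per evaluated input and that the Pipeline nests the output under the component name ("evaluator"). The `flatten_results` helper is hypothetical.

```python
from typing import Any, Dict, List

# Illustrative stand-in for a Pipeline run output; all values are made up.
pipeline_output: Dict[str, Any] = {
    "evaluator": {
        "results": [
            [{"name": "example_metric", "score": 1.0, "explanation": "Illustrative text."}],
            [{"name": "example_metric", "score": 0.5, "explanation": None}],
        ]
    }
}


def flatten_results(output: Dict[str, Any], component: str = "evaluator") -> List[Dict[str, Any]]:
    """Hypothetical helper: turn the nested results list into flat rows."""
    rows = []
    for input_index, metric_results in enumerate(output[component]["results"]):
        for result in metric_results:
            rows.append(
                {
                    "input_index": input_index,
                    "name": result["name"],
                    "score": result["score"],
                    # The explanation is optional, so fall back to None.
                    "explanation": result.get("explanation"),
                }
            )
    return rows


for row in flatten_results(pipeline_output):
    print(row["input_index"], row["name"], row["score"])
```

Flat rows like these are easier to aggregate or load into a table than the nested list.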