LLMEvaluator
This Evaluator uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples.
Most common position in a pipeline | On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator. |
Mandatory init variables | "instructions": The prompt instructions string; "inputs": The expected inputs; "outputs": The output names of the evaluation results; "examples": Few-shot examples conforming to the input and output format |
Mandatory run variables | "inputs": Defined by the user, for example, questions or responses |
Output variables | "results": A list of dictionaries, one per evaluated input, containing keys defined by the user, such as score |
API reference | Evaluators |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/llm_evaluator.py |
Overview
The LLMEvaluator component can evaluate answers, documents, or any other outputs of a Haystack pipeline based on a user-defined aspect. The component combines the instructions, examples, and expected output names into one prompt. It is meant for calculating user-defined model-based evaluation metrics. If you are looking for pre-defined model-based evaluators that work out of the box, have a look at Haystack’s FaithfulnessEvaluator and ContextRelevanceEvaluator components instead.
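To make the idea of combining instructions, examples, and output names into one prompt more concrete, here is a minimal sketch in plain Python. The build_prompt helper and the layout of the prompt are assumptions made for illustration only; the template that LLMEvaluator actually uses may differ.

```python
import json
from typing import Any, Dict, List


def build_prompt(
    instructions: str,
    outputs: List[str],
    examples: List[Dict[str, Any]],
    inputs: Dict[str, Any],
) -> str:
    """Hypothetical helper: merge all parts into a single prompt string."""
    lines = [
        "Instructions:",
        instructions,
        "",
        # Tell the model which keys the JSON answer should contain.
        f"Respond with a JSON object containing the keys: {json.dumps(outputs)}",
        "",
        "Examples:",
    ]
    # Each few-shot example contributes one inputs line and one outputs line.
    for example in examples:
        lines.append(f"Inputs: {json.dumps(example['inputs'])}")
        lines.append(f"Outputs: {json.dumps(example['outputs'])}")
    # The item to evaluate comes last, with the outputs left open for the model.
    lines.append("")
    lines.append(f"Inputs: {json.dumps(inputs)}")
    lines.append("Outputs:")
    return "\n".join(lines)


prompt = build_prompt(
    instructions="Is this answer problematic for children?",
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
    inputs={"responses": "Python language was created by Guido van Rossum."},
)
print(prompt)
```

The sketch also shows why every additional example makes the prompt longer: each one adds two lines to every request.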
Parameters
The default model for this Evaluator is gpt-4o-mini. You can override the model using the api_params parameter during initialization.
A valid OpenAI API key must be set as an OPENAI_API_KEY environment variable. Alternatively, the LLMEvaluator's api_key parameter allows you to provide the API key in a different way. For details, see our documentation page on secret management.
LLMEvaluator takes the following six parameters at initialization. The first four are mandatory:
instructions: The prompt instructions to use for evaluation, such as a question about the inputs that the LLM can answer with yes, no, or a score.
inputs: The inputs that the LLMEvaluator expects and that it evaluates. The inputs determine the incoming connections of the component. Each input is a tuple of an input name and input type. Input types must be lists. An example could be [("responses", List[str])].
outputs: Output names of the evaluation results corresponding to keys in the output dictionary. An example could be ["score"].
examples: Use this parameter to pass few-shot examples conforming to the expected input and output format. These examples are included in the prompt that is sent to the LLM. Examples increase the number of tokens of the prompt and make each request more costly. Adding more than one or two examples can be helpful if you want to improve the quality of the evaluation at the cost of more tokens.
raise_on_failure: If True, raise an exception on an unsuccessful API call.
progress_bar: Whether to show a progress bar during the evaluation.
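Each example must be a dictionary with the keys inputs and outputs. Both values must themselves be dictionaries, and their keys should match the names you declared in the inputs and outputs parameters.
For example, for an evaluator declared with the inputs questions and contexts and the outputs statements and statement_scores, the expected format is: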
[
    {
        "inputs": {
            "questions": "What is the capital of Italy?",
            "contexts": ["Rome is the capital of Italy."],
        },
        "outputs": {
            "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
            "statement_scores": [1, 0],
        },
    }
]
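Because a mismatch between the few-shot examples and the declared input and output names is easy to overlook, it can help to check the examples before building the evaluator. The following is a minimal sketch in plain Python; check_examples is a hypothetical helper written for this page, not part of Haystack, and the component's own validation may be stricter or more lenient.

```python
from typing import Any, Dict, List


def check_examples(
    examples: List[Dict[str, Any]],
    input_names: List[str],
    output_names: List[str],
) -> List[str]:
    """Hypothetical helper: return a list of problems, empty if the examples look consistent."""
    problems = []
    for i, example in enumerate(examples):
        # Every example needs exactly the two top-level keys.
        if not isinstance(example, dict) or set(example) != {"inputs", "outputs"}:
            problems.append(f"example {i}: must be a dict with exactly the keys 'inputs' and 'outputs'")
            continue
        for part, expected in (("inputs", input_names), ("outputs", output_names)):
            value = example[part]
            if not isinstance(value, dict):
                problems.append(f"example {i}: {part} must be a dict")
                continue
            # The nested keys should be the names declared for the evaluator.
            if set(value) != set(expected):
                problems.append(f"example {i}: {part} keys {sorted(value)} do not match {sorted(expected)}")
    return problems


examples = [
    {
        "inputs": {
            "questions": "What is the capital of Italy?",
            "contexts": ["Rome is the capital of Italy."],
        },
        "outputs": {
            "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
            "statement_scores": [1, 0],
        },
    }
]

ok = check_examples(examples, ["questions", "contexts"], ["statements", "statement_scores"])
bad = check_examples([{"inputs": {"question": "?"}, "outputs": {"score": 1}}], ["questions"], ["score"])
print(ok)
print(bad)
```

The second call uses a misspelled input name (question instead of questions) to show the kind of mistake such a check can catch.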
Usage
On its own
Below is an example where we use an LLMEvaluator component to evaluate generated responses. The aspect we evaluate is whether a response is problematic for children, as defined in the instructions. The LLMEvaluator returns one binary score per input response. Here, neither response is flagged as problematic.
from typing import List

from haystack.components.evaluators import LLMEvaluator

llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)

responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

results = llm_evaluator.run(responses=responses)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
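Since the output holds one dictionary per input, you will usually want to pair the scores with the responses and aggregate them. The sketch below does this in plain Python on a copy of the output shown above, so it runs without an API key. The summarize helper is hypothetical, and skipping None entries is an assumption about how failed evaluations might be represented when raise_on_failure is False.

```python
from typing import Any, Dict, List


def summarize(responses: List[str], output: Dict[str, Any], key: str = "score") -> Dict[str, Any]:
    """Hypothetical helper: pair each response with its result and aggregate one output key."""
    rows = list(zip(responses, output["results"]))
    # Assumption: a failed evaluation may show up as None, so skip such entries.
    scores = [result[key] for _, result in rows if result is not None]
    return {
        "evaluated": len(scores),
        "mean": sum(scores) / len(scores) if scores else None,
        "flagged": [resp for resp, result in rows if result is not None and result[key] == 1],
    }


responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
# Copied from the example output above instead of calling the LLM.
output = {"results": [{"score": 0}, {"score": 0}]}

summary = summarize(responses, output)
print(summary)
```

With a binary score, the mean is the share of responses that the LLM flagged, and the flagged list tells you which ones to inspect.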
In a pipeline
Below is an example where we use an LLMEvaluator in a pipeline to evaluate responses.
from typing import List

from haystack import Pipeline
from haystack.components.evaluators import LLMEvaluator

pipeline = Pipeline()
llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)
pipeline.add_component("llm_evaluator", llm_evaluator)

responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

# Pipeline inputs are keyed by component name.
result = pipeline.run(
    {
        "llm_evaluator": {"responses": responses}
    }
)

# The pipeline output is also keyed by component name.
for evaluator in result:
    print(result[evaluator]["results"])
# [{'score': 0}, {'score': 0}]