Module eval_run_result
Represents the results of evaluation.
BaseEvaluationRunResult
Represents the results of an evaluation run.
BaseEvaluationRunResult.to_pandas
@abstractmethod
def to_pandas() -> "DataFrame"
Creates a Pandas DataFrame containing the scores of each metric for every input sample.
Returns:
Pandas DataFrame with the scores.
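For illustration, a minimal sketch of calling this method on the concrete `EvaluationRunResult` documented below; the import path and the metric name used here are assumptions for the example, not guarantees of this reference:

```python
from haystack.evaluation.eval_run_result import EvaluationRunResult  # assumed import path

# Hypothetical run: two input questions scored by a single metric.
result = EvaluationRunResult(
    run_name="baseline",
    inputs={"questions": ["What is BM25?", "What is RAG?"]},
    results={"faithfulness": {"score": 0.85, "individual_scores": [0.9, 0.8]}},
)

df = result.to_pandas()
# Expect one row per input sample: the input columns ("questions")
# followed by one column per metric ("faithfulness").
print(df)
```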
BaseEvaluationRunResult.score_report
@abstractmethod
def score_report() -> "DataFrame"
Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
Returns:
Pandas DataFrame with the aggregated scores.
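Continuing with the `result` object from the sketch above, `score_report` returns the aggregated view rather than per-sample rows:

```python
report = result.score_report()
# Expect one aggregated score per metric, e.g. a "faithfulness"
# entry carrying its 'score' value (0.85 in the sketch above).
print(report)
```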
BaseEvaluationRunResult.comparative_individual_scores_report
@abstractmethod
def comparative_individual_scores_report(
other: "BaseEvaluationRunResult") -> "DataFrame"
Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
The inputs to both evaluation runs are assumed to be the same.
Arguments:
other
: Results of another evaluation run to compare with.
Returns:
Pandas DataFrame with the score comparison.
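A sketch of comparing two runs over the same inputs, continuing from the `result` object above; the second run's scores are invented for illustration:

```python
other_result = EvaluationRunResult(
    run_name="reranked",
    inputs={"questions": ["What is BM25?", "What is RAG?"]},  # same inputs as `result`
    results={"faithfulness": {"score": 0.95, "individual_scores": [0.95, 0.95]}},
)

comparison = result.comparative_individual_scores_report(other_result)
# Expect one row per input sample, with each run's per-metric
# scores side by side, distinguished by run name.
print(comparison)
```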
EvaluationRunResult
Contains the inputs and the outputs of an evaluation pipeline and provides methods to inspect them.
EvaluationRunResult.__init__
def __init__(run_name: str, inputs: Dict[str, List[Any]],
results: Dict[str, Dict[str, Any]])
Initialize a new evaluation run result.
Arguments:
run_name
: Name of the evaluation run.
inputs
: Dictionary containing the inputs used for the run. Each key is the name of the input and its value is a list of input values. The lists should all have the same length.
results
: Dictionary containing the results of the evaluators used in the evaluation pipeline. Each key is the name of the metric and its value is a dictionary with the following keys:
- 'score': The aggregated score for the metric.
- 'individual_scores': A list of scores for each input sample.
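A construction sketch showing the documented shape constraints: every input list shares one length, and each metric entry carries exactly the two documented keys. The input and metric names here are made up:

```python
from haystack.evaluation.eval_run_result import EvaluationRunResult  # assumed import path

run = EvaluationRunResult(
    run_name="multi_metric_demo",
    inputs={
        # All input lists must have the same length (3 samples here).
        "questions": ["q1", "q2", "q3"],
        "contexts": ["c1", "c2", "c3"],
    },
    results={
        # Each metric maps to the two documented keys.
        "faithfulness": {"score": 0.83, "individual_scores": [0.9, 0.7, 0.9]},
        "exact_match": {"score": 0.33, "individual_scores": [1.0, 0.0, 0.0]},
    },
)
```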