API Reference

Represents the results of evaluation.

Module base

BaseEvaluationRunResult

Represents the results of an evaluation run.

BaseEvaluationRunResult.to_pandas

@abstractmethod
def to_pandas() -> "DataFrame"

Creates a Pandas DataFrame containing the scores of each metric for every input sample.

Returns:

Pandas DataFrame with the scores.

BaseEvaluationRunResult.score_report

@abstractmethod
def score_report() -> "DataFrame"

Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.

Returns:

Pandas DataFrame with the aggregated scores.

BaseEvaluationRunResult.comparative_individual_scores_report

@abstractmethod
def comparative_individual_scores_report(
        other: "BaseEvaluationRunResult",
        keep_columns: Optional[List[str]] = None) -> "DataFrame"

Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.

The inputs to both evaluation runs are assumed to be the same.

Arguments:

  • other: Results of another evaluation run to compare with.
  • keep_columns: List of common column names to keep from the inputs of the evaluation runs to compare.

Returns:

Pandas DataFrame with the score comparison.

Module eval_run_result

EvaluationRunResult

Contains the inputs and the outputs of an evaluation pipeline and provides methods to inspect them.

EvaluationRunResult.__init__

def __init__(run_name: str, inputs: Dict[str, List[Any]],
             results: Dict[str, Dict[str, Any]])

Initialize a new evaluation run result.

Arguments:

  • run_name: Name of the evaluation run.
  • inputs: Dictionary containing the inputs used for the run. Each key is the name of an input and its value is a list of input values. All lists should have the same length.
  • results: Dictionary containing the results of the evaluators used in the evaluation pipeline. Each key is the name of a metric and its value is a dictionary with the following keys:
      - 'score': The aggregated score for the metric.
      - 'individual_scores': A list of scores, one for each input sample.
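
The argument shapes described above can be sketched with plain dictionaries. This is a minimal illustration, not the library's own validation: the helper `check_run_arguments`, the metric names, and the sample data are hypothetical, and the check that each metric has exactly one score per sample is an assumption drawn from the description of 'individual_scores'.

```python
from typing import Any, Dict, List


def check_run_arguments(inputs: Dict[str, List[Any]],
                        results: Dict[str, Dict[str, Any]]) -> int:
    """Hypothetical helper: validate the documented shapes and return the sample count."""
    # Every input list should have the same length.
    lengths = {len(values) for values in inputs.values()}
    if len(lengths) != 1:
        raise ValueError(f"Input lists differ in length: {sorted(lengths)}")
    n_samples = lengths.pop()

    for metric, outputs in results.items():
        # Each metric entry needs both documented keys.
        for key in ("score", "individual_scores"):
            if key not in outputs:
                raise ValueError(f"Metric '{metric}' is missing the '{key}' key")
        # Assumption: one individual score per input sample.
        if len(outputs["individual_scores"]) != n_samples:
            raise ValueError(f"Metric '{metric}' does not have one score per input sample")
    return n_samples


# Illustrative data only; the metric names are made up for this example.
inputs = {
    "question": ["What is the capital of France?", "Who wrote Hamlet?"],
    "predicted_answer": ["Paris", "Christopher Marlowe"],
}
results = {
    "exact_match": {"score": 0.5, "individual_scores": [1, 0]},
    "similarity": {"score": 0.75, "individual_scores": [1.0, 0.5]},
}

n_samples = check_run_arguments(inputs, results)
# The validated dictionaries could then be passed to the constructor, e.g.:
# EvaluationRunResult(run_name="baseline", inputs=inputs, results=results)
```

Checking the shapes up front is a design choice of this sketch; it surfaces mismatched list lengths before any report is built.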

EvaluationRunResult.score_report

def score_report() -> DataFrame

Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.

Returns:

Pandas DataFrame with the aggregated scores.

EvaluationRunResult.to_pandas

def to_pandas() -> DataFrame

Creates a Pandas DataFrame containing the scores of each metric for every input sample.

Returns:

Pandas DataFrame with the scores.
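
The two methods above differ in granularity: score_report summarizes one aggregated score per metric, while to_pandas lays out one row per input sample. The following dependency-free sketch shows the two shapes as lists of row dictionaries. The function names, column names, and sample data are hypothetical, and placing the input values next to the scores in the per-sample rows is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, Dict, List

# Illustrative data only; the metric names are made up for this example.
inputs = {"question": ["q1", "q2"], "predicted_answer": ["a1", "a2"]}
results = {
    "exact_match": {"score": 0.5, "individual_scores": [1, 0]},
    "similarity": {"score": 0.75, "individual_scores": [1.0, 0.5]},
}


def aggregated_rows(results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """One row per metric: the granularity of an aggregated score report."""
    return [{"metric": name, "score": out["score"]} for name, out in results.items()]


def per_sample_rows(inputs: Dict[str, List[Any]],
                    results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """One row per input sample: the granularity of a per-sample score table."""
    n_samples = len(next(iter(inputs.values())))
    rows = []
    for i in range(n_samples):
        # Assumption: input values appear alongside the metric scores.
        row = {name: values[i] for name, values in inputs.items()}
        row.update({metric: out["individual_scores"][i] for metric, out in results.items()})
        rows.append(row)
    return rows


summary = aggregated_rows(results)
details = per_sample_rows(inputs, results)
```

Either list of row dictionaries can be handed to `pandas.DataFrame(...)` to obtain a tabular view, if pandas is available.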

EvaluationRunResult.comparative_individual_scores_report

def comparative_individual_scores_report(
        other: "BaseEvaluationRunResult",
        keep_columns: Optional[List[str]] = None) -> DataFrame

Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.

The inputs to both evaluation runs are assumed to be the same.

Arguments:

  • other: Results of another evaluation run to compare with.
  • keep_columns: List of common column names to keep from the inputs of the evaluation runs to compare.

Returns:

Pandas DataFrame with the score comparison.
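
To illustrate the idea of a side-by-side comparison, here is a minimal sketch in plain Python that does not use the library. The function `compare_individual_scores`, the `"<run_name>_<metric>"` column naming, and the treatment of `keep_columns=None` as "keep every input column" are all assumptions made for this example; the actual column layout of the returned DataFrame may differ.

```python
from typing import Any, Dict, List, Optional


def compare_individual_scores(
    inputs: Dict[str, List[Any]],
    run_a: Dict[str, Any],
    run_b: Dict[str, Any],
    keep_columns: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: one row per sample with both runs' scores side by side."""
    # Assumption: None keeps every shared input column.
    kept = [c for c in inputs if keep_columns is None or c in keep_columns]
    n_samples = len(next(iter(inputs.values())))
    rows = []
    for i in range(n_samples):
        row = {column: inputs[column][i] for column in kept}
        for run in (run_a, run_b):
            for metric, scores in run["individual_scores"].items():
                # Assumed naming scheme: "<run_name>_<metric>".
                row[f"{run['run_name']}_{metric}"] = scores[i]
        rows.append(row)
    return rows


# Both runs are assumed to share the same inputs, as the method requires.
inputs = {"question": ["q1", "q2"], "context": ["c1", "c2"]}
baseline = {"run_name": "baseline", "individual_scores": {"exact_match": [1, 0]}}
candidate = {"run_name": "candidate", "individual_scores": {"exact_match": [1, 1]}}

report = compare_individual_scores(inputs, baseline, candidate, keep_columns=["question"])
```

Prefixing each score column with the run name keeps identically named metrics from the two runs distinguishable in a single table.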