Evaluation harness for Haystack.
Module haystack_experimental.evaluation.harness.evaluation_harness
EvaluationRunOverrides
Overrides for an evaluation run.
Used to override the init parameters of components in either (or both) the evaluated and evaluation pipelines. Each key is a component name and its value a dictionary with init parameters to override.
Arguments:
evaluated_pipeline_overrides
: Overrides for the evaluated pipeline.
evaluation_pipeline_overrides
: Overrides for the evaluation pipeline.
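A minimal sketch of constructing overrides, assuming EvaluationRunOverrides is instantiated with the two documented fields; the component names and parameters below are purely illustrative:

```python
from haystack_experimental.evaluation.harness.evaluation_harness import (
    EvaluationRunOverrides,
)

# Illustrative component names ("retriever", "evaluator"); each value is a
# dict of init parameters to override on that component.
overrides = EvaluationRunOverrides(
    evaluated_pipeline_overrides={"retriever": {"top_k": 5}},
    evaluation_pipeline_overrides={"evaluator": {"raise_on_failure": False}},
)
```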
EvaluationHarness
Executes a pipeline with a given set of parameters and inputs, then evaluates its outputs with an evaluation pipeline.
EvaluationHarness.run
@abstractmethod
def run(inputs: EvalRunInputT,
        *,
        overrides: Optional[EvalRunOverridesT] = None,
        run_name: Optional[str] = None) -> EvalRunOutputT
Launch an evaluation run.
Arguments:
inputs
: Inputs to the evaluated and evaluation pipelines.
overrides
: Overrides for the harness.
run_name
: A name for the evaluation run.
Returns:
The output of the evaluation pipeline.
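As a hedged usage sketch, a concrete subclass is called like this; `harness` and `eval_input` stand in for a concrete harness and an instance of its input type:

```python
# Minimal sketch; `harness` is a concrete EvaluationHarness subclass (e.g. the
# RAG harness below) and `eval_input` an instance of its input type.
output = harness.run(eval_input, run_name="baseline")
```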
Module haystack_experimental.evaluation.harness.rag.harness
DefaultRAGArchitecture
Represents default RAG pipeline architectures that can be used with the evaluation harness.
EMBEDDING_RETRIEVAL
A RAG pipeline with:
- A query embedder component named 'query_embedder' with a 'text' input.
- A document retriever component named 'retriever' with a 'documents' output.
KEYWORD_RETRIEVAL
A RAG pipeline with:
- A document retriever component named 'retriever' with a 'query' input and a 'documents' output.
GENERATION_WITH_EMBEDDING_RETRIEVAL
A RAG pipeline with:
- A query embedder component named 'query_embedder' with a 'text' input.
- A document retriever component named 'retriever' with a 'documents' output.
- A response generator component named 'generator' with a 'replies' output.
GENERATION_WITH_KEYWORD_RETRIEVAL
A RAG pipeline with:
- A document retriever component named 'retriever' with a 'query' input and a 'documents' output.
- A response generator component named 'generator' with a 'replies' output.
DefaultRAGArchitecture.expected_components
@property
def expected_components() -> Dict[RAGExpectedComponent, RAGExpectedComponentMetadata]
Returns the expected components for the architecture.
Returns:
The expected components.
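For instance, the component mapping of a default architecture can be inspected like this (a sketch; the metadata fields are those documented under RAGExpectedComponentMetadata below):

```python
from haystack_experimental.evaluation.harness.rag.harness import DefaultRAGArchitecture

# Print each expected component of the embedding-retrieval architecture,
# along with its in-pipeline name and its input/output mappings.
for component, metadata in DefaultRAGArchitecture.EMBEDDING_RETRIEVAL.expected_components.items():
    print(component, metadata.name, metadata.input_mapping, metadata.output_mapping)
```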
RAGEvaluationHarness
Evaluation harness for evaluating RAG pipelines.
RAGEvaluationHarness.__init__
def __init__(rag_pipeline: Pipeline,
             rag_components: Union[
                 DefaultRAGArchitecture,
                 Dict[RAGExpectedComponent, RAGExpectedComponentMetadata],
             ],
             metrics: Set[RAGEvaluationMetric],
             *,
             progress_bar: bool = True)
Create an evaluation harness for evaluating basic RAG pipelines.
Arguments:
rag_pipeline
: The RAG pipeline to evaluate.
rag_components
: Either a default RAG architecture or a mapping of expected components to their metadata.
metrics
: The metrics to use during evaluation.
progress_bar
: Whether to display a progress bar during evaluation.
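A sketch of constructing the harness with a default architecture, assuming `rag_pipeline` is an existing Haystack Pipeline whose component names match that architecture:

```python
from haystack_experimental.evaluation.harness.rag.harness import (
    DefaultRAGArchitecture,
    RAGEvaluationHarness,
)
from haystack_experimental.evaluation.harness.rag.parameters import RAGEvaluationMetric

# `rag_pipeline` (assumed to be defined elsewhere) must contain components
# named "query_embedder", "retriever", and "generator", as this default
# architecture requires.
harness = RAGEvaluationHarness(
    rag_pipeline=rag_pipeline,
    rag_components=DefaultRAGArchitecture.GENERATION_WITH_EMBEDDING_RETRIEVAL,
    metrics={
        RAGEvaluationMetric.DOCUMENT_MAP,
        RAGEvaluationMetric.FAITHFULNESS,
    },
)
```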
Module haystack_experimental.evaluation.harness.rag.parameters
RAGExpectedComponent
Represents the basic components in a RAG pipeline that are, by default, required to be present for evaluation.
Each of these can be separate components in the pipeline or a single component that performs multiple tasks.
QUERY_PROCESSOR
The component in a RAG pipeline that accepts the user query.
Expected inputs: query
- Name of input that contains the query string.
DOCUMENT_RETRIEVER
The component in a RAG pipeline that retrieves documents based on the query.
Expected outputs: retrieved_documents
- Name of output containing retrieved documents.
RESPONSE_GENERATOR
The component in a RAG pipeline that generates responses based on the query and the retrieved documents.
Can be omitted if the harness only evaluates retrieval.
Expected outputs: replies
- Name of output containing the LLM responses. Only the first response is used.
RAGExpectedComponentMetadata
Metadata for a RAGExpectedComponent.
Arguments:
name
: Name of the component in the pipeline.
input_mapping
: Mapping of the expected inputs to the corresponding component input names.
output_mapping
: Mapping of the expected outputs to the corresponding component output names.
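For pipelines whose component names differ from the defaults, a custom mapping can be built. A sketch, assuming omitted mappings default to empty dictionaries; the `my_*` component names and their input/output names are illustrative:

```python
from haystack_experimental.evaluation.harness.rag.parameters import (
    RAGExpectedComponent,
    RAGExpectedComponentMetadata,
)

# Keys of input_mapping/output_mapping are the expected names documented above
# ("query", "retrieved_documents", "replies"); values are the actual input and
# output names of the named component. The "my_*" names are illustrative.
rag_components = {
    RAGExpectedComponent.QUERY_PROCESSOR: RAGExpectedComponentMetadata(
        name="my_embedder", input_mapping={"query": "text"}
    ),
    RAGExpectedComponent.DOCUMENT_RETRIEVER: RAGExpectedComponentMetadata(
        name="my_retriever", output_mapping={"retrieved_documents": "documents"}
    ),
    RAGExpectedComponent.RESPONSE_GENERATOR: RAGExpectedComponentMetadata(
        name="my_generator", output_mapping={"replies": "replies"}
    ),
}
```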
RAGEvaluationMetric
Represents the metrics that can be used to evaluate a RAG pipeline.
DOCUMENT_MAP
Document Mean Average Precision. Required RAG components: Query Processor, Document Retriever.
DOCUMENT_MRR
Document Mean Reciprocal Rank. Required RAG components: Query Processor, Document Retriever.
DOCUMENT_RECALL_SINGLE_HIT
Document Recall with a single hit. Required RAG components: Query Processor, Document Retriever.
DOCUMENT_RECALL_MULTI_HIT
Document Recall with multiple hits. Required RAG components: Query Processor, Document Retriever.
SEMANTIC_ANSWER_SIMILARITY
Semantic Answer Similarity. Required RAG components: Query Processor, Response Generator.
FAITHFULNESS
Faithfulness. Required RAG components: Query Processor, Document Retriever, Response Generator.
CONTEXT_RELEVANCE
Context Relevance. Required RAG components: Query Processor, Document Retriever.
RAGEvaluationInput
Input passed to the RAG evaluation harness.
Arguments:
queries
: The queries passed to the RAG pipeline.
ground_truth_documents
: The ground truth documents passed to the evaluation pipeline. Only required for metrics that use them; one list of documents per query.
ground_truth_answers
: The ground truth answers passed to the evaluation pipeline. Only required for metrics that use them; one answer per query.
rag_pipeline_inputs
: Additional inputs to pass to the RAG pipeline. Each key is a component name and its value a dictionary mapping an input name to a list of values, one per query.
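A sketch of the harness input; the ground-truth fields are only needed by metrics that use them, and all lists are aligned one-to-one with `queries`:

```python
from haystack import Document
from haystack_experimental.evaluation.harness.rag.parameters import RAGEvaluationInput

eval_input = RAGEvaluationInput(
    queries=["What is RAG?", "Who maintains Haystack?"],
    # One list of ground-truth documents per query.
    ground_truth_documents=[
        [Document(content="RAG combines retrieval with generation.")],
        [Document(content="Haystack is maintained by deepset.")],
    ],
    # One ground-truth answer per query.
    ground_truth_answers=[
        "Retrieval-augmented generation.",
        "deepset",
    ],
)
```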
RAGEvaluationOverrides
Overrides for a RAG evaluation run.
Used to override the init parameters of components in either (or both) the evaluated and evaluation pipelines.
Arguments:
rag_pipeline
: Overrides for the RAG pipeline. Each key is a component name and its value a dictionary with init parameters to override.
eval_pipeline
: Overrides for the evaluation pipeline. Each key is a RAG metric and its value a dictionary with init parameters to override.
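A sketch of a RAG-specific override; note that evaluation-pipeline overrides are keyed by metric rather than by component name. The overridden parameter names and values below are illustrative:

```python
from haystack_experimental.evaluation.harness.rag.parameters import (
    RAGEvaluationMetric,
    RAGEvaluationOverrides,
)

overrides = RAGEvaluationOverrides(
    # "generator" and its "model" init parameter are illustrative.
    rag_pipeline={"generator": {"model": "gpt-4o-mini"}},
    # Keys here are metrics; the overridden init parameter is illustrative.
    eval_pipeline={
        RAGEvaluationMetric.SEMANTIC_ANSWER_SIMILARITY: {
            "model": "sentence-transformers/all-MiniLM-L6-v2"
        }
    },
)
```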
RAGEvaluationOutput
Represents the output of a RAG evaluation run.
Arguments:
evaluated_pipeline
: Serialized version of the evaluated pipeline, including overrides.
evaluation_pipeline
: Serialized version of the evaluation pipeline, including overrides.
inputs
: Input passed to the evaluation harness.
results
: Results of the evaluation run.
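Putting it together, a sketch of launching a run and reading the documented output fields; calling `score_report()` assumes `results` behaves like Haystack's `EvaluationRunResult`, which is an assumption, not a documented guarantee:

```python
# Reusing `harness`, `eval_input`, and `overrides` from the sketches above.
output = harness.run(eval_input, overrides=overrides, run_name="rag_eval_baseline")

# Documented fields of RAGEvaluationOutput.
serialized_rag = output.evaluated_pipeline    # serialized pipeline, incl. overrides
serialized_eval = output.evaluation_pipeline  # serialized evaluation pipeline

# Assumption: `results` behaves like haystack.evaluation.EvaluationRunResult,
# whose score_report() aggregates one score per metric.
print(output.results.score_report())
```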