API Reference

Utility functions for Haystack.

Module doc_store

launch_es

def launch_es(sleep=15,
              delete_existing=False,
              java_opts: Optional[str] = None)

Start an Elasticsearch server via Docker.

launch_opensearch

def launch_opensearch(sleep=15,
                      delete_existing=False,
                      local_port=9200,
                      java_opts: Optional[str] = None)

Start an OpenSearch server via Docker.

launch_weaviate

def launch_weaviate(sleep=15, delete_existing=False)

Start a Weaviate server via Docker.

Module export_utils

print_answers

def print_answers(results: dict,
                  details: str = "all",
                  max_text_len: Optional[int] = None)

Utility function that prints the results of Haystack pipelines.

Arguments:

  • results: Results that the pipeline returned.
  • details: Defines the level of details to print. Possible values: minimum, medium, all.
  • max_text_len: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.

Returns:

None

print_documents

def print_documents(results: dict,
                    max_text_len: Optional[int] = None,
                    print_name: bool = True,
                    print_meta: bool = False)

Utility that prints a compressed representation of the documents returned by a pipeline.

Arguments:

  • max_text_len: Shorten the document's content to a maximum number of characters. When set to None, the document is not shortened.
  • print_name: Whether to print the document's name from the metadata.
  • print_meta: Whether to print the document's metadata.

print_questions

def print_questions(results: dict)

Utility to print the output of a question generating pipeline in a readable format.

export_answers_to_csv

def export_answers_to_csv(agg_results: list, output_file)

Exports answers coming from finder.get_answers() to a CSV file.

Arguments:

  • agg_results: A list of predictions coming from finder.get_answers().
  • output_file: The name of the output file.

Returns:

None

convert_labels_to_squad

def convert_labels_to_squad(labels_file: str)

Convert the export from the labeling UI to the SQuAD format for training.

Arguments:

  • labels_file: The path to the file containing labels.

Module preprocessing

convert_files_to_docs

def convert_files_to_docs(
        dir_path: Optional[str] = None,
        clean_func: Optional[Callable] = None,
        split_paragraphs: bool = False,
        encoding: Optional[str] = None,
        id_hash_keys: Optional[List[str]] = None,
        file_paths: Optional[List[Path]] = None) -> List[Document]

Convert files (.txt, .pdf, .docx) to Documents that can be written to a Document Store.

Files can be specified by giving a directory path, a list of file paths, or both. If a directory path is given then all files with the allowed suffixes in the directory's subdirectories will be converted.

Arguments:

  • dir_path: The path of a directory that contains Files to be converted, including in its subdirectories.
  • clean_func: A custom cleaning function that gets applied to each Document (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraph.
  • encoding: Character encoding to use when converting PDF documents.
  • id_hash_keys: A list of Document attribute names from which the Document ID is hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID is generated from the content and the defined metadata.
  • file_paths: A list of paths of Files to be converted.
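The file selection described above can be sketched with the standard library alone. This only illustrates the documented behavior (a directory is searched including its subdirectories, and explicit paths are added on top); it is not Haystack's implementation. The helper name `collect_files` and the `ALLOWED_SUFFIXES` constant are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Suffixes taken from the description above; the constant itself is hypothetical.
ALLOWED_SUFFIXES = (".txt", ".pdf", ".docx")


def collect_files(
    dir_path: Optional[str] = None,
    file_paths: Optional[Iterable[Path]] = None,
    allowed_suffixes: Iterable[str] = ALLOWED_SUFFIXES,
) -> List[Path]:
    """Hypothetical helper: gather files from a directory tree, an explicit list, or both."""
    allowed = {suffix.lower() for suffix in allowed_suffixes}
    found: List[Path] = []
    if dir_path is not None:
        # rglob("*") walks the directory and all of its subdirectories.
        for path in sorted(Path(dir_path).rglob("*")):
            if path.is_file() and path.suffix.lower() in allowed:
                found.append(path)
    if file_paths is not None:
        # Explicitly listed files are appended as given.
        found.extend(Path(p) for p in file_paths)
    return found
```

In this sketch, files with other suffixes inside the directory are skipped silently, while explicitly listed paths are passed through unchanged.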

tika_convert_files_to_docs

def tika_convert_files_to_docs(
        dir_path: Optional[str] = None,
        clean_func: Optional[Callable] = None,
        split_paragraphs: bool = False,
        merge_short: bool = True,
        merge_lowercase: bool = True,
        id_hash_keys: Optional[List[str]] = None,
        file_paths: Optional[List[Path]] = None) -> List[Document]

Convert files (.txt, .pdf) to Documents that can be written to a Document Store.

Files can be specified by giving a directory path, a list of file paths, or both. If a directory path is given then all files with the allowed suffixes in the directory's subdirectories will be converted.

Arguments:

  • merge_lowercase: Whether to convert merged paragraphs to lowercase.
  • merge_short: Whether to allow merging of short paragraphs.
  • dir_path: The path of a directory that contains Files to be converted, including in its subdirectories.
  • clean_func: A custom cleaning function that gets applied to each doc (input: str, output: str).
  • split_paragraphs: Whether to split text by paragraphs.
  • id_hash_keys: A list of Document attribute names from which the Document ID is hashed. Useful for generating unique IDs even if the Document contents are identical. To ensure you don't have duplicate Documents in your Document Store if texts are not unique, you can modify the metadata and pass ["content", "meta"] to this field. If you do this, the Document ID is generated from the content and the defined metadata.
  • file_paths: A list of paths of Files to be converted.
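The effect of id_hash_keys can be illustrated without Haystack. The sketch below hashes only the attributes that are named, so two Documents with identical text collide when only the content is hashed and get distinct IDs once the metadata is included. The function `make_doc_id` and the use of SHA-256 over JSON are assumptions for illustration; the hash function Haystack uses internally is not described here.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_doc_id(content: str, meta: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Hypothetical stand-in for ID generation: hash only the attributes named in id_hash_keys."""
    keys = id_hash_keys or ["content"]
    attributes = {"content": content, "meta": meta}
    # Serialize the selected attributes deterministically before hashing.
    payload = json.dumps({key: attributes[key] for key in keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


same_text = "Terms and conditions apply."

# Hashing the content only: both files end up with the same ID.
id_a = make_doc_id(same_text, {"name": "a.txt"})
id_b = make_doc_id(same_text, {"name": "b.txt"})

# Hashing content and metadata: the IDs differ because the metadata differs.
id_a_meta = make_doc_id(same_text, {"name": "a.txt"}, ["content", "meta"])
id_b_meta = make_doc_id(same_text, {"name": "b.txt"}, ["content", "meta"])

print(id_a == id_b, id_a_meta == id_b_meta)
```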

Module squad_data

SquadData

class SquadData()

This class is designed to manipulate data that is in the SQuAD format.

SquadData.__init__

def __init__(squad_data)

Arguments:

  • squad_data: SQuAD format data, either as a dictionary with a data key, or just a list of SQuAD documents.

SquadData.merge_from_file

def merge_from_file(filename: str)

Merge the contents of a JSON file in the SQuAD format with the data stored in this object.

SquadData.merge

def merge(new_data: List)

Merge data in SQuAD format with the data stored in this object.

Arguments:

  • new_data: A list of SQuAD document data.

SquadData.from_file

@classmethod
def from_file(cls, filename: str)

Create a SquadData object by providing the name of a JSON file in the SQuAD format.

SquadData.save

def save(filename: str)

Write the data stored in this object to a JSON file.

SquadData.to_document_objs

def to_document_objs()

Export all paragraphs stored in this object to haystack.Document objects.

SquadData.to_label_objs

def to_label_objs(answer_type="generative")

Export all labels stored in this object to haystack.Label objects

SquadData.to_df

@staticmethod
def to_df(data)

Convert a list of SQuAD document dictionaries into a pandas dataframe (each row is one annotation).

SquadData.count

def count(unit="questions")

Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
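To make the units concrete, here is a small sketch that counts them on a hand-written data set in the SQuAD v2 layout (documents, paragraphs, qas, answers). It does not call SquadData; `count_units` is a hypothetical helper, and the counting rules in its comments are assumptions, since the exact rules Haystack applies are not spelled out above. The "answers" unit is left out for that reason.

```python
from typing import Any, Dict, List

context_1 = "Haystack is an open source framework. It is written in Python."
context_2 = "Documents are stored in a Document Store."

# Documents -> paragraphs -> qas -> answers, as in SQuAD v2.
squad_docs: List[Dict[str, Any]] = [
    {
        "title": "Haystack",
        "paragraphs": [
            {
                "context": context_1,
                "qas": [
                    {
                        "id": "q1",
                        "question": "What language is Haystack written in?",
                        "is_impossible": False,
                        "answers": [{"text": "Python", "answer_start": context_1.index("Python")}],
                    },
                    {"id": "q2", "question": "Who founded it?", "is_impossible": True, "answers": []},
                ],
            },
            {
                "context": context_2,
                "qas": [
                    {
                        "id": "q3",
                        "question": "Where are documents stored?",
                        "is_impossible": False,
                        "answers": [
                            {"text": "a Document Store", "answer_start": context_2.index("a Document Store")},
                            {"text": "Document Store", "answer_start": context_2.index("Document Store")},
                        ],
                    }
                ],
            },
        ],
    }
]


def count_units(data: List[Dict[str, Any]], unit: str = "questions") -> int:
    """Hypothetical counter for a list of SQuAD document dictionaries."""
    paragraphs = [par for doc in data for par in doc["paragraphs"]]
    questions = [qa for par in paragraphs for qa in par["qas"]]
    counts = {
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        # Assumption: a question without any answer entry counts as "no answer".
        "no_answers": sum(1 for qa in questions if not qa["answers"]),
        # Assumption: every answer entry counts as one span answer.
        "span_answers": sum(len(qa["answers"]) for qa in questions),
    }
    return counts[unit]


for unit in ("paragraphs", "questions", "no_answers", "span_answers"):
    print(unit, count_units(squad_docs, unit))
```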

SquadData.df_to_data

@classmethod
def df_to_data(cls, df)

Convert a data frame into the SQuAD format data (list of SQuAD document dictionaries).

SquadData.sample_questions

def sample_questions(n)

Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries). Note that if the same question is asked on multiple different passages, this function treats that as a single question.

SquadData.get_all_paragraphs

def get_all_paragraphs()

Return all paragraph strings.

SquadData.get_all_questions

def get_all_questions()

Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.
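The duplicate behavior noted above is easy to see on plain SQuAD-style dictionaries. The sketch below is not SquadData code; `all_question_strings` is a hypothetical helper that walks the nested structure and keeps every occurrence, and the deduplicated list next to it shows what treating repeated questions as a single question would look like.

```python
from typing import Any, Dict, List

# The same question is asked on two different paragraphs.
docs: List[Dict[str, Any]] = [
    {
        "title": "Capitals",
        "paragraphs": [
            {"context": "Paris is the capital of France.",
             "qas": [{"id": "1", "question": "What is the capital?", "answers": []}]},
            {"context": "Rome is the capital of Italy.",
             "qas": [{"id": "2", "question": "What is the capital?", "answers": []},
                     {"id": "3", "question": "Which country is Rome in?", "answers": []}]},
        ],
    }
]


def all_question_strings(data: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: one entry per question occurrence, duplicates included."""
    return [qa["question"] for doc in data for par in doc["paragraphs"] for qa in par["qas"]]


questions = all_question_strings(docs)
# dict.fromkeys keeps the first occurrence of each string and preserves order.
unique_questions = list(dict.fromkeys(questions))

print(len(questions), len(unique_questions))
```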

SquadData.get_all_document_titles

def get_all_document_titles()

Return all document title strings.

Module early_stopping

EarlyStopping

class EarlyStopping()

An object you can use to control early stopping with a Node's train() method or a Trainer class. You can use a custom EarlyStopping class instead, as long as it implements the method check_stopping() and provides the attribute save_dir.

EarlyStopping.__init__

def __init__(head: int = 0,
             metric: Union[str, Callable] = "loss",
             save_dir: Optional[str] = None,
             mode: Literal["min", "max"] = "min",
             patience: int = 0,
             min_delta: float = 0.001,
             min_evals: int = 0)

Arguments:

  • head: The index of the prediction head that you are evaluating to determine the chosen metric. In Haystack, the large majority of the models are trained from the loss signal of a single prediction head so the default value of 0 should work in most cases.
  • save_dir: The directory where to save the final best model. If you set it to None, the model is not saved.
  • metric: The name of a dev set metric to monitor (default: loss) which is extracted from the prediction head specified by the variable head, or a function that extracts a value from the trainer dev evaluation result. For FARMReader training, some available metrics to choose from are "EM", "f1", and "top_n_accuracy". For DensePassageRetriever training, some available metrics to choose from are "acc", "f1", and "average_rank". NOTE: This is different from the metric that is specified in the Processor which defines how to calculate one or more evaluation metric values from the prediction and target sets. The metric variable in this function specifies the name of one particular metric value, or it is a method to calculate a value from the result returned by the Processor metric.
  • mode: When set to "min", training stops if the metric does not continue to decrease. When set to "max", training stops if the metric does not continue to increase.
  • patience: How many evaluations with no improvement to perform before stopping training.
  • min_delta: Minimum difference to the previous best value to count as an improvement.
  • min_evals: Minimum number of evaluations to perform before checking that the evaluation metric is improving.

EarlyStopping.check_stopping

def check_stopping(eval_result: List[Dict]) -> Tuple[bool, bool, float]

Checks the evaluation value of the current evaluation and decides whether training should stop. The first element of the returned tuple is True if stopping should occur.

This saves the model if you provided self.save_dir when initializing EarlyStopping.

Arguments:

  • eval_result: The current evaluation result which consists of a list of dictionaries, one for each prediction head. Each dictionary contains the metrics and reports generated during evaluation.

Returns:

A tuple (stopprocessing, savemodel, eval_value) indicating if processing should be stopped and if the current model should get saved and the evaluation value used.
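As noted for the class, a custom object only needs a check_stopping() method and a save_dir attribute. The following is a minimal sketch of such an object, not the library's implementation: the class name `PatienceStopping` is hypothetical, and the way patience, min_delta, and min_evals interact here is an assumption based on the parameter descriptions above.

```python
from typing import Dict, List, Optional, Tuple


class PatienceStopping:
    """Hypothetical custom early-stopping object exposing check_stopping() and save_dir."""

    def __init__(self, metric: str = "loss", mode: str = "min", patience: int = 0,
                 min_delta: float = 0.001, min_evals: int = 0, head: int = 0,
                 save_dir: Optional[str] = None):
        self.metric = metric
        self.mode = mode
        self.patience = patience
        self.min_delta = min_delta
        self.min_evals = min_evals
        self.head = head
        self.save_dir = save_dir
        self.best: Optional[float] = None
        self.evals_without_improvement = 0
        self.n_evals = 0

    def check_stopping(self, eval_result: List[Dict]) -> Tuple[bool, bool, float]:
        self.n_evals += 1
        eval_value = float(eval_result[self.head][self.metric])
        # Flip the sign in "max" mode so that "smaller is better" holds in both modes.
        signed = eval_value if self.mode == "min" else -eval_value
        # An improvement must beat the best value so far by more than min_delta.
        improved = self.best is None or (self.best - signed) > self.min_delta
        if improved:
            self.best = signed
            self.evals_without_improvement = 0
        else:
            self.evals_without_improvement += 1
        # Assumption: never stop before min_evals evaluations have been seen.
        do_stop = (self.n_evals > self.min_evals
                   and self.evals_without_improvement > self.patience)
        # Assumption: only ask for a save when a target directory is configured.
        save_model = improved and self.save_dir is not None
        return do_stop, save_model, eval_value


stopper = PatienceStopping(metric="loss", patience=1, save_dir="checkpoints")
history = [stopper.check_stopping([{"loss": loss}]) for loss in (1.0, 0.9, 0.9005, 0.9001)]
print(history)
```

With patience=1, the run above tolerates one evaluation without improvement and signals a stop on the second one.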

Module cleaning

clean_wiki_text

def clean_wiki_text(text: str) -> str

Clean Wikipedia text by removing multiple new lines, removing extremely short lines, adding paragraph breaks, and removing empty paragraphs.

Module context_matching

calculate_context_similarity

def calculate_context_similarity(context: str,
                                 candidate: str,
                                 min_length: int = 100,
                                 boost_split_overlaps: bool = True) -> float

Calculates the text similarity score of context and candidate.

The score's value ranges between 0.0 and 100.0.

Arguments:

  • context: The context to match.
  • candidate: The candidate to match the context.
  • min_length: The minimum string length context and candidate need to have in order to be scored. Returns 0.0 otherwise.
  • boost_split_overlaps: Whether to boost split overlaps (e.g. [AB] <-> [BC]) that result from different preprocessing params. If we detect that the score is near a half match and the matching part of the candidate is at its boundaries we cut the context on the same side, recalculate the score and take the mean of both. Thus [AB] <-> [BC] (score ~50) gets recalculated with B <-> B (score ~100) scoring ~75 in total.

match_context

def match_context(
        context: str,
        candidates: Generator[Tuple[str, str], None, None],
        threshold: float = 65.0,
        show_progress: bool = False,
        num_processes: Optional[int] = None,
        chunksize: int = 1,
        min_length: int = 100,
        boost_split_overlaps: bool = True) -> List[Tuple[str, float]]

Matches the context against multiple candidates. Candidates consist of a tuple of an id and its text.

Returns a list of the candidate ids and their scores, filtered by the threshold and sorted by score in descending order.

Arguments:

  • context: The context to match.
  • candidates: The candidates to match the context. A candidate consists of a tuple of candidate id and candidate text.
  • threshold: Score threshold that candidates must surpass to be included into the result list.
  • show_progress: Whether to show the progress of matching all candidates.
  • num_processes: The number of processes to be used for matching in parallel.
  • chunksize: The chunksize used during parallel processing. If not specified chunksize is 1. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
  • min_length: The minimum string length context and candidate need to have in order to be scored. Returns 0.0 otherwise.
  • boost_split_overlaps: Whether to boost split overlaps (e.g. [AB] <-> [BC]) that result from different preprocessing params. If we detect that the score is near a half match and the matching part of the candidate is at its boundaries we cut the context on the same side, recalculate the score and take the mean of both. Thus [AB] <-> [BC] (score ~50) gets recalculated with B <-> B (score ~100) scoring ~75 in total.
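The filter-and-sort behavior described above can be sketched as follows. The scorer is a stand-in built on difflib from the standard library, which is an assumption made so that the example runs without extra dependencies; it is not the scorer Haystack uses, and it ignores boost_split_overlaps. The names `similarity` and `match` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def similarity(context: str, candidate: str, min_length: int = 100) -> float:
    """Stand-in scorer on a 0.0 to 100.0 scale (difflib is an assumption, not Haystack's scorer)."""
    # Strings shorter than min_length are not scored at all.
    if len(context) < min_length or len(candidate) < min_length:
        return 0.0
    return 100.0 * SequenceMatcher(None, context, candidate).ratio()


def match(context: str, candidates: Iterable[Tuple[str, str]],
          threshold: float = 65.0, min_length: int = 100) -> List[Tuple[str, float]]:
    """Score every (id, text) candidate, keep those above the threshold, best first."""
    scored = ((cid, similarity(context, text, min_length)) for cid, text in candidates)
    kept = [(cid, score) for cid, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


context = "abc" * 40  # 120 characters, long enough to be scored
candidates = [
    ("too_short", "abc"),          # below min_length, scores 0.0
    ("unrelated", "xyz" * 40),     # no characters in common, scores 0.0
    ("exact", "abc" * 40),         # identical text, scores 100.0
]
print(match(context, iter(candidates)))
```

Because candidates are consumed as an iterable, a generator over a large document collection can be passed without materializing it first.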

match_contexts

def match_contexts(
        contexts: List[str],
        candidates: Generator[Tuple[str, str], None, None],
        threshold: float = 65.0,
        show_progress: bool = False,
        num_processes: Optional[int] = None,
        chunksize: int = 1,
        min_length: int = 100,
        boost_split_overlaps: bool = True) -> List[List[Tuple[str, float]]]

Matches the contexts against multiple candidates. Candidates consist of a tuple of an id and its string text.

This method iterates over candidates only once.

Returns, for each context, a list of the candidate ids and their scores, filtered by the threshold and sorted by score in descending order.

Arguments:

  • contexts: The contexts to match.
  • candidates: The candidates to match the context. A candidate consists of a tuple of candidate id and candidate text.
  • threshold: Score threshold that candidates must surpass to be included into the result list.
  • show_progress: Whether to show the progress of matching all candidates.
  • num_processes: The number of processes to be used for matching in parallel.
  • chunksize: The chunksize used during parallel processing. If not specified chunksize is 1. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
  • min_length: The minimum string length context and candidate need to have in order to be scored. Returns 0.0 otherwise.
  • boost_split_overlaps: Whether to boost split overlaps (e.g. [AB] <-> [BC]) that result from different preprocessing params. If we detect that the score is near a half match and the matching part of the candidate is at its boundaries we cut the context on the same side, recalculate the score and take the mean of both. Thus [AB] <-> [BC] (score ~50) gets recalculated with B <-> B (score ~100) scoring ~75 in total.

Module deepsetcloud

DeepsetCloudError

class DeepsetCloudError(Exception)

Raised when there is an error communicating with deepset Cloud.

DeepsetCloudClient

class DeepsetCloudClient()

DeepsetCloudClient.__init__

def __init__(api_key: Optional[str] = None,
             api_endpoint: Optional[str] = None)

A client to communicate with deepset Cloud.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from DEEPSET_CLOUD_API_ENDPOINT environment variable. If environment variable is not set, defaults to 'https://api.cloud.deepset.ai/api/v1'.
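The fallback order described for api_key and api_endpoint (explicit argument, then environment variable, then the documented default) can be sketched like this. The helper `resolve_client_settings` is hypothetical and only mirrors the description above; it does not create a client or contact deepset Cloud.

```python
import os
from typing import Mapping, Optional, Tuple

# Default endpoint as stated in the parameter description above.
DEFAULT_API_ENDPOINT = "https://api.cloud.deepset.ai/api/v1"


def resolve_client_settings(
    api_key: Optional[str] = None,
    api_endpoint: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], str]:
    """Hypothetical helper: explicit arguments win, then environment variables, then the default."""
    key = api_key or env.get("DEEPSET_CLOUD_API_KEY")
    endpoint = api_endpoint or env.get("DEEPSET_CLOUD_API_ENDPOINT", DEFAULT_API_ENDPOINT)
    return key, endpoint


# A plain dictionary stands in for the process environment in this example.
fake_env = {"DEEPSET_CLOUD_API_KEY": "key-from-env"}
print(resolve_client_settings(env=fake_env))
```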

IndexClient

class IndexClient()

IndexClient.__init__

def __init__(client: DeepsetCloudClient,
             workspace: Optional[str] = None,
             index: Optional[str] = None)

A client to communicate with deepset Cloud indexes.

Arguments:

  • client: deepset Cloud client
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • index: index in deepset Cloud workspace

PipelineClient

class PipelineClient()

PipelineClient.__init__

def __init__(client: DeepsetCloudClient,
             workspace: Optional[str] = None,
             pipeline_config_name: Optional[str] = None)

A client to communicate with deepset Cloud pipelines.

Arguments:

  • client: deepset Cloud client
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • pipeline_config_name: Name of the pipeline_config in deepset Cloud workspace.

PipelineClient.get_pipeline_config

def get_pipeline_config(workspace: Optional[str] = None,
                        pipeline_config_name: Optional[str] = None,
                        headers: Optional[dict] = None) -> dict

Gets the config from a pipeline on deepset Cloud.

Arguments:

  • pipeline_config_name: Name of the pipeline_config in deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call.

PipelineClient.get_pipeline_config_info

def get_pipeline_config_info(workspace: Optional[str] = None,
                             pipeline_config_name: Optional[str] = None,
                             headers: Optional[dict] = None) -> Optional[dict]

Gets information about a pipeline on deepset Cloud.

Arguments:

  • pipeline_config_name: Name of the pipeline_config in deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call.

PipelineClient.list_pipeline_configs

def list_pipeline_configs(workspace: Optional[str] = None,
                          headers: Optional[dict] = None) -> Generator

Lists all pipelines available on deepset Cloud.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud.

  • headers: Headers to pass to the API call.

Returns:

Generator of dictionaries (List[dict]). Each dictionary has the form { "name": str -> pipeline_config_name to be used in load_from_deepset_cloud(), "..." -> additional pipeline meta information }. Example:

    ```python
    [{'name': 'my_super_nice_pipeline_config',
      'pipeline_id': '2184e0c1-c6ec-40a1-9b28-5d2768e5efa2',
      'status': 'DEPLOYED',
      'created_at': '2022-02-01T09:57:03.803991+00:00',
      'deleted': False,
      'is_default': False,
      'indexing': {'status': 'IN_PROGRESS',
                   'pending_file_count': 3,
                   'total_file_count': 31}}]
    ```
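Since the result is a generator of plain dictionaries shaped like the example above, it can be filtered lazily. The sketch below works on hand-written dictionaries rather than a live API response; `deployed_names` is a hypothetical helper, and the second entry (including its status value) is made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def deployed_names(configs: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Hypothetical helper: yield the names of pipeline configs whose status is DEPLOYED."""
    for config in configs:
        if config.get("status") == "DEPLOYED":
            yield config["name"]


sample_configs = [
    {"name": "my_super_nice_pipeline_config", "status": "DEPLOYED",
     "indexing": {"status": "IN_PROGRESS", "pending_file_count": 3, "total_file_count": 31}},
    # Made-up entry; the status string is an assumption for this example.
    {"name": "draft_pipeline_config", "status": "UNDEPLOYED"},
]

print(list(deployed_names(iter(sample_configs))))
```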
    

PipelineClient.save_pipeline_config

def save_pipeline_config(config: dict,
                         pipeline_config_name: Optional[str] = None,
                         workspace: Optional[str] = None,
                         headers: Optional[dict] = None)

Saves a pipeline config to deepset Cloud.

Arguments:

  • config: The pipeline config to save.
  • pipeline_config_name: Name of the pipeline_config in deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call.

PipelineClient.update_pipeline_config

def update_pipeline_config(config: dict,
                           pipeline_config_name: Optional[str] = None,
                           workspace: Optional[str] = None,
                           headers: Optional[dict] = None)

Updates a pipeline config on deepset Cloud.

Arguments:

  • config: The pipeline config to save.
  • pipeline_config_name: Name of the pipeline_config in deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call.

PipelineClient.deploy

def deploy(pipeline_config_name: Optional[str] = None,
           workspace: Optional[str] = None,
           headers: Optional[dict] = None,
           timeout: int = 60,
           show_curl_message: bool = True)

Deploys the pipelines of a pipeline config on deepset Cloud.

Blocks until the pipelines are successfully deployed, the deployment fails, or the timeout is exceeded. If the pipelines are already deployed, no action is taken and an info message is logged. If the timeout is exceeded, a TimeoutError is raised. If the deployment fails, a DeepsetCloudError is raised.

Arguments:

  • pipeline_config_name: Name of the config file inside the deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call.
  • timeout: The time in seconds to wait until deployment completes. If the timeout is exceeded an error will be raised.
  • show_curl_message: Whether to print an additional message after successful deployment showing how to query the pipeline using curl.
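The blocking behavior described above (wait until deployed, fail on error, give up after the timeout) follows a common polling pattern, sketched here with the standard library. This is not the client's code: `wait_until_deployed`, `DeploymentFailed`, the `poll_interval` parameter, and the "FAILED" status string are assumptions for illustration, with `DeploymentFailed` standing in for DeepsetCloudError.

```python
import time
from typing import Callable


class DeploymentFailed(Exception):
    """Hypothetical stand-in for DeepsetCloudError in this sketch."""


def wait_until_deployed(get_status: Callable[[], str], timeout: float = 60,
                        poll_interval: float = 1.0) -> None:
    """Poll get_status() until it reports DEPLOYED, a failure, or the timeout is exceeded."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "DEPLOYED":
            return
        if status == "FAILED":  # assumed status value
            raise DeploymentFailed("deployment failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pipeline not deployed after {timeout} seconds")
        time.sleep(poll_interval)


# Simulated status sequence instead of real API calls.
states = iter(["IN_PROGRESS", "IN_PROGRESS", "DEPLOYED"])
wait_until_deployed(lambda: next(states), timeout=5, poll_interval=0)
print("deployed")
```

Using time.monotonic() for the deadline keeps the timeout unaffected by changes to the system clock.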

PipelineClient.undeploy

def undeploy(pipeline_config_name: Optional[str] = None,
             workspace: Optional[str] = None,
             headers: Optional[dict] = None,
             timeout: int = 60)

Undeploys the pipelines of a pipeline config on deepset Cloud.

Blocks until the pipelines are successfully undeployed, the undeployment fails, or the timeout is exceeded. If the pipelines are already undeployed, no action is taken and an info message is logged. If the timeout is exceeded, a TimeoutError is raised. If the undeployment fails, a DeepsetCloudError is raised.

Arguments:

  • pipeline_config_name: Name of the config file inside the deepset Cloud workspace.
  • workspace: Specifies the name of the workspace on deepset Cloud.
  • headers: Headers to pass to the API call
  • timeout: The time in seconds to wait until undeployment completes. If the timeout is exceeded an error will be raised.

EvaluationSetClient

class EvaluationSetClient()

EvaluationSetClient.__init__

def __init__(client: DeepsetCloudClient,
             workspace: Optional[str] = None,
             evaluation_set: Optional[str] = None)

A client to communicate with deepset Cloud evaluation sets and labels.

Arguments:

  • client: deepset Cloud client
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • evaluation_set: name of the evaluation set to fall back to

EvaluationSetClient.get_labels

def get_labels(evaluation_set: Optional[str],
               workspace: Optional[str] = None) -> List[Label]

Searches for labels for a given evaluation set in deepset Cloud. Returns a list of all found labels.

If no labels were found, raises DeepsetCloudError.

Arguments:

  • evaluation_set: name of the evaluation set for which labels should be fetched
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationSetClient's default workspace (self.workspace) is used.

Returns:

list of Label

EvaluationSetClient.get_labels_count

def get_labels_count(evaluation_set: Optional[str] = None,
                     workspace: Optional[str] = None) -> int

Counts labels for a given evaluation set in deepset Cloud.

Arguments:

  • evaluation_set: Optional evaluation set in deepset Cloud. If set to None, the EvaluationSetClient's default evaluation set (self.evaluation_set) is used.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationSetClient's default workspace (self.workspace) is used.

Returns:

Number of labels for the given (or default) evaluation set

EvaluationSetClient.get_evaluation_sets

def get_evaluation_sets(workspace: Optional[str] = None) -> List[dict]

Searches for all evaluation set names in the given workspace in deepset Cloud.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationSetClient's default workspace (self.workspace) is used.

Returns:

List of dictionaries that represent deepset Cloud evaluation sets. These contain ("name", "evaluation_set_id", "created_at", "matched_labels", "total_labels") as fields.

EvaluationSetClient.upload_evaluation_set

def upload_evaluation_set(file_path: Path, workspace: Optional[str] = None)

Uploads an evaluation set.

The name of the file that you upload becomes the name of the evaluation set in deepset Cloud. When using the Haystack annotation tool, make sure to choose CSV as the export format. The resulting file matches the expected format.

Currently, deepset Cloud only supports CSV files (having "," as delimiter) with the following columns:

  • question (or query): the labelled question or query (required)
  • text: the answer to the question or relevant text to the query (required)
  • context: the surrounding words of the text (should be more than 100 characters) (optional)
  • file_name: the name of the file within the workspace that contains the text (optional)
  • answer_start: the character position within the file that marks the start of the text (optional)
  • answer_end: the character position within the file that marks the end of the text (optional)

Arguments:

  • file_path: Path to the evaluation set file to be uploaded.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationSetClient's default workspace (self.workspace) is used.
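A file with the columns listed above can be produced with Python's csv module. The sketch below is a hedged example of building such a file in memory; `build_evaluation_csv` and the sample row are hypothetical, and the check for required fields simply mirrors the "required" markers in the column list.

```python
import csv
import io
from typing import Dict, Iterable

# Column names as listed above ("question" could also be "query").
COLUMNS = ["question", "text", "context", "file_name", "answer_start", "answer_end"]


def build_evaluation_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Hypothetical helper: render label rows as comma-delimited CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter=",", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # question and text are marked as required; the other columns are optional.
        if not row.get("question") or not row.get("text"):
            raise ValueError("each row needs a question and a text")
        writer.writerow({column: row.get(column, "") for column in COLUMNS})
    return buffer.getvalue()


csv_text = build_evaluation_csv([{"question": "Who wrote it?", "text": "Ada"}])
print(csv_text)
```

The returned string can be written to a .csv file, whose name then becomes the name of the evaluation set as described above.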

EvaluationSetClient.get_evaluation_set

def get_evaluation_set(
        evaluation_set: Optional[str] = None,
        workspace: Optional[str] = None) -> Optional[Dict[str, Any]]

Returns information about the evaluation set.

Arguments:

  • evaluation_set: Name of the evaluation set in deepset Cloud. If set to None, the EvaluationSetClient's default evaluation set (self.evaluation_set) is used.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationSetClient's default workspace (self.workspace) is used.

Returns:

Dictionary that represents deepset Cloud evaluation sets. These contain ("name", "evaluation_set_id", "created_at", "matched_labels", "total_labels") as fields.

FileClient

class FileClient()

FileClient.__init__

def __init__(client: DeepsetCloudClient, workspace: Optional[str] = None)

A client to manage files on deepset Cloud.

Arguments:

  • client: deepset Cloud client
  • workspace: Specifies the name of the workspace for which you want to create the client.

FileClient.upload_files

def upload_files(file_paths: List[Path],
                 metas: Optional[List[Dict]] = None,
                 workspace: Optional[str] = None,
                 headers: Optional[dict] = None)

Uploads files to the deepset Cloud workspace.

Arguments:

  • file_paths: File paths to upload (for example .txt or .pdf files)
  • metas: Metadata of the files to upload
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the FileClient's default workspace is used.
  • headers: Headers to pass to the API call

FileClient.delete_file

def delete_file(file_id: str,
                workspace: Optional[str] = None,
                headers: Optional[dict] = None)

Delete a file from the deepset Cloud workspace.

Arguments:

  • file_id: The id of the file to be deleted. Use list_files to retrieve the id of a file.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the FileClient's default workspace is used.
  • headers: Headers to pass to the API call

FileClient.delete_all_files

def delete_all_files(workspace: Optional[str] = None,
                     headers: Optional[dict] = None)

Delete all files from a deepset Cloud workspace.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the FileClient's default workspace is used.
  • headers: Headers to pass to the API call.

FileClient.list_files

def list_files(name: Optional[str] = None,
               meta_key: Optional[str] = None,
               meta_value: Optional[str] = None,
               workspace: Optional[str] = None,
               headers: Optional[dict] = None) -> Generator

List all files in the given deepset Cloud workspace.

You can filter by name or by meta values.

Arguments:

  • name: The name or part of the name of the file.
  • meta_key: The key of the metadata of the file to be filtered for.
  • meta_value: The value of the metadata of the file to be filtered for.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the FileClient's default workspace is used.
  • headers: Headers to pass to the API call

EvaluationRunClient

class EvaluationRunClient()

EvaluationRunClient.__init__

def __init__(client: DeepsetCloudClient, workspace: Optional[str] = None)

A client to manage deepset Cloud evaluation runs.

Arguments:

  • client: deepset Cloud client
  • workspace: Specifies the name of the workspace for which you want to create the client.

EvaluationRunClient.create_eval_run

def create_eval_run(eval_run_name: str,
                    workspace: Optional[str] = None,
                    pipeline_config_name: Optional[str] = None,
                    headers: Optional[dict] = None,
                    evaluation_set: Optional[str] = None,
                    eval_mode: Literal["integrated",
                                       "isolated"] = "integrated",
                    debug: bool = False,
                    comment: Optional[str] = None,
                    tags: Optional[List[str]] = None) -> Dict[str, Any]

Creates an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • pipeline_config_name: The name of the pipeline to evaluate.
  • evaluation_set: The name of the evaluation set to use.
  • eval_mode: The evaluation mode to use.
  • debug: Whether to enable debug output.
  • comment: Comment to add about the evaluation run.
  • tags: Tags to add to the evaluation run.
  • headers: Headers to pass to the API call

EvaluationRunClient.get_eval_run

def get_eval_run(eval_run_name: str,
                 workspace: Optional[str] = None,
                 headers: Optional[dict] = None) -> Dict[str, Any]

Gets the evaluation run and shows its parameters and metrics.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: Headers to pass to the API call

EvaluationRunClient.get_eval_runs

def get_eval_runs(workspace: Optional[str] = None,
                  headers: Optional[dict] = None) -> List[Dict[str, Any]]

Gets all evaluation runs and shows their parameters and metrics.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: Headers to pass to the API call

EvaluationRunClient.delete_eval_run

def delete_eval_run(eval_run_name: str,
                    workspace: Optional[str] = None,
                    headers: Optional[dict] = None)

Deletes an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: Headers to pass to the API call

EvaluationRunClient.start_eval_run

def start_eval_run(eval_run_name: str,
                   workspace: Optional[str] = None,
                   headers: Optional[dict] = None)

Starts an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: Headers to pass to the API call.

EvaluationRunClient.update_eval_run

def update_eval_run(eval_run_name: str,
                    workspace: Optional[str] = None,
                    pipeline_config_name: Optional[str] = None,
                    headers: Optional[dict] = None,
                    evaluation_set: Optional[str] = None,
                    eval_mode: Literal["integrated", "isolated", None] = None,
                    debug: Optional[bool] = None,
                    comment: Optional[str] = None,
                    tags: Optional[List[str]] = None) -> Dict[str, Any]

Updates an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run to update.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • pipeline_config_name: The name of the pipeline to evaluate.
  • evaluation_set: The name of the evaluation set to use.
  • eval_mode: The evaluation mode to use.
  • debug: Whether to enable debug output.
  • comment: A comment to add to the evaluation run.
  • tags: Tags to add to the evaluation run.
  • headers: Headers to pass to the API call.

EvaluationRunClient.get_eval_run_results

def get_eval_run_results(eval_run_name: str,
                         workspace: Optional[str] = None,
                         headers: Optional[dict] = None) -> Dict[str, Any]

Collects and returns the predictions of an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run to fetch results for.
  • workspace: Specifies the name of the deepset Cloud workspace where the evaluation run exists. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: The headers that you want to pass to the API call.

EvaluationRunClient.get_eval_run_predictions

def get_eval_run_predictions(
        eval_run_name: str,
        node_name: str,
        workspace: Optional[str] = None,
        headers: Optional[dict] = None) -> List[Dict[str, Any]]

Fetches predictions for the evaluation run and a node name you specify.

Arguments:

  • eval_run_name: The name of the evaluation run to fetch predictions for.
  • node_name: The name of the node to fetch predictions for.
  • workspace: Specifies the name of the deepset Cloud workspace where the evaluation run exists. If set to None, the EvaluationRunClient's default workspace is used.
  • headers: The headers that you want to pass to the API call.

DeepsetCloud

class DeepsetCloud()

A facade to communicate with deepset Cloud.

DeepsetCloud.get_index_client

@classmethod
def get_index_client(cls,
                     api_key: Optional[str] = None,
                     api_endpoint: Optional[str] = None,
                     workspace: str = "default",
                     index: Optional[str] = None) -> IndexClient

Creates a client to communicate with deepset Cloud indexes.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • index: The name of the index in the deepset Cloud workspace.
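The fallback order described for api_key and api_endpoint can be sketched in plain Python. This is an illustration only, not Haystack's implementation: resolve_credentials and its env parameter are hypothetical names introduced here.

```python
import os
from typing import Mapping, Optional, Tuple

# Default endpoint as stated in the argument description.
DEFAULT_API_ENDPOINT = "https://api.cloud.deepset.ai/api/v1"


def resolve_credentials(
    api_key: Optional[str] = None,
    api_endpoint: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], str]:
    """Hypothetical helper that mirrors the documented fallback order."""
    # An explicit argument wins; otherwise fall back to the environment variable.
    if api_key is None:
        api_key = env.get("DEEPSET_CLOUD_API_KEY")
    # The endpoint has one more fallback: the documented default URL.
    if api_endpoint is None:
        api_endpoint = env.get("DEEPSET_CLOUD_API_ENDPOINT", DEFAULT_API_ENDPOINT)
    return api_key, api_endpoint
```

Passing a dictionary as env makes the rule easy to try out without touching the real environment, for example resolve_credentials(env={}).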

DeepsetCloud.get_pipeline_client

@classmethod
def get_pipeline_client(
        cls,
        api_key: Optional[str] = None,
        api_endpoint: Optional[str] = None,
        workspace: str = "default",
        pipeline_config_name: Optional[str] = None) -> PipelineClient

Creates a client to communicate with deepset Cloud pipelines.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • pipeline_config_name: The name of the pipeline config in the deepset Cloud workspace.

DeepsetCloud.get_evaluation_set_client

@classmethod
def get_evaluation_set_client(
        cls,
        api_key: Optional[str] = None,
        api_endpoint: Optional[str] = None,
        workspace: str = "default",
        evaluation_set: str = "default") -> EvaluationSetClient

Creates a client to communicate with deepset Cloud labels.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
  • workspace: Specifies the name of the workspace for which you want to create the client.
  • evaluation_set: The name of the evaluation set in deepset Cloud.

DeepsetCloud.get_eval_run_client

@classmethod
def get_eval_run_client(cls,
                        api_key: Optional[str] = None,
                        api_endpoint: Optional[str] = None,
                        workspace: str = "default") -> EvaluationRunClient

Creates a client to manage evaluation runs on deepset Cloud.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
  • workspace: Specifies the name of the workspace for which you want to create the client.

DeepsetCloud.get_file_client

@classmethod
def get_file_client(cls,
                    api_key: Optional[str] = None,
                    api_endpoint: Optional[str] = None,
                    workspace: str = "default") -> FileClient

Creates a client to manage files on deepset Cloud.

Arguments:

  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
  • workspace: Specifies the name of the workspace for which you want to create the client.

DeepsetCloudExperiments

class DeepsetCloudExperiments()

A facade to conduct and manage experiments within deepset Cloud.

To start a new experiment run:

  1. Choose a pipeline to evaluate using list_pipelines().
  2. Choose an evaluation set using list_evaluation_sets().
  3. Create and start a new run using create_and_start_run().
  4. Track the run using get_run(). When the run finishes, you can use the eval_results key in the returned dictionary to view the metrics.
  5. Inspect the result of a run in detail using get_run_result(). This returns an EvaluationResult object containing all the predictions and gold labels in the form of pandas DataFrames. Use calculate_metrics() to recalculate metrics with different settings (for example, top_k) and wrong_examples() to show the worst-performing queries and labels.
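The steps above can be sketched as a small polling helper. It is a sketch under assumptions: run_and_wait and is_finished are hypothetical names, and because this reference does not document the status values that get_run() returns, the caller supplies the completion check. The experiments argument is expected to be DeepsetCloudExperiments or any object with the same create_and_start_run() and get_run() methods.

```python
import time
from typing import Any, Callable, Dict


def run_and_wait(
    experiments: Any,
    eval_run_name: str,
    pipeline_config_name: str,
    evaluation_set: str,
    is_finished: Callable[[Dict[str, Any]], bool],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> Dict[str, Any]:
    """Create and start a run, then poll get_run() until is_finished(run) is true."""
    # Step 3 of the workflow: create and start the evaluation run.
    experiments.create_and_start_run(
        eval_run_name=eval_run_name,
        pipeline_config_name=pipeline_config_name,
        evaluation_set=evaluation_set,
    )
    # Step 4: track the run until the caller-supplied check says it is done.
    for _ in range(max_polls):
        run = experiments.get_run(eval_run_name=eval_run_name)
        if is_finished(run):
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(f"Evaluation run {eval_run_name!r} did not finish in time")
```

Keeping the completion check outside the helper avoids hard-coding a status value that this section does not specify. With Haystack installed, you would pass the DeepsetCloudExperiments class itself as experiments, together with names taken from list_pipelines() and list_evaluation_sets().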

DeepsetCloudExperiments.list_pipelines

@classmethod
def list_pipelines(cls,
                   workspace: str = "default",
                   api_key: Optional[str] = None,
                   api_endpoint: Optional[str] = None) -> List[dict]

Lists all pipelines available on deepset Cloud.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

Returns:

A list of dictionaries (List[dict]). Each dictionary contains "name" (str), the pipeline_config_name to use in load_from_deepset_cloud(), plus additional pipeline meta information. Example:

```python
[{'name': 'my_super_nice_pipeline_config',
    'pipeline_id': '2184e0c1-c6ec-40a1-9b28-5d2768e5efa2',
    'status': 'DEPLOYED',
    'created_at': '2022-02-01T09:57:03.803991+00:00',
    'deleted': False,
    'is_default': False,
    'indexing': {'status': 'IN_PROGRESS',
    'pending_file_count': 3,
    'total_file_count': 31}}]
```
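Because the return value is a plain list of dictionaries, it can be filtered with ordinary Python. A minimal sketch, assuming you only want pipelines whose status is 'DEPLOYED' as in the example above; deployed_pipeline_names is a hypothetical helper, and the second sample entry and its status string are placeholders.

```python
from typing import Any, Dict, List


def deployed_pipeline_names(pipelines: List[Dict[str, Any]]) -> List[str]:
    """Return the names of pipelines whose status is 'DEPLOYED'."""
    return [p["name"] for p in pipelines if p.get("status") == "DEPLOYED"]


# Sample data shaped like the documented return value.
pipelines = [
    {"name": "my_super_nice_pipeline_config", "status": "DEPLOYED"},
    # Hypothetical entry; the status string is a placeholder, not a documented value.
    {"name": "another_pipeline_config", "status": "SOME_OTHER_STATUS"},
]
print(deployed_pipeline_names(pipelines))  # ['my_super_nice_pipeline_config']
```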

DeepsetCloudExperiments.list_evaluation_sets

@classmethod
def list_evaluation_sets(cls,
                         workspace: str = "default",
                         api_key: Optional[str] = None,
                         api_endpoint: Optional[str] = None) -> List[dict]

Lists all evaluation sets available on deepset Cloud.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

Returns:

A list of dictionaries (List[dict]). Each dictionary contains "name" (str), the evaluation_set to use in create_run(), plus additional evaluation set meta information. Example:

```python
[{'evaluation_set_id': 'fb084729-57ad-4b57-9f78-ec0eb4d29c9f',
    'name': 'my-question-answering-evaluation-set',
    'created_at': '2022-05-06T09:54:14.830529+00:00',
    'matched_labels': 234,
    'total_labels': 234}]
```
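The matched_labels and total_labels fields can be compared to spot evaluation sets whose labels were not all matched. A minimal sketch, assuming matched_labels is the subset of total_labels that was successfully matched (this section does not define the field); label_coverage is a hypothetical helper.

```python
from typing import Any, Dict, List


def label_coverage(evaluation_sets: List[Dict[str, Any]]) -> Dict[str, float]:
    """Map each evaluation set name to matched_labels / total_labels."""
    coverage = {}
    for eval_set in evaluation_sets:
        total = eval_set["total_labels"]
        # Guard against empty evaluation sets to avoid dividing by zero.
        coverage[eval_set["name"]] = eval_set["matched_labels"] / total if total else 0.0
    return coverage


# Sample entry copied from the documented example output.
sets = [{"name": "my-question-answering-evaluation-set", "matched_labels": 234, "total_labels": 234}]
print(label_coverage(sets))  # {'my-question-answering-evaluation-set': 1.0}
```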

DeepsetCloudExperiments.get_runs

@classmethod
def get_runs(cls,
             workspace: str = "default",
             api_key: Optional[str] = None,
             api_endpoint: Optional[str] = None) -> List[dict]

Gets all evaluation runs.

Arguments:

  • workspace: Specifies the name of the workspace on deepset Cloud.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

Returns:

A list of dictionaries (List[dict]). Example:

```python
[{'eval_run_name': 'my-eval-run-1',
    'parameters': {
        'pipeline_name': 'my-pipeline-1_696bc5d0-ee65-46c1-a308-059507bc353b',
        'evaluation_set_name': 'my-eval-set-name',
        'debug': False,
        'eval_mode': 0
    },
    'metrics': {
        'isolated_exact_match': 0.45,
        'isolated_f1': 0.89,
        'isolated_sas': 0.91,
        'integrated_exact_match': 0.39,
        'integrated_f1': 0.76,
        'integrated_sas': 0.78,
        'mean_reciprocal_rank': 0.77,
        'mean_average_precision': 0.78,
        'recall_single_hit': 0.91,
        'recall_multi_hit': 0.91,
        'normal_discounted_cummulative_gain': 0.83,
        'precision': 0.52
    },
    'logs': {},
    'status': 1,
    'eval_mode': 0,
    'eval_run_labels': [],
    'created_at': '2022-05-24T12:13:16.445857+00:00',
    'comment': 'This is a comment about this eval run',
    'tags': ['experiment-1', 'experiment-2', 'experiment-3']
    }]
```
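With several runs in the list, the metrics dictionary lets you compare them directly. A minimal sketch, assuming each run follows the shape of the example above; best_run is a hypothetical helper and the default metric name is taken from that example.

```python
from typing import Any, Dict, List, Optional


def best_run(runs: List[Dict[str, Any]], metric: str = "integrated_f1") -> Optional[Dict[str, Any]]:
    """Return the run with the highest value for metric, ignoring runs that lack it."""
    # Runs that have not produced this metric yet are skipped.
    scored = [r for r in runs if metric in (r.get("metrics") or {})]
    return max(scored, key=lambda r: r["metrics"][metric], default=None)


# Hypothetical sample data shaped like the documented return value.
runs = [
    {"eval_run_name": "my-eval-run-1", "metrics": {"integrated_f1": 0.76}},
    {"eval_run_name": "my-eval-run-2", "metrics": {"integrated_f1": 0.81}},
    {"eval_run_name": "my-eval-run-3", "metrics": {}},
]
print(best_run(runs)["eval_run_name"])  # my-eval-run-2
```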

DeepsetCloudExperiments.create_run

@classmethod
def create_run(cls,
               eval_run_name: str,
               workspace: str = "default",
               api_key: Optional[str] = None,
               api_endpoint: Optional[str] = None,
               pipeline_config_name: Optional[str] = None,
               evaluation_set: Optional[str] = None,
               eval_mode: Literal["integrated", "isolated"] = "integrated",
               debug: bool = False,
               comment: Optional[str] = None,
               tags: Optional[List[str]] = None) -> Dict[str, Any]

Creates an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • pipeline_config_name: The name of the pipeline to evaluate. Use list_pipelines() to list all available pipelines.
  • evaluation_set: The name of the evaluation set to use. Use list_evaluation_sets() to list all available evaluation sets.
  • eval_mode: The evaluation mode to use.
  • debug: Whether to enable debug output.
  • comment: A comment to add to the evaluation run.
  • tags: Tags to add to the evaluation run.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.update_run

@classmethod
def update_run(cls,
               eval_run_name: str,
               workspace: str = "default",
               api_key: Optional[str] = None,
               api_endpoint: Optional[str] = None,
               pipeline_config_name: Optional[str] = None,
               evaluation_set: Optional[str] = None,
               eval_mode: Literal["integrated", "isolated"] = "integrated",
               debug: bool = False,
               comment: Optional[str] = None,
               tags: Optional[List[str]] = None) -> Dict[str, Any]

Updates an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run to update.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • pipeline_config_name: The name of the pipeline to evaluate. Use list_pipelines() to list all available pipelines.
  • evaluation_set: The name of the evaluation set to use. Use list_evaluation_sets() to list all available evaluation sets.
  • eval_mode: The evaluation mode to use.
  • debug: Whether to enable debug output.
  • comment: A comment to add to the evaluation run.
  • tags: Tags to add to the evaluation run.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.get_run

@classmethod
def get_run(cls,
            eval_run_name: str,
            workspace: str = "default",
            api_key: Optional[str] = None,
            api_endpoint: Optional[str] = None) -> Dict[str, Any]

Gets the evaluation run and shows its parameters and metrics.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.delete_run

@classmethod
def delete_run(cls,
               eval_run_name: str,
               workspace: str = "default",
               api_key: Optional[str] = None,
               api_endpoint: Optional[str] = None)

Deletes an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.start_run

@classmethod
def start_run(cls,
              eval_run_name: str,
              workspace: str = "default",
              api_key: Optional[str] = None,
              api_endpoint: Optional[str] = None)

Starts an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.create_and_start_run

@classmethod
def create_and_start_run(cls,
                         eval_run_name: str,
                         workspace: str = "default",
                         api_key: Optional[str] = None,
                         api_endpoint: Optional[str] = None,
                         pipeline_config_name: Optional[str] = None,
                         evaluation_set: Optional[str] = None,
                         eval_mode: Literal["integrated",
                                            "isolated"] = "integrated",
                         debug: bool = False,
                         comment: Optional[str] = None,
                         tags: Optional[List[str]] = None)

Creates and starts an evaluation run.

Arguments:

  • eval_run_name: The name of the evaluation run.
  • workspace: Specifies the name of the workspace on deepset Cloud. If set to None, the EvaluationRunClient's default workspace is used.
  • pipeline_config_name: The name of the pipeline to evaluate. Use list_pipelines() to list all available pipelines.
  • evaluation_set: The name of the evaluation set to use. Use list_evaluation_sets() to list all available evaluation sets.
  • eval_mode: The evaluation mode to use.
  • debug: Whether to enable debug output.
  • comment: A comment to add to the evaluation run.
  • tags: Tags to add to the evaluation run.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.

DeepsetCloudExperiments.get_run_result

@classmethod
def get_run_result(cls,
                   eval_run_name: str,
                   workspace: str = "default",
                   api_key: Optional[str] = None,
                   api_endpoint: Optional[str] = None) -> EvaluationResult

Fetches the results of an evaluation run and turns them into an EvaluationResult object.

Arguments:

  • eval_run_name: The name of the evaluation run whose results you want to fetch.
  • workspace: Specifies the name of the deepset Cloud workspace where the evaluation run exists. If set to None, the EvaluationRunClient's default workspace is used.
  • api_key: Secret value of the API key. If not specified, it's read from the DEEPSET_CLOUD_API_KEY environment variable.
  • api_endpoint: The URL of the deepset Cloud API. If not specified, it's read from the DEEPSET_CLOUD_API_ENDPOINT environment variable. If the environment variable is not set, it defaults to 'https://api.cloud.deepset.ai/api/v1'.
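Once you have the EvaluationResult, recalculating metrics and listing the worst examples might look like the sketch below. The keyword arguments simulated_top_k_reader, node, and n are assumptions about EvaluationResult in Haystack 1.x and are not documented in this section, so check them against your installed version. inspect_result is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def inspect_result(result: Any, node: str, top_k: int = 3, n: int = 5) -> Tuple[Dict[str, Any], List[Any]]:
    """Recalculate metrics for a smaller top_k and collect the worst examples for one node."""
    # Assumed keyword: simulated_top_k_reader limits the answers considered per query.
    metrics = result.calculate_metrics(simulated_top_k_reader=top_k)
    # Assumed keywords: node selects the pipeline node, n the number of examples.
    worst = result.wrong_examples(node=node, n=n)
    return metrics, worst
```

The result argument would be the object returned by get_run_result(), and node the name of a node in the evaluated pipeline.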

Module docker

cache_models

def cache_models(models: Optional[List[str]] = None,
                 use_auth_token: Optional[Union[str, bool]] = None)

Small function that caches models and other data.

Used only in the Dockerfile to include these caches in the images.

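Arguments:

  • models: The list of model names to cache.
  • use_auth_token: The API token used to download private models from Hugging Face.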

cache_schema

def cache_schema()

Generate and persist Haystack JSON schema.

The schema is lazily generated at first usage, but this might not work in Docker containers when the user running Haystack doesn't have write permissions on the Python installation. By calling this function at Docker image build time, the schema is generated once and for all.