These are the core classes that carry data through the system.
Module schema
ContentTypes
The content types supported for Documents.
Document
@dataclass
class Document()
Document.__init__
def __init__(content: Union[str, DataFrame],
content_type: ContentTypes = "text",
id: Optional[str] = None,
score: Optional[float] = None,
meta: Optional[Dict[str, Any]] = None,
embedding: Optional[ndarray] = None,
id_hash_keys: Optional[List[str]] = None)
One of the core data classes in Haystack. It's used to represent documents / passages in a standardized way within Haystack.
Documents are stored in DocumentStores, are returned by Retrievers, are the input for Readers and are used in
many other places that manipulate or interact with document-level data.
Note: There can be multiple Documents originating from one file (e.g. PDF), if you split the text
into smaller passages. We'll have one Document per passage in this case.
Each document has a unique ID. This can be supplied by the user or generated automatically.
It's particularly helpful for handling duplicates and referencing documents in other objects (e.g. Labels).
There's an easy option to convert from/to dicts via from_dict() and to_dict().
Arguments:
- content: Content of the document. For most cases, this will be text, but it can be a table or image.
- content_type: One of "text", "table", "image" or "audio". Haystack components can use this to adjust their handling of Documents and check compatibility.
- id: Unique ID for the document. If not supplied by the user, we'll generate one automatically by creating a hash from the supplied text. This behaviour can be further adjusted by id_hash_keys.
- score: The relevance score of the Document determined by a model (e.g. Retriever or Re-Ranker). If the model's scale_score was set to True (default), the score is in the unit interval [0, 1], where 1 means extremely relevant.
- meta: Meta fields for a document like name, url, or author in the form of a custom dict (any keys and values allowed).
- embedding: Vector encoding of the text.
- id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore if texts are not unique, modify the metadata and pass, for example, "meta" to this field (example: ["content", "meta"]). In this case, the id is generated from the content and the defined metadata. If you specify a custom ID for the id parameter, the id_hash_keys parameter is ignored and the custom ID is used.
  Note that you can even use nested fields of the meta as id_hash_keys. For example, if you have a key in meta called url and you want to use it as part of the id, you can pass this parameter as ["meta.url"]. Haystack supports a maximum depth of 1. For example, if you use meta.url.path, it looks for the url.path key in the meta dict, i.e. meta['url.path'].
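For illustration, here is a minimal sketch of constructing Documents, assuming the haystack.schema import path; the meta key "url" and all values are illustrative:
from haystack.schema import Document
# A plain text Document; by default the id is generated as a hash of the content.
doc = Document(content="Berlin is the capital of Germany.", content_type="text")
# Include metadata in the hash so two Documents with identical text but different
# meta (here an illustrative "url" key) get distinct ids.
doc_with_meta = Document(
    content="Berlin is the capital of Germany.",
    meta={"url": "https://example.com/berlin"},
    id_hash_keys=["content", "meta.url"],
)
# The two ids should differ because the second hash also covers meta.url.
print(doc.id != doc_with_meta.id)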
Document.to_dict
def to_dict(field_map: Optional[Dict[str, Any]] = None) -> Dict
Convert Document to dict. An optional field_map can be supplied to change the names of the keys in the
resulting dict. This way you can work with standardized Document objects in Haystack, but adjust the format that they are serialized / stored in other places (e.g. Elasticsearch). Example:
doc = Document(content="some text", content_type="text")
doc.to_dict(field_map={"custom_content_field": "content"})
# Returns {"custom_content_field": "some text", content_type": "text"}
Arguments:
- field_map: Dict with keys being the custom target keys and values being the standard Document attributes.
Returns:
A dict with the content of the Document.
Document.from_dict
@classmethod
def from_dict(cls,
dict: Dict[str, Any],
field_map: Optional[Dict[str, Any]] = None) -> Document
Create a Document from a dict. An optional field_map parameter can be supplied to adjust for custom names of the keys in the input dict. This way you can work with standardized Document objects in Haystack, but adjust the format that they are serialized / stored in other places (e.g. Elasticsearch).
Example:
my_dict = {"custom_content_field": "some text", "content_type": "text"}
Document.from_dict(my_dict, field_map={"custom_content_field": "content"})
Arguments:
- field_map: Dict with keys being the custom target keys and values being the standard Document attributes.
Returns:
A Document object
Document.__lt__
def __lt__(other)
Enable sorting of Documents by score
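Since __lt__ compares by score, scored Documents can be ordered with Python's built-in sorting. A small sketch (scores are illustrative):
from haystack.schema import Document
docs = [Document(content="a", score=0.1), Document(content="b", score=0.9)]
# Highest-scoring Document first; __lt__ makes sorted() compare by score.
ranked = sorted(docs, reverse=True)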
Span
@dataclass
class Span()
Defining a sequence of characters (Text span) or cells (Table span) via start and end index.
For extractive QA: Character where answer starts/ends
Arguments:
- start: Position where the span starts.
- end: Position where the span ends.
Span.__contains__
def __contains__(value)
Checks whether the given value falls within the interval defined by the Span.
assert 10 in Span(5, 15) # True
assert 20 in Span(1, 15) # False
Includes the left edge, but not the right edge.
assert 5 in Span(5, 15) # True
assert 15 in Span(5, 15) # False
Works for numbers and all values that can be safely converted into floats.
assert 10.0 in Span(5, 15) # True
assert "10" in Span(5, 15) # True
It also works for Span objects, returning True only if the given Span is fully contained in the original Span. As with numerical values, the left edge is included, the right edge is not.
assert Span(10, 11) in Span(5, 15) # True
assert Span(5, 10) in Span(5, 15) # True
assert Span(10, 15) in Span(5, 15) # False
assert Span(5, 15) in Span(5, 15) # False
assert Span(5, 14) in Span(5, 15) # True
assert Span(0, 1) in Span(5, 15) # False
assert Span(0, 10) in Span(5, 15) # False
assert Span(10, 20) in Span(5, 15) # False
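A short usage sketch of the containment check, for example to test whether an answer offset lies inside a labeled region (all values are illustrative):
from haystack.schema import Span
labeled_region = Span(start=100, end=160)
answer_offset = Span(start=120, end=135)
assert 120 in labeled_region            # single position, left edge included
assert answer_offset in labeled_region  # the whole answer span lies inside the region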
TableCell
@dataclass
class TableCell()
Defining a table cell via the row and column index.
Arguments:
- row: Row index of the cell.
- col: Column index of the cell.
Answer
@dataclass
class Answer()
The fundamental object in Haystack to represent any type of Answer (e.g. extractive QA, generative QA or TableQA).
For example, it's used within some Nodes like the Reader, but also in the REST API.
Arguments:
- answer: The answer string. If there's no possible answer (aka "no_answer" or "is_impossible"), this will be an empty string.
- type: One of ("generative", "extractive", "other"): Whether this answer comes from an extractive model (i.e. we can locate an exact answer string in one of the documents) or from a generative model (i.e. no pointer to a specific document, no offsets ...).
- score: The relevance score of the Answer determined by a model (e.g. Reader or Generator). In the range of [0, 1], where 1 means extremely relevant.
- context: The related content that was used to create the answer (i.e. a text passage, part of a table, image ...).
- offsets_in_document: List of Span objects with start and end positions of the answer in the document (as stored in the document store). For extractive QA: character where the answer starts => Answer.offsets_in_document[0].start. For TableQA: cell where the answer starts (counted from top left to bottom right of the table) => Answer.offsets_in_document[0].start. (Note that in TableQA there can be multiple cell ranges relevant for the answer, so there can be multiple Spans here.)
- offsets_in_context: List of Span objects with start and end positions of the answer in the context (i.e. the surrounding text/table of a certain window size). For extractive QA: character where the answer starts => Answer.offsets_in_context[0].start. For TableQA: cell where the answer starts (counted from top left to bottom right of the table) => Answer.offsets_in_context[0].start. (Note that in TableQA there can be multiple cell ranges relevant for the answer, so there can be multiple Spans here.)
- document_ids: IDs of the documents the answer came from (if any). For extractive QA, this will be a list of length 1. For generative QA, this will be a list of length > 0.
- meta: Dict that can be used to associate any kind of custom meta data with the answer. In extractive QA, this will carry the meta data of the document where the answer was found.
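As a sketch, an extractive Answer might be constructed like this, using the fields documented above (all values are illustrative):
from haystack.schema import Answer, Span
answer = Answer(
    answer="Berlin",
    type="extractive",
    score=0.87,
    context="Berlin is the capital of Germany.",
    offsets_in_document=[Span(start=0, end=6)],
    offsets_in_context=[Span(start=0, end=6)],
    document_ids=["d1a2b3"],          # illustrative document ID
    meta={"name": "germany.txt"},
)
print(answer.offsets_in_document[0].start)  # 0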
Answer.__lt__
def __lt__(other)
Enable sorting of Answers by score
Label
@dataclass
class Label()
Label.__init__
def __init__(query: str,
document: Document,
is_correct_answer: bool,
is_correct_document: bool,
origin: Literal["user-feedback", "gold-label"],
answer: Optional[Answer],
id: Optional[str] = None,
pipeline_id: Optional[str] = None,
created_at: Optional[str] = None,
updated_at: Optional[str] = None,
meta: Optional[dict] = None,
filters: Optional[Dict[str, Any]] = None)
Object used to represent label/feedback in a standardized way within Haystack.
This includes labels from datasets like SQuAD, annotations from labeling tools, or user feedback from the Haystack REST API.
Arguments:
- query: The question (or query) for finding answers.
- document: The document this label refers to.
- answer: The answer object.
- is_correct_answer: Whether the sample is positive or negative.
- is_correct_document: In case of a negative sample (is_correct_answer is False), there could be two cases: incorrect answer but correct document, or incorrect document. This flag denotes whether the returned document was correct.
- origin: The source of the label. It can be used later for filtering.
- id: Unique ID used within the DocumentStore. If not supplied, a uuid will be generated automatically.
- pipeline_id: Pipeline identifier (any str) that was involved in generating this label (in case of user feedback).
- created_at: Timestamp of creation with format yyyy-MM-dd HH:mm:ss. Generate in Python via time.strftime("%Y-%m-%d %H:%M:%S").
- updated_at: Timestamp of the last update with format yyyy-MM-dd HH:mm:ss. Generate in Python via time.strftime("%Y-%m-%d %H:%M:%S").
- meta: Meta fields like "annotator_name" in the form of a custom dict (any keys and values allowed).
- filters: Filters that should be applied to the query to rule out non-relevant documents. For example, if there are different correct answers in a DocumentStore depending on the retrieved document, and the answer in this label is correct only under the condition of the filters.
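A sketch of recording a piece of user feedback as a Label (assumes the haystack.schema import path; query, document, and answer are illustrative):
from haystack.schema import Document, Answer, Label
label = Label(
    query="What is the capital of Germany?",
    document=Document(content="Berlin is the capital of Germany."),
    answer=Answer(answer="Berlin", type="extractive"),
    is_correct_answer=True,
    is_correct_document=True,
    origin="user-feedback",
)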
MultiLabel
class MultiLabel()
MultiLabel.__init__
def __init__(labels: List[Label],
drop_negative_labels=False,
drop_no_answers=False)
There are often multiple Labels associated with a single query. For example, there can be multiple annotated answers for one question, or multiple documents can contain the information you want for a query. This class is "syntactic sugar" that simplifies working with such a list of related Labels. It stores the original labels in MultiLabel.labels and provides additional aggregated attributes that are automatically created at init time. For example, MultiLabel.no_answer lets you easily check whether any of the underlying Labels provided a text answer and therefore whether there is indeed a possible answer.
Arguments:
- labels: A list of labels that belong to a similar query and shall be "grouped" together.
- drop_negative_labels: Whether to drop negative labels from that group (e.g. thumbs down feedback from the UI).
- drop_no_answers: Whether to drop labels that specify that the answer is impossible.
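A sketch of grouping Labels for the same query and reading the aggregated view (reuses the label object from the Label sketch above):
from haystack.schema import MultiLabel
multi = MultiLabel(labels=[label], drop_negative_labels=True, drop_no_answers=False)
print(len(multi.labels))  # the original Labels are kept in MultiLabel.labels
print(multi.no_answer)    # aggregated flag described above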
EvaluationResult
class EvaluationResult()
EvaluationResult.__init__
def __init__(node_results: Optional[Dict[str, DataFrame]] = None) -> None
A convenience class to store, pass, and interact with the results of a pipeline evaluation run (for example, pipeline.eval()).
Detailed results are stored as one dataframe per node. This class makes them more accessible and provides convenience methods to work with them. For example, you can calculate eval metrics, get detailed reports, or simulate different top_k settings:
eval_results = pipeline.eval(...)
# derive detailed metrics
eval_results.calculate_metrics()
# show summary of incorrect queries for a node (here, a hypothetical "Reader" node)
eval_results.wrong_examples(node="Reader")
Each row of the underlying DataFrames contains either an answer or a document that has been retrieved during evaluation. Rows are enriched with basic information like rank, query, type, or node. Additional answer or document-specific evaluation information, like gold labels and metrics showing whether the row matches the gold labels, are included, too. The DataFrames have the following schema:
- multilabel_id: The ID of the multilabel, which is unique for the pair of query and filters.
- query: The actual query string.
- filters: The filters used with the query.
- gold_answers (answers only): The expected answers.
- answer (answers only): The actual answer.
- context: The content of the document (the surrounding context of the answer for QA).
- exact_match (answers only): A metric showing if the answer exactly matches the gold label.
- f1 (answers only): A metric showing how well the answer overlaps with the gold label on a token basis.
- sas (answers only, optional): A metric showing how well the answer matches the gold label on a semantic basis.
- exact_match_context_scope (answers only): exact_match with enforced context match.
- f1_context_scope (answers only): f1 with enforced context scope match.
- sas_context_scope (answers only): sas with enforced context scope match.
- exact_match_document_scope (answers only): exact_match with enforced document scope match.
- f1_document_scope (answers only): f1 with enforced document scope match.
- sas_document_scope (answers only): sas with enforced document scope match.
- exact_match_document_id_and_context_scope (answers only): exact_match with enforced document and context scope match.
- f1_document_id_and_context_scope (answers only): f1 with enforced document and context scope match.
- sas_document_id_and_context_scope (answers only): sas with enforced document and context scope match.
- gold_contexts: The contents of the gold documents.
- gold_id_match (documents only): A metric showing whether one of the gold document IDs matches the document.
- context_match (documents only): A metric showing whether one of the gold contexts matches the document content.
- answer_match (documents only): A metric showing whether the document contains the answer.
- gold_id_or_answer_match (documents only): A Boolean operation specifying that there should be either 'gold_id_match' OR 'answer_match'.
- gold_id_and_answer_match (documents only): A Boolean operation specifying that there should be both 'gold_id_match' AND 'answer_match'.
- gold_id_or_context_match (documents only): A Boolean operation specifying that there should be either 'gold_id_match' OR 'context_match'.
- gold_id_and_context_match (documents only): A Boolean operation specifying that there should be both 'gold_id_match' AND 'context_match'.
- gold_id_and_context_and_answer_match (documents only): A Boolean operation specifying that there should be 'gold_id_match' AND 'context_match' AND 'answer_match'.
- context_and_answer_match (documents only): A Boolean operation specifying that there should be both 'context_match' AND 'answer_match'.
- rank: A rank or 1-based position in the result list.
- document_id: The ID of the document that has been retrieved or that contained the answer.
- gold_document_ids: The IDs of the documents to be retrieved.
- custom_document_id: The custom ID of the document (specified by custom_document_id_field) that has been retrieved or that contained the answer.
- gold_custom_document_ids: The custom document IDs (specified by custom_document_id_field) to be retrieved.
- offsets_in_document (answers only): The position or offsets within the document where the answer was found.
- gold_offsets_in_documents (answers only): The position or offsets of the gold answer within the document.
- gold_answers_exact_match (answers only): exact_match values per gold_answer.
- gold_answers_f1 (answers only): f1 values per gold_answer.
- gold_answers_sas (answers only): sas values per gold answer.
- gold_documents_id_match: The document ID match per gold label (if custom_document_id_field has been specified, custom IDs are used).
- gold_contexts_similarity: Context similarity per gold label.
- gold_answers_match (documents only): Specifies whether the document contains an answer per gold label.
- type: Possible values: 'answer' or 'document'.
- node: The node name
- eval_mode: Specifies whether the evaluation was executed in integrated or isolated mode. Check pipeline.eval()'s add_isolated_node_eval parameter for more information.
Arguments:
- node_results: The evaluation DataFrames per pipeline node.
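Besides the convenience methods, the per-node DataFrames can be inspected directly through node_results. A sketch, continuing from the eval_results example above and assuming nodes named "Retriever" and "Reader":
# One DataFrame per node, following the schema described above.
retriever_df = eval_results.node_results["Retriever"]
reader_df = eval_results.node_results["Reader"]
# For example, inspect the documents retrieved for the first query.
first_query = retriever_df["query"].iloc[0]
print(retriever_df[retriever_df["query"] == first_query][["rank", "document_id", "gold_id_match"]])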
EvaluationResult.calculate_metrics
def calculate_metrics(
simulated_top_k_reader: int = -1,
simulated_top_k_retriever: int = -1,
document_scope: Literal[
"document_id",
"context",
"document_id_and_context",
"document_id_or_context",
"answer",
"document_id_or_answer",
] = "document_id_or_answer",
eval_mode: Literal["integrated", "isolated"] = "integrated",
answer_scope: Literal["any", "context", "document_id",
"document_id_and_context"] = "any"
) -> Dict[str, Dict[str, float]]
Calculates proper metrics for each node.
For Nodes that return Documents, the default metrics are:
- mrr (Mean Reciprocal Rank: https://en.wikipedia.org/wiki/Mean_reciprocal_rank)
- map (Mean Average Precision: https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29#Mean_average_precision)
- ndcg (Normalized Discounted Cumulative Gain: https://en.wikipedia.org/wiki/Discounted_cumulative_gain)
- precision (Precision: How many of the returned documents were relevant?)
- recall_multi_hit (Recall according to Information Retrieval definition: How many of the relevant documents were retrieved per query?)
- recall_single_hit (Recall for Question Answering: How many of the queries returned at least one relevant document?)
For Nodes that return answers, the default metrics are:
- exact_match (How many of the queries returned the exact answer?)
- f1 (How well do the returned results overlap with any gold answer on a token basis?)
- sas, if a SAS model has been provided when calling pipeline.eval() (How semantically similar is the prediction to the gold answers?)
During the eval run, you can simulate lower top_k values for Reader and Retriever than the actual values.
For example, you can calculate top_1_f1 for Reader nodes by setting simulated_top_k_reader=1.
If you applied simulated_top_k_retriever to a Reader node, treat the results with caution, as they can differ heavily from an actual eval run with a corresponding top_k_retriever.
Arguments:
- simulated_top_k_reader: Simulates the top_k parameter of the Reader.
- simulated_top_k_retriever: Simulates the top_k parameter of the Retriever. Note: There might be a discrepancy between simulated Reader metrics and an actual Pipeline run with Retriever top_k.
- eval_mode: The input the Node was evaluated on. Usually a Node gets evaluated on the prediction provided by its predecessor Nodes in the Pipeline (value='integrated'). However, as the quality of the Node can heavily depend on the Node's input and thus the predecessor's quality, you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your Node. For example, when evaluating the Reader, use value='isolated' to simulate a perfect Retriever in an ExtractiveQAPipeline. Possible values are: 'integrated', 'isolated'. The default value is 'integrated'.
- document_scope: A criterion for deciding whether documents are relevant or not. You can select between:
  - 'document_id': Specifies that the document ID must match. You can specify a custom document ID through pipeline.eval()'s custom_document_id_field param. A typical use case is Document Retrieval.
  - 'context': Specifies that the content of the document must match. Uses fuzzy matching (see pipeline.eval()'s context_matching_... params). A typical use case is Document-Independent Passage Retrieval.
  - 'document_id_and_context': A Boolean operation specifying that both 'document_id' AND 'context' must match. A typical use case is Document-Specific Passage Retrieval.
  - 'document_id_or_context': A Boolean operation specifying that either 'document_id' OR 'context' must match. A typical use case is Document Retrieval having sparse context labels.
  - 'answer': Specifies that the document contents must include the answer. The selected answer_scope is enforced automatically. A typical use case is Question Answering.
  - 'document_id_or_answer' (default): A Boolean operation specifying that either 'document_id' OR 'answer' must match. This is intended to be a proper default value in order to support both main use cases: Document Retrieval and Question Answering.
  The default value is 'document_id_or_answer'.
- answer_scope: Specifies the scope in which a matching answer is considered correct. You can select between:
  - 'any' (default): Any matching answer is considered correct.
  - 'context': The answer is only considered correct if its context matches as well. Uses fuzzy matching (see pipeline.eval()'s context_matching_... params).
  - 'document_id': The answer is only considered correct if its document ID matches as well. You can specify a custom document ID through pipeline.eval()'s custom_document_id_field param.
  - 'document_id_and_context': The answer is only considered correct if its document ID and its context match as well.
  The default value is 'any'.
  In Question Answering, to enforce that the retrieved document is considered correct whenever the answer is correct, set document_scope to 'answer' or 'document_id_or_answer'.
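A sketch of how these parameters combine in practice, continuing from the eval_results example above (the node names "Retriever" and "Reader" are assumptions; the metric keys follow the defaults listed above):
metrics = eval_results.calculate_metrics(
    simulated_top_k_reader=1,       # report top-1 answer metrics
    document_scope="document_id",   # strict Document Retrieval scope
    answer_scope="any",
)
# Returns a nested dict: {node_name: {metric_name: value}}
print(metrics["Retriever"]["recall_single_hit"])
print(metrics["Reader"]["f1"])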
EvaluationResult.wrong_examples
def wrong_examples(
node: str,
n: int = 3,
simulated_top_k_reader: int = -1,
simulated_top_k_retriever: int = -1,
document_scope: Literal[
"document_id",
"context",
"document_id_and_context",
"document_id_or_context",
"answer",
"document_id_or_answer",
] = "document_id_or_answer",
document_metric: str = "recall_single_hit",
answer_metric: str = "f1",
document_metric_threshold: float = 0.5,
answer_metric_threshold: float = 0.5,
eval_mode: Literal["integrated", "isolated"] = "integrated",
answer_scope: Literal["any", "context", "document_id",
"document_id_and_context"] = "any"
) -> List[Dict]
Returns the worst performing queries.
Worst performing queries are calculated based on either a document metric or an answer metric, depending on the node type.
Lower top_k values for reader and retriever than the actual values during the eval run can be simulated. See calculate_metrics() for more information.
Arguments:
- simulated_top_k_reader: Simulates the top_k param of the Reader.
- simulated_top_k_retriever: Simulates the top_k param of the Retriever. Note: there might be a discrepancy between simulated Reader metrics and an actual Pipeline run with Retriever top_k.
- document_metric: The document metric that worst queries are calculated with. Values can be: 'recall_single_hit', 'recall_multi_hit', 'mrr', 'map', 'precision'.
- answer_metric: The answer metric that worst queries are calculated with. Values can be: 'f1', 'exact_match', and 'sas' if the evaluation was made using a SAS model.
- document_metric_threshold: The threshold for the document metric (only samples below the selected metric threshold will be considered).
- answer_metric_threshold: The threshold for the answer metric (only samples below the selected metric threshold will be considered).
- eval_mode: The input the node was evaluated on. Usually nodes get evaluated on the prediction provided by their predecessor nodes in the pipeline (value='integrated'). However, as the quality of the node itself can heavily depend on the node's input and thus the predecessor's quality, you might want to simulate a perfect predecessor in order to get an independent upper bound of the quality of your node. For example, when evaluating the Reader, use value='isolated' to simulate a perfect Retriever in an ExtractiveQAPipeline. Values can be 'integrated', 'isolated'. The default value is 'integrated'.
- document_scope: A criterion for deciding whether documents are relevant or not. You can select between:
  - 'document_id': Specifies that the document ID must match. You can specify a custom document ID through pipeline.eval()'s custom_document_id_field param. A typical use case is Document Retrieval.
  - 'context': Specifies that the content of the document must match. Uses fuzzy matching (see pipeline.eval()'s context_matching_... params). A typical use case is Document-Independent Passage Retrieval.
  - 'document_id_and_context': A Boolean operation specifying that both 'document_id' AND 'context' must match. A typical use case is Document-Specific Passage Retrieval.
  - 'document_id_or_context': A Boolean operation specifying that either 'document_id' OR 'context' must match. A typical use case is Document Retrieval having sparse context labels.
  - 'answer': Specifies that the document contents must include the answer. The selected answer_scope is enforced automatically. A typical use case is Question Answering.
  - 'document_id_or_answer' (default): A Boolean operation specifying that either 'document_id' OR 'answer' must match. This is intended to be a proper default value in order to support both main use cases: Document Retrieval and Question Answering.
  The default value is 'document_id_or_answer'.
- answer_scope: Specifies the scope in which a matching answer is considered correct. You can select between:
  - 'any' (default): Any matching answer is considered correct.
  - 'context': The answer is only considered correct if its context matches as well. Uses fuzzy matching (see pipeline.eval()'s context_matching_... params).
  - 'document_id': The answer is only considered correct if its document ID matches as well. You can specify a custom document ID through pipeline.eval()'s custom_document_id_field param.
  - 'document_id_and_context': The answer is only considered correct if its document ID and its context match as well.
  The default value is 'any'.
  In Question Answering, to enforce that the retrieved document is considered correct whenever the answer is correct, set document_scope to 'answer' or 'document_id_or_answer'.
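A sketch of pulling the worst performing queries per node type, continuing from the eval_results example above (node names are assumptions):
# Queries where the Retriever missed all relevant documents.
retriever_misses = eval_results.wrong_examples(
    node="Retriever",
    n=5,
    document_metric="recall_single_hit",
)
# Queries where the Reader's answers overlap poorly with the gold answers.
reader_misses = eval_results.wrong_examples(node="Reader", answer_metric="f1")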
EvaluationResult.save
def save(out_dir: Union[str, Path], **to_csv_kwargs)
Saves the evaluation result.
The result of each node is saved as a separate csv file named {node_name}.csv in the out_dir folder.
Arguments:
- out_dir: Path to the target folder the csvs will be saved to.
- to_csv_kwargs: kwargs to be passed to DataFrame.to_csv(). See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html. This method uses different default values than DataFrame.to_csv() for the following parameters: index=False, quoting=csv.QUOTE_NONNUMERIC (to avoid problems with \r chars).
EvaluationResult.load
@classmethod
def load(cls, load_dir: Union[str, Path], **read_csv_kwargs)
Loads the evaluation result from disk. Expects one csv file per node. See save() for further information.
Arguments:
- load_dir: The directory containing the csv files.
- read_csv_kwargs: kwargs to be passed to pd.read_csv(). See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. This method uses different default values than pd.read_csv() for the following parameters: header=0, converters=CONVERTERS where CONVERTERS is a dictionary mapping all array-typed columns to ast.literal_eval.
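A sketch of persisting an evaluation run and reloading it later for further analysis (assumes the haystack.schema import path and the eval_results object from the examples above):
from pathlib import Path
from haystack.schema import EvaluationResult
out_dir = Path("eval_results")   # one {node_name}.csv per node is written here
eval_results.save(out_dir)
# Later / elsewhere: restore the result and keep working with it.
restored = EvaluationResult.load(out_dir)
metrics = restored.calculate_metrics()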