The utility classes of Haystack.
Module docs2answers
Docs2Answers
class Docs2Answers(BaseComponent)
This node converts retrieved documents into the predicted-answers format.
It is useful when you call a Retriever-only pipeline through the REST API, because it ensures that the output is in a compatible format.
Arguments:
- progress_bar: Whether to show a progress bar.
Module join_docs
JoinDocuments
class JoinDocuments(JoinNode)
A node to join documents output by multiple retriever nodes.
The node allows multiple join modes:
- concatenate: combine the documents from multiple nodes. Any duplicate documents are discarded. The score is only determined by the last node that outputs the document.
- merge: merge the scores of documents from multiple nodes. Optionally, each input score can be given a different weight, and a top_k limit can be set. This mode can also be used for "reranking" retrieved documents.
- reciprocal_rank_fusion: combine the documents based on their rank in multiple nodes.
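The reciprocal_rank_fusion mode can be pictured with a small, library-independent sketch. It is not the node's actual implementation: the function name, the lists of document IDs standing in for Documents, and the constant k = 60 are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The constant k is an illustrative assumption.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
# "b" ranks first because it is near the top of both lists.
```

Only the rank positions matter here, so the fusion works even when the input nodes produce scores on different scales.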
JoinDocuments.__init__
def __init__(join_mode: str = "concatenate",
weights: Optional[List[float]] = None,
top_k_join: Optional[int] = None,
sort_by_score: bool = True)
Arguments:
- join_mode: concatenate to combine documents from multiple retrievers, merge to aggregate scores of individual documents, reciprocal_rank_fusion to apply rank-based scoring.
- weights: A node-wise list of weights (the length of the list must equal the number of input nodes) for adjusting document scores when using the merge join_mode. By default, equal weight is given to each retriever score. This parameter is not compatible with the concatenate join_mode.
- top_k_join: Limit documents to top_k based on the resulting scores of the join.
- sort_by_score: Whether to sort the incoming documents by their score. Set this to True if all your Documents come with score values. Set it to False if any of the Documents come from sources where the score is set to None, like TfidfRetriever on Elasticsearch.
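How weights and top_k_join interact in the merge mode can be sketched as follows. This is a minimal illustration under stated assumptions, not the node's code: merge_scores is a hypothetical helper, (ID, score) tuples stand in for Documents, and the equal-weight default of 1 / number of inputs is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

ScoredDoc = Tuple[str, float]  # (document ID, score); a stand-in for Document


def merge_scores(
    results: List[List[ScoredDoc]],
    weights: Optional[List[float]] = None,
    top_k_join: Optional[int] = None,
) -> List[ScoredDoc]:
    """Add up the weighted scores each document received from every input."""
    if weights is None:
        # Assumed default: the same weight for every input.
        weights = [1.0 / len(results)] * len(results)
    if len(weights) != len(results):
        raise ValueError("Provide exactly one weight per input node.")

    merged: Dict[str, float] = defaultdict(float)
    for node_results, weight in zip(results, weights):
        for doc_id, score in node_results:
            merged[doc_id] += weight * score

    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k_join] if top_k_join is not None else ranked


sparse = [("a", 0.9), ("b", 0.4)]
dense = [("b", 0.8), ("c", 0.6)]
top_two = merge_scores([sparse, dense], weights=[0.25, 0.75], top_k_join=2)
# "b" leads because both inputs returned it; "a" is cut off by top_k_join.
```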
Module join_answers
JoinAnswers
class JoinAnswers(JoinNode)
A node to join Answers produced by multiple Reader nodes.
JoinAnswers.__init__
def __init__(join_mode: str = "concatenate",
weights: Optional[List[float]] = None,
top_k_join: Optional[int] = None,
sort_by_score: bool = True)
Arguments:
- join_mode: "concatenate" to combine Answers from multiple Readers, "merge" to aggregate scores of individual Answers.
- weights: A node-wise list of weights (the length of the list must equal the number of input nodes) for adjusting Answer scores when using the "merge" join_mode. By default, equal weight is assigned to each Reader score. This parameter is not compatible with the "concatenate" join_mode.
- top_k_join: Limit Answers to top_k based on the resulting scores of the join.
- sort_by_score: Whether to sort the incoming answers by their score. Set this to True if your Answers come from a Reader or TableReader. Set it to False if any Answers come from a Generator, since a Generator assigns None as the score of each Answer.
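The reason for the sort_by_score switch can be shown with a short sketch. It assumes plain (text, score) tuples in place of Answer objects and a hypothetical join_answers helper; it only illustrates why a None score and sorting do not mix.

```python
from typing import List, Optional, Tuple

ScoredAnswer = Tuple[str, Optional[float]]  # (answer text, score); a stand-in for Answer


def join_answers(
    results: List[List[ScoredAnswer]],
    sort_by_score: bool = True,
    top_k_join: Optional[int] = None,
) -> List[ScoredAnswer]:
    """Concatenate answers from several inputs, optionally sorting by score."""
    joined = [answer for node_results in results for answer in node_results]
    if sort_by_score:
        # Comparing None with a float raises TypeError in Python 3,
        # so sorting only works when every answer has a score.
        joined.sort(key=lambda answer: answer[1], reverse=True)
    return joined[:top_k_join] if top_k_join is not None else joined


reader_answers = [("Paris", 0.95), ("Lyon", 0.40)]
generator_answers = [("The capital is Paris.", None)]

# Mixed inputs: skip sorting and keep the incoming order.
unsorted = join_answers([reader_answers, generator_answers], sort_by_score=False)
# Scored inputs only: sorting is safe.
ranked = join_answers([reader_answers, [("Nice", 0.70)]])
```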
Module route_documents
RouteDocuments
class RouteDocuments(BaseComponent)
A node to split a list of Documents by content_type or by the values of a metadata field and route them to different nodes.
RouteDocuments.__init__
def __init__(split_by: str = "content_type",
metadata_values: Optional[Union[List[str],
List[List[str]]]] = None,
return_remaining: bool = False)
Arguments:
- split_by: Field to split the documents by, either "content_type" or a metadata field name. If this parameter is set to "content_type", the list of Documents is split into a list containing only Documents of type "text" (routed to "output_1") and a list containing only Documents of type "table" (routed to "output_2"). If this parameter is set to a metadata field name, you must also specify the metadata_values parameter.
- metadata_values: A list of values to group Documents by metadata field. If split_by is set to a metadata field name, you must provide a list of values (or a list of lists of values) to group the Documents by. If metadata_values is a list of strings, the Documents whose metadata field is equal to the corresponding value are routed to the output with the same index. If metadata_values is a list of lists, the Documents whose metadata field is equal to the first value of the provided sublist are routed to "output_1", the Documents whose metadata field is equal to the second value of the provided sublist are routed to "output_2", and so on.
- return_remaining: Whether to return all remaining documents that don't match split_by or metadata_values in an additional output route. This additional route is numbered one higher than the last regular output route. For example, if there would normally be "output_1" and "output_2" when return_remaining is False, then the additional output route is "output_3" when return_remaining is True.
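The routing rules for a flat list of metadata_values can be sketched in a few lines. The helper name route_by_metadata and the dictionaries standing in for Documents are assumptions for illustration; the list-of-lists form and the "content_type" split are left out.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]  # {"content": ..., "meta": {...}}; a stand-in for Document


def route_by_metadata(
    documents: List[Doc],
    split_by: str,
    metadata_values: List[str],
    return_remaining: bool = False,
) -> Dict[str, List[Doc]]:
    """Group documents into output_1, output_2, ... by a metadata field."""
    routes: Dict[str, List[Doc]] = {
        f"output_{i}": [] for i in range(1, len(metadata_values) + 1)
    }
    remaining_route = f"output_{len(metadata_values) + 1}"
    if return_remaining:
        # The extra route is numbered one past the last regular route.
        routes[remaining_route] = []

    for doc in documents:
        value = doc["meta"].get(split_by)
        if value in metadata_values:
            routes[f"output_{metadata_values.index(value) + 1}"].append(doc)
        elif return_remaining:
            routes[remaining_route].append(doc)
    return routes


docs = [
    {"content": "Q3 report", "meta": {"language": "en"}},
    {"content": "Quartalsbericht", "meta": {"language": "de"}},
    {"content": "Rapport", "meta": {"language": "fr"}},
]
routed = route_by_metadata(docs, "language", ["en", "de"], return_remaining=True)
# The French document matches neither value, so it lands in "output_3".
```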
Module document_merger
DocumentMerger
class DocumentMerger(BaseComponent)
A node to merge the texts of the documents.
DocumentMerger.__init__
def __init__(separator: str = " ")
Arguments:
- separator: The separator that appears between subsequent merged documents.
DocumentMerger.merge
def merge(documents: List[Document],
separator: Optional[str] = None) -> List[Document]
Produce a list made up of a single document, which contains all the texts of the documents provided.
Arguments:
- separator: The separator that appears between subsequent merged documents.
Returns:
List of Documents
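As a rough picture of what merging does, the sketch below joins plain strings instead of Documents. TextMerger is a hypothetical stand-in, and the rule that a separator passed to merge takes precedence over the one given at construction is an assumption suggested by the two signatures, not a confirmed detail.

```python
from typing import List, Optional


class TextMerger:
    """Hypothetical stand-in that merges plain strings instead of Documents."""

    def __init__(self, separator: str = " ") -> None:
        self.separator = separator

    def merge(self, texts: List[str], separator: Optional[str] = None) -> List[str]:
        # Assumption: a separator passed here overrides the instance default.
        sep = self.separator if separator is None else separator
        # The result is a list with exactly one merged element.
        return [sep.join(texts)]


merger = TextMerger(separator=" | ")
default_merge = merger.merge(["first", "second", "third"])
custom_merge = merger.merge(["first", "second"], separator="\n")
```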
Module shaper
rename
def rename(value: Any) -> Any
An identity function. You can use it to rename values in the invocation context without changing them.
Example:
assert rename(1) == 1
value_to_list
def value_to_list(value: Any, target_list: List[Any]) -> List[Any]
Transforms a value into a list containing this value as many times as the length of the target list.
Example:
assert value_to_list(value=1, target_list=list(range(5))) == [1, 1, 1, 1, 1]
join_lists
def join_lists(lists: List[List[Any]]) -> List[Any]
Joins the lists you pass to it into a single list.
Example:
assert join_lists(lists=[[1, 2, 3], [4, 5]]) == [1, 2, 3, 4, 5]
join_strings
def join_strings(strings: List[str],
delimiter: str = " ",
str_replace: Optional[Dict[str, str]] = None) -> str
Transforms a list of strings into a single string. The content of this string is the content of all of the original strings separated by the delimiter you specify.
Example:
assert join_strings(strings=["first", "second", "third"], delimiter=" - ", str_replace={"r": "R"}) == "fiRst - second - thiRd"
format_string
def format_string(string: str,
str_replace: Optional[Dict[str, str]] = None) -> str
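Replaces substrings of a string: each key of str_replace found in the string is replaced with the corresponding value.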
Example:
assert format_string(string="first", str_replace={"r": "R"}) == "fiRst"
join_documents
def join_documents(
documents: List[Document],
delimiter: str = " ",
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None) -> List[Document]
Transforms a list of documents into a list containing a single document. The content of this document
is the joined result of all original documents, separated by the delimiter you specify.
Use regex in the pattern parameter to control how each document is represented.
You can use the following placeholders:
- $content: The content of the document.
- $idx: The index of the document in the list.
- $id: The ID of the document.
- $META_FIELD: The value of the metadata field called 'META_FIELD'.
All metadata is dropped.
Example:
assert join_documents(
    documents=[
        Document(content="first"),
        Document(content="second"),
        Document(content="third")
    ],
    delimiter=" - ",
    pattern="[$idx] $content",
    str_replace={"r": "R"}
) == [Document(content="[1] fiRst - [2] second - [3] thiRd")]
join_documents_and_scores
def join_documents_and_scores(
documents: List[Document]) -> Tuple[List[Document]]
Transforms a list of documents with scores in their metadata into a list containing a single document. The resulting document contains the scores and the contents of all the original documents. All metadata is dropped.
Example:
assert join_documents_and_scores(
    documents=[
        Document(content="first", meta={"score": 0.9}),
        Document(content="second", meta={"score": 0.7}),
        Document(content="third", meta={"score": 0.5})
    ]
) == ([Document(content="-[0.9] first\n -[0.7] second\n -[0.5] third")], )
format_document
def format_document(document: Document,
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None,
idx: Optional[int] = None) -> str
Transforms a document into a single string.
Use regex in the pattern parameter to control how the document is represented.
You can use the following placeholders:
- $content: The content of the document.
- $idx: The index of the document in the list.
- $id: The ID of the document.
- $META_FIELD: The value of the metadata field called 'META_FIELD'.
Example:
assert format_document(
    document=Document(content="first"),
    pattern="prefix [$idx] $content",
    str_replace={"r": "R"},
    idx=1,
) == "prefix [1] fiRst"
format_answer
def format_answer(answer: Answer,
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None,
idx: Optional[int] = None) -> str
Transforms an answer into a single string.
Use regex in the pattern parameter to control how the answer is represented.
You can use the following placeholders:
- $answer: The answer text.
- $idx: The index of the answer in the list.
- $META_FIELD: The value of the metadata field called 'META_FIELD'.
Example:
assert format_answer(
    answer=Answer(answer="first"),
    pattern="prefix [$idx] $answer",
    str_replace={"r": "R"},
    idx=1,
) == "prefix [1] fiRst"
join_documents_to_string
def join_documents_to_string(
documents: List[Document],
delimiter: str = " ",
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None) -> str
Transforms a list of documents into a single string. The content of this string
is the joined result of all original documents separated by the delimiter you specify.
Use regex in the pattern parameter to control how the documents are represented.
You can use the following placeholders:
- $content: The content of the document.
- $idx: The index of the document in the list.
- $id: The ID of the document.
- $META_FIELD: The value of the metadata field called 'META_FIELD'.
Example:
assert join_documents_to_string(
    documents=[
        Document(content="first"),
        Document(content="second"),
        Document(content="third")
    ],
    delimiter=" - ",
    pattern="[$idx] $content",
    str_replace={"r": "R"}
) == "[1] fiRst - [2] second - [3] thiRd"
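The interplay of pattern, str_replace, and delimiter shown in the example can be approximated with the standard library. This sketch works on plain strings, supports only the $content and $idx placeholders, and uses string.Template for the substitution, which is an assumption about the mechanics rather than a description of the library's code. The helper names format_text and join_texts are hypothetical.

```python
from string import Template
from typing import Dict, List, Optional


def format_text(
    content: str,
    idx: int,
    pattern: str = "$content",
    str_replace: Optional[Dict[str, str]] = None,
) -> str:
    """Apply replacements to the content, then fill the placeholders."""
    for old, new in (str_replace or {}).items():
        content = content.replace(old, new)
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(pattern).safe_substitute(content=content, idx=idx)


def join_texts(
    texts: List[str],
    delimiter: str = " ",
    pattern: str = "$content",
    str_replace: Optional[Dict[str, str]] = None,
) -> str:
    """Format each text with a one-based index and join the results."""
    return delimiter.join(
        format_text(text, idx, pattern, str_replace)
        for idx, text in enumerate(texts, start=1)
    )


joined = join_texts(
    ["first", "second", "third"],
    delimiter=" - ",
    pattern="[$idx] $content",
    str_replace={"r": "R"},
)
```

In this sketch the replacements touch only the content, never the pattern itself, which matches the expected output in the example above.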
strings_to_answers
def strings_to_answers(
strings: List[str],
prompts: Optional[List[Union[str, List[Dict[str, str]]]]] = None,
documents: Optional[List[Document]] = None,
pattern: Optional[str] = None,
reference_pattern: Optional[str] = None,
reference_mode: Literal["index", "id", "meta"] = "index",
reference_meta_field: Optional[str] = None) -> List[Answer]
Transforms a list of strings into a list of answers.
Specify reference_pattern to populate the answer's document_ids by extracting document references from the strings.
Arguments:
- strings: The list of strings to transform.
- prompts: The prompts used to generate the answers.
- documents: The documents used to generate the answers.
- pattern: The regex pattern to use for parsing the answer. Examples: [^\n]+$ finds "this is an answer" in the string "this is an argument.\nthis is an answer". Answer: (.*) finds "this is an answer" in the string "this is an argument. Answer: this is an answer". If None, the whole string is used as the answer. If not None, the first group of the regex is used as the answer. If there is no group, the whole match is used as the answer.
- reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] finds "1" in the string "this is an answer[1]". If None, no parsing is done and all documents are referenced.
- reference_mode: The mode used to reference documents. Supported modes are:
  - index: the document references are the one-based index of the document in the list of documents. Example: "this is an answer[1]" references the first document in the list of documents.
  - id: the document references are the document IDs. Example: "this is an answer[123]" references the document with the ID "123".
  - meta: the document references are the value of a metadata field of the document. Example: "this is an answer[123]" references the document with the value "123" in the metadata field specified by reference_meta_field.
- reference_meta_field: The name of the metadata field to use for document references in reference_mode "meta".
Returns:
The list of answers.
Examples:
Without reference parsing:
assert strings_to_answers(
    strings=["first", "second", "third"],
    prompts=["prompt"],
    documents=[Document(id="123", content="content")]
) == [
    Answer(answer="first", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
    Answer(answer="second", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
    Answer(answer="third", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
]
With reference parsing:
assert strings_to_answers(
    strings=["first[1]", "second[2]", "third[1][3]"],
    prompts=["prompt"],
    documents=[Document(id="123", content="content"), Document(id="456", content="content"), Document(id="789", content="content")],
    reference_pattern=r"\[(\d+)\]",
    reference_mode="index"
) == [
    Answer(answer="first", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
    Answer(answer="second", type="generative", document_ids=["456"], meta={"prompt": "prompt"}),
    Answer(answer="third", type="generative", document_ids=["123", "789"], meta={"prompt": "prompt"}),
]
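The rule for the pattern parameter (first group if there is one, otherwise the whole match, and the whole string if no pattern is given) can be tried out with the standard re module. The helper extract_answer is hypothetical, and returning an empty string when nothing matches is an assumption, since the text does not say what happens in that case.

```python
import re
from typing import Optional


def extract_answer(string: str, pattern: Optional[str] = None) -> str:
    """Return the part of the string selected by the pattern."""
    if pattern is None:
        return string
    match = re.search(pattern, string)
    if match is None:
        # Assumption: nothing matched, so there is no answer text.
        return ""
    # Prefer the first capture group; fall back to the whole match.
    return match.group(1) if match.groups() else match.group(0)


# No group in the pattern: the whole match (the last line) is the answer.
last_line = extract_answer("this is an argument.\nthis is an answer", r"[^\n]+$")
# One group in the pattern: only the captured text is the answer.
after_label = extract_answer(
    "this is an argument. Answer: this is an answer", r"Answer: (.*)"
)
```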
string_to_answer
def string_to_answer(string: str,
prompt: Optional[Union[str, List[Dict[str, str]]]],
documents: Optional[List[Document]],
pattern: Optional[str] = None,
reference_pattern: Optional[str] = None,
reference_mode: Literal["index", "id", "meta"] = "index",
reference_meta_field: Optional[str] = None) -> Answer
Transforms a string into an answer.
Specify reference_pattern to populate the answer's document_ids by extracting document references from the string.
Arguments:
- string: The string to transform.
- prompt: The prompt used to generate the answer.
- documents: The documents used to generate the answer.
- pattern: The regex pattern to use for parsing the answer. Examples: [^\n]+$ finds "this is an answer" in the string "this is an argument.\nthis is an answer". Answer: (.*) finds "this is an answer" in the string "this is an argument. Answer: this is an answer". If None, the whole string is used as the answer. If not None, the first group of the regex is used as the answer. If there is no group, the whole match is used as the answer.
- reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] finds "1" in the string "this is an answer[1]". If None, no parsing is done and all documents are referenced.
- reference_mode: The mode used to reference documents. Supported modes are:
  - index: the document references are the one-based index of the document in the list of documents. Example: "this is an answer[1]" references the first document in the list of documents.
  - id: the document references are the document IDs. Example: "this is an answer[123]" references the document with the ID "123".
  - meta: the document references are the value of a metadata field of the document. Example: "this is an answer[123]" references the document with the value "123" in the metadata field specified by reference_meta_field.
- reference_meta_field: The name of the metadata field to use for document references in reference_mode "meta".
Returns:
The answer
parse_references
def parse_references(
string: str,
reference_pattern: Optional[str] = None,
candidates: Optional[Dict[str, str]] = None) -> Optional[List[str]]
Parses an answer string for document references and returns the document IDs of the referenced documents.
Arguments:
- string: The string to parse.
- reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] finds "1" in the string "this is an answer[1]". If None, no parsing is done and all candidate document IDs are returned.
- candidates: A dictionary of candidates to choose from. The keys are the reference strings and the values are the document IDs. If None, no parsing is done and None is returned.
Returns:
A list of document IDs.
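The relationship between the reference strings, the candidates dictionary, and the returned IDs can be sketched as below. find_references is a hypothetical helper built on re.findall, and silently skipping references that have no candidate is an assumption; the candidates here are built the way reference_mode "index" is described, with one-based positions as keys.

```python
import re
from typing import Dict, List, Optional


def find_references(
    string: str,
    reference_pattern: Optional[str] = None,
    candidates: Optional[Dict[str, str]] = None,
) -> Optional[List[str]]:
    """Map reference strings found in the text to document IDs."""
    if candidates is None:
        return None
    if reference_pattern is None:
        # No parsing: every candidate counts as referenced.
        return list(candidates.values())
    # With exactly one capture group, findall returns the captured strings.
    found = re.findall(reference_pattern, string)
    # Assumption: references without a matching candidate are skipped.
    return [candidates[ref] for ref in found if ref in candidates]


document_ids = ["123", "456", "789"]
# Keys are one-based positions in the document list, as in mode "index".
by_index = {str(i): doc_id for i, doc_id in enumerate(document_ids, start=1)}

referenced = find_references("third[1][3]", r"\[(\d+)\]", by_index)
```

For the "id" or "meta" modes, only the keys of the candidates dictionary would change, for example to the document IDs themselves or to the values of the chosen metadata field.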
answers_to_strings
def answers_to_strings(
answers: List[Answer],
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None) -> List[str]
Extracts the content field of answers and returns a list of strings.
Example:
assert answers_to_strings(
    answers=[
        Answer(answer="first"),
        Answer(answer="second"),
        Answer(answer="third")
    ],
    pattern="[$idx] $answer",
    str_replace={"r": "R"}
) == ["[1] fiRst", "[2] second", "[3] thiRd"]
strings_to_documents
def strings_to_documents(
strings: List[str],
meta: Union[List[Optional[Dict[str, Any]]],
Optional[Dict[str, Any]]] = None,
id_hash_keys: Optional[List[str]] = None) -> List[Document]
Transforms a list of strings into a list of documents. If you pass the metadata in a single
dictionary, all documents get the same metadata. If you pass the metadata as a list, the length of this list
must be the same as the length of the list of strings, and each document gets its own metadata.
You can specify id_hash_keys only once and it gets assigned to all documents.
Example:
assert strings_to_documents(
    strings=["first", "second", "third"],
    meta=[{"position": i} for i in range(3)],
    id_hash_keys=['content', 'meta']
) == [
    Document(content="first", meta={"position": 0}, id_hash_keys=['content', 'meta']),
    Document(content="second", meta={"position": 1}, id_hash_keys=['content', 'meta']),
    Document(content="third", meta={"position": 2}, id_hash_keys=['content', 'meta'])
]
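The two ways of passing metadata (one dictionary shared by all strings, or one dictionary per string) can be sketched without the library. pair_with_meta is a hypothetical helper, and dictionaries stand in for Documents.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Dict[str, Any]]


def pair_with_meta(
    strings: List[str], meta: Union[List[Meta], Meta] = None
) -> List[Dict[str, Any]]:
    """Attach metadata to each string: one dict for all, or one dict each."""
    if isinstance(meta, list):
        if len(meta) != len(strings):
            raise ValueError("The meta list must be as long as the list of strings.")
        all_meta = meta
    else:
        # A single dict (or None) is repeated for every string.
        all_meta = [meta] * len(strings)
    # Copy each dict so the resulting entries do not share state.
    return [
        {"content": s, "meta": dict(m) if m else {}}
        for s, m in zip(strings, all_meta)
    ]


shared = pair_with_meta(["first", "second"], meta={"source": "my_file.txt"})
individual = pair_with_meta(
    ["first", "second"], meta=[{"position": 0}, {"position": 1}]
)
```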
documents_to_strings
def documents_to_strings(
documents: List[Document],
pattern: Optional[str] = None,
str_replace: Optional[Dict[str, str]] = None) -> List[str]
Extracts the content field of documents and returns a list of strings. Use regex in the pattern parameter to control how the documents are represented.
Example:
assert documents_to_strings(
    documents=[
        Document(content="first"),
        Document(content="second"),
        Document(content="third")
    ],
    pattern="[$idx] $content",
    str_replace={"r": "R"}
) == ["[1] fiRst", "[2] second", "[3] thiRd"]
Shaper
class Shaper(BaseComponent)
Shaper is a component that can invoke arbitrary, registered functions on the invocation context (query, documents, and so on) of a pipeline. It then passes the new or modified variables further down the pipeline.
Using YAML configuration, the Shaper component is initialized with functions to invoke on the pipeline invocation context.
For example, in the YAML snippet below:
components:
  - name: shaper
    type: Shaper
    params:
      func: value_to_list
      inputs:
        value: query
        target_list: documents
      outputs: [questions]
the Shaper component is initialized with a directive to invoke the function value_to_list on the variable query, using documents as the target list, and to store the result in the invocation context variable questions. All other invocation context variables are passed down the pipeline as they are.
You can use multiple Shaper components in a pipeline to modify the invocation context as needed.
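The idea of invoking a function on the invocation context can be pictured with a small stand-alone sketch. run_shaper is a hypothetical helper, not the component's API: it looks up the mapped inputs in a dictionary, calls the function, and stores the result under the given output keys, leaving every other variable untouched. The fallback behavior of fixed params is modeled as described for the params argument.

```python
from typing import Any, Callable, Dict, List, Optional


def value_to_list(value: Any, target_list: List[Any]) -> List[Any]:
    """Repeat value once for every element of target_list."""
    return [value] * len(target_list)


def run_shaper(
    func: Callable[..., Any],
    invocation_context: Dict[str, Any],
    outputs: List[str],
    inputs: Optional[Dict[str, str]] = None,
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Call func with context-backed and fixed arguments and store the result."""
    kwargs = dict(params or {})  # fixed values act as fallbacks
    for arg_name, context_key in (inputs or {}).items():
        if context_key in invocation_context:
            kwargs[arg_name] = invocation_context[context_key]
    result = func(**kwargs)
    # Treat a non-tuple result as a single output value.
    results = result if isinstance(result, tuple) else (result,)
    if len(results) != len(outputs):
        raise ValueError("The number of outputs must match the function's results.")
    updated = dict(invocation_context)
    updated.update(zip(outputs, results))
    return updated


context = {"query": "What is Haystack?", "documents": ["doc1", "doc2", "doc3"]}
context = run_shaper(
    value_to_list,
    context,
    outputs=["questions"],
    inputs={"value": "query", "target_list": "documents"},
)
# context["questions"] now holds the query once per document.
```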
Currently, Shaper supports the following functions:
rename
value_to_list
join_lists
join_strings
format_string
join_documents
join_documents_and_scores
format_document
format_answer
join_documents_to_string
strings_to_answers
string_to_answer
parse_references
answers_to_strings
strings_to_documents
documents_to_strings
See their descriptions in the code for details about their inputs, outputs, and other parameters.
Shaper.__init__
def __init__(func: str,
outputs: List[str],
inputs: Optional[Dict[str, Union[List[str], str]]] = None,
params: Optional[Dict[str, Any]] = None,
publish_outputs: Union[bool, List[str]] = True)
Initializes the Shaper component.
Some examples:
- name: shaper
  type: Shaper
  params:
    func: value_to_list
    inputs:
      value: query
      target_list: documents
    outputs:
      - questions
This node takes the content of query and creates a list that contains the value of query len(documents) times. This list is stored in the invocation context under the key questions.
- name: shaper
  type: Shaper
  params:
    func: join_documents
    inputs:
      documents: documents
    params:
      delimiter: ' - '
    outputs:
      - documents
This node overwrites the content of documents in the invocation context with a list containing a single Document whose content is the concatenation of all the original Documents. So if documents contained [Document("A"), Document("B"), Document("C")], this shaper overwrites it with [Document("A - B - C")].
- name: shaper
  type: Shaper
  params:
    func: join_strings
    params:
      strings: ['a', 'b', 'c']
      delimiter: ' . '
    outputs:
      - single_string
- name: shaper
  type: Shaper
  params:
    func: strings_to_documents
    inputs:
      strings: single_string
    params:
      meta:
        name: 'my_file.txt'
    outputs:
      - single_document
These two nodes, executed one after the other, first add a key called single_string to the invocation context that contains a . b . c, and then create another key called single_document that contains [Document(content="a . b . c", meta={'name': 'my_file.txt'})].
Arguments:
- func: The function to apply.
- inputs: Maps the function's input kwargs to the key-value pairs in the invocation context. For example, value_to_list expects the value and target_list parameters, so inputs might contain {'value': 'query', 'target_list': 'documents'}. It doesn't need to contain all keyword arguments, see params.
- params: Maps the function's input kwargs to fixed values. For example, value_to_list expects the value and target_list parameters, so params might contain {'value': 'A', 'target_list': [1, 1, 1, 1]}, and the node's output is ["A", "A", "A", "A"]. It doesn't need to contain all keyword arguments, see inputs. You can use params to provide fallback values for arguments of run that you're not sure exist. So if you need query to exist, you can provide a fallback value in params, which is used only if query is not passed to this node by the pipeline.
- outputs: The keys under which to store the outputs in the invocation context. The length of outputs must match the number of outputs produced by the invoked function.
- publish_outputs: Controls whether to publish the outputs to the pipeline's output. Set it to True (the default) to publish all outputs, or to False to publish none. For example, if outputs = ["documents"], the result for publish_outputs = True looks like:
{
    "invocation_context": {
        "documents": [...]
    },
    "documents": [...]
}
For publish_outputs = False, the result looks like:
{
    "invocation_context": {
        "documents": [...]
    },
}
If you want finer-grained control, pass a list of the outputs you want to publish.
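The three settings of publish_outputs can be compared side by side with a small sketch. build_result is a hypothetical helper that only mimics the result shapes shown above; it is not how the component assembles its output.

```python
from typing import Any, Dict, List, Union


def build_result(
    invocation_context: Dict[str, Any],
    outputs: List[str],
    publish_outputs: Union[bool, List[str]] = True,
) -> Dict[str, Any]:
    """Assemble a node result, publishing all, none, or selected outputs."""
    if publish_outputs is True:
        to_publish = outputs
    elif publish_outputs is False:
        to_publish = []
    else:
        # A list selects which of the node's outputs to publish.
        to_publish = [key for key in outputs if key in publish_outputs]

    result: Dict[str, Any] = {"invocation_context": invocation_context}
    for key in to_publish:
        result[key] = invocation_context[key]
    return result


context = {"documents": ["doc"], "questions": ["question"]}
outputs = ["documents", "questions"]

publish_all = build_result(context, outputs, publish_outputs=True)
publish_none = build_result(context, outputs, publish_outputs=False)
publish_some = build_result(context, outputs, publish_outputs=["documents"])
```

In all three cases the invocation context itself still carries every output, so later nodes can read the values even when they are not published.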