
The utility classes of Haystack.

Module docs2answers

Docs2Answers

class Docs2Answers(BaseComponent)

This node converts retrieved documents into the predicted answers format.

It is useful when you call a Retriever-only pipeline via the REST API, because it ensures that the output is in a compatible format.

Arguments:

  • progress_bar: Whether to show a progress bar

Module join_docs

JoinDocuments

class JoinDocuments(JoinNode)

A node to join documents outputted by multiple retriever nodes.

The node allows multiple join modes:

  • concatenate: combine the documents from multiple nodes. In case of duplicate documents, the one with the highest score is kept.
  • merge: merge scores of documents from multiple nodes. Optionally, each input score can be given a different weight & a top_k limit can be set. This mode can also be used for "reranking" retrieved documents.
  • reciprocal_rank_fusion: combines the documents based on their rank in multiple nodes.
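As a rough illustration of the reciprocal_rank_fusion mode, the sketch below fuses ranked lists of document IDs. It is not Haystack's implementation: the function name, the plain ID lists, the 1-based ranks, and the constant k = 60 are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Assumption: ranks start at 1 and each list contributes 1 / (k + rank).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Documents that rank well in several lists accumulate the highest score.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
print(fused)
```

Only the position of a document in each list matters here, which is why this mode can combine retrievers whose raw scores are not comparable.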

JoinDocuments.__init__

def __init__(join_mode: str = "concatenate",
             weights: Optional[List[float]] = None,
             top_k_join: Optional[int] = None,
             sort_by_score: bool = True)

Arguments:

  • join_mode: concatenate to combine documents from multiple retrievers, merge to aggregate scores of individual documents, or reciprocal_rank_fusion to apply rank-based scoring.
  • weights: A node-wise list (length of list must be equal to the number of input nodes) of weights for adjusting document scores when using the merge join_mode. By default, equal weight is given to each retriever score. This parameter is not compatible with the concatenate join_mode.
  • top_k_join: Limit documents to top_k based on the resulting scores of the join.
  • sort_by_score: Whether to sort the incoming documents by their score. Set this to True if all your Documents are coming with score values. Set to False if any of the Documents come from sources where the score is set to None, like TfidfRetriever on Elasticsearch.
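The interplay of weights and top_k_join in the merge mode can be pictured with a small sketch. It assumes each retriever's result is a plain dictionary from document ID to score and that the merged score is a weighted sum; the function name and data layout are hypothetical and do not reproduce Haystack's internals.

```python
from typing import Dict, List, Optional, Tuple


def merge_scores(
    results: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
    top_k_join: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Combine per-retriever scores into one ranked list of (doc_id, score)."""
    if weights is None:
        # Assumption: equal weight for every input node by default.
        weights = [1.0 / len(results)] * len(results)
    if len(weights) != len(results):
        raise ValueError("weights must have one entry per input node")
    merged: Dict[str, float] = {}
    for result, weight in zip(results, weights):
        for doc_id, score in result.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + weight * score
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k_join is None else ranked[:top_k_join]


top = merge_scores(
    [{"a": 0.9, "b": 0.4}, {"b": 0.8, "c": 0.6}],
    weights=[0.5, 0.5],
    top_k_join=2,
)
print([doc_id for doc_id, _ in top])
```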

Module join_answers

JoinAnswers

class JoinAnswers(JoinNode)

A node to join Answers produced by multiple Reader nodes.

JoinAnswers.__init__

def __init__(join_mode: str = "concatenate",
             weights: Optional[List[float]] = None,
             top_k_join: Optional[int] = None,
             sort_by_score: bool = True)

Arguments:

  • join_mode: "concatenate" to combine Answers from multiple Readers. "merge" to aggregate scores of individual Answers.
  • weights: A node-wise list (length of list must be equal to the number of input nodes) of weights for adjusting Answer scores when using the "merge" join_mode. By default, equal weight is assigned to each Reader score. This parameter is not compatible with the "concatenate" join_mode.
  • top_k_join: Limit Answers to top_k based on the resulting scores of the join.
  • sort_by_score: Whether to sort the incoming answers by their score. Set this to True if your Answers are coming from a Reader or TableReader. Set to False if any Answers come from a Generator since this assigns None as a score to each.

Module route_documents

RouteDocuments

class RouteDocuments(BaseComponent)

A node to split a list of Documents by content_type or by the values of a metadata field and route them to different nodes.

RouteDocuments.__init__

def __init__(split_by: str = "content_type",
             metadata_values: Optional[Union[List[Union[str, bool, int]],
                                             List[List[Union[str, bool,
                                                             int]]]]] = None,
             return_remaining: bool = False)

Arguments:

  • split_by: Field to split the documents by, either "content_type" or a metadata field name. If this parameter is set to "content_type", the list of Documents will be split into a list containing only Documents of type "text" (will be routed to "output_1") and a list containing only Documents of type "table" (will be routed to "output_2"). If this parameter is set to a metadata field name, you need to specify the parameter metadata_values as well.
  • metadata_values: A list of values to group Documents by metadata field. If the parameter split_by is set to a metadata field name, you must provide a list of values (or a list of lists of values) to group the Documents by. If metadata_values is a list of strings, then the Documents whose metadata field is equal to the corresponding value will be routed to the output with the same index. If metadata_values is a list of lists, then the Documents whose metadata field is equal to the first value of the provided sublist will be routed to "output_1", the Documents whose metadata field is equal to the second value of the provided sublist will be routed to "output_2", and so on.
  • return_remaining: Whether to return all remaining documents that don't match the split_by or metadata_values into an additional output route. This additional route is numbered one higher than the last regular output route. For example, if there would normally be "output_1" and "output_2" when return_remaining is False, then when return_remaining is True the additional output route would be "output_3".
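The routing by metadata values can be sketched with plain dictionaries standing in for Documents. This is an illustrative reading of the arguments above, not the library's code: the function name is hypothetical, and treating each sublist as a group of accepted values for one output is an assumption.

```python
from typing import Any, Dict, List


def route_by_metadata(
    documents: List[Dict[str, Any]],
    field: str,
    metadata_values: List[Any],
    return_remaining: bool = False,
) -> Dict[str, List[Dict[str, Any]]]:
    """Split documents into output_1, output_2, ... by a metadata field."""
    # Assumption: a bare value is treated like a one-element group.
    groups = [v if isinstance(v, list) else [v] for v in metadata_values]
    outputs: Dict[str, List[Dict[str, Any]]] = {
        f"output_{i}": [] for i in range(1, len(groups) + 1)
    }
    remaining = []
    for doc in documents:
        value = doc.get("meta", {}).get(field)
        for i, group in enumerate(groups, start=1):
            if value in group:
                outputs[f"output_{i}"].append(doc)
                break
        else:
            remaining.append(doc)
    if return_remaining:
        # The extra route is numbered one past the last regular route.
        outputs[f"output_{len(groups) + 1}"] = remaining
    return outputs


docs = [
    {"content": "x", "meta": {"lang": "en"}},
    {"content": "y", "meta": {"lang": "de"}},
    {"content": "z", "meta": {"lang": "fr"}},
]
routed = route_by_metadata(docs, "lang", ["en", "de"], return_remaining=True)
print(sorted(routed))
```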

Module document_merger

DocumentMerger

class DocumentMerger(BaseComponent)

A node to merge the texts of the documents.

DocumentMerger.__init__

def __init__(separator: str = " ")

Arguments:

  • separator: The separator that appears between subsequent merged documents.

DocumentMerger.merge

def merge(documents: List[Document],
          separator: Optional[str] = None) -> List[Document]

Produce a list made up of a single document, which contains all the texts of the documents provided.

Arguments:

  • separator: The separator that appears between subsequent merged documents.

Returns:

List of Documents
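A minimal sketch of the merging behavior, using plain strings in place of Documents. The class name is hypothetical, and the rule that a separator passed to merge takes precedence over the one given at initialization is an assumption based on the two signatures above.

```python
from typing import List, Optional


class MergerSketch:
    """Illustrative stand-in for a node that merges texts into one."""

    def __init__(self, separator: str = " ") -> None:
        self.separator = separator

    def merge(self, texts: List[str], separator: Optional[str] = None) -> List[str]:
        # Assumption: the per-call separator wins when it is provided.
        sep = separator if separator is not None else self.separator
        # The result is still a list, but it holds a single merged text.
        return [sep.join(texts)]


merger = MergerSketch()
print(merger.merge(["first", "second"]))
```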

Module shaper

rename

def rename(value: Any) -> Any

An identity function. You can use it to rename values in the invocation context without changing them.

Example:

assert rename(1) == 1

current_datetime

def current_datetime(format: str = "%H:%M:%S %d/%m/%y") -> str

Function that outputs the current time and/or date formatted according to the parameters.

Example:

assert current_datetime("%d.%m.%y %H:%M:%S") == "01.01.23 12:30:10"  # when called at 12:30:10 on 1 January 2023

value_to_list

def value_to_list(value: Any, target_list: List[Any]) -> List[Any]

Transforms a value into a list containing this value as many times as the length of the target list.

Example:

assert value_to_list(value=1, target_list=list(range(5))) == [1, 1, 1, 1, 1]

join_lists

def join_lists(lists: List[List[Any]]) -> List[Any]

Joins the lists you pass to it into a single list.

Example:

assert join_lists(lists=[[1, 2, 3], [4, 5]]) == [1, 2, 3, 4, 5]

join_strings

def join_strings(strings: List[str],
                 delimiter: str = " ",
                 str_replace: Optional[Dict[str, str]] = None) -> str

Transforms a list of strings into a single string. The content of this string is the content of all of the original strings separated by the delimiter you specify.

Example:

assert join_strings(strings=["first", "second", "third"], delimiter=" - ", str_replace={"r": "R"}) == "fiRst - second - thiRd"

format_string

def format_string(string: str,
                  str_replace: Optional[Dict[str, str]] = None) -> str

Replaces substrings of a string according to the str_replace mapping, where each key is replaced by its value.

Example:

assert format_string(string="first", str_replace={"r": "R"}) == "fiRst"

join_documents

def join_documents(
        documents: List[Document],
        delimiter: str = " ",
        pattern: Optional[str] = None,
        str_replace: Optional[Dict[str, str]] = None) -> List[Document]

Transforms a list of documents into a list containing a single document. The content of this document is the joined result of all original documents, separated by the delimiter you specify. Use regex in the pattern parameter to control how each document is represented. You can use the following placeholders:

  • $content: The content of the document.
  • $idx: The index of the document in the list.
  • $id: The ID of the document.
  • $META_FIELD: The value of the metadata field called 'META_FIELD'.

All metadata is dropped.

Example:

assert join_documents(
    documents=[
        Document(content="first"),
        Document(content="second"),
        Document(content="third")
    ],
    delimiter=" - ",
    pattern="[$idx] $content",
    str_replace={"r": "R"}
) == [Document(content="[1] fiRst - [2] second - [3] thiRd")]

join_documents_and_scores

def join_documents_and_scores(
        documents: List[Document]) -> Tuple[List[Document]]

Transforms a list of documents with scores in their metadata into a list containing a single document. The resulting document contains the scores and the contents of all the original documents. All metadata is dropped.

Example:

assert join_documents_and_scores(
    documents=[
        Document(content="first", meta={"score": 0.9}),
        Document(content="second", meta={"score": 0.7}),
        Document(content="third", meta={"score": 0.5})
    ]
) == ([Document(content="-[0.9] first\n-[0.7] second\n-[0.5] third")], )

format_document

def format_document(document: Document,
                    pattern: Optional[str] = None,
                    str_replace: Optional[Dict[str, str]] = None,
                    idx: Optional[int] = None) -> str

Transforms a document into a single string. Use regex in the pattern parameter to control how the document is represented. You can use the following placeholders:

  • $content: The content of the document.
  • $idx: The index of the document in the list.
  • $id: The ID of the document.
  • $META_FIELD: The value of the metadata field called 'META_FIELD'.

Example:

assert format_document(
    document=Document(content="first"),
    pattern="prefix [$idx] $content",
    str_replace={"r": "R"},
    idx=1,
) == "prefix [1] fiRst"
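The placeholder substitution can be approximated with Python's string.Template, which uses the same $name syntax. The sketch below is an assumption about the mechanics rather than Haystack's code: it takes the document fields as plain arguments and applies str_replace to the content before filling the pattern, which is consistent with the example above, where the "r" in "prefix" stays lowercase.

```python
from string import Template
from typing import Any, Dict, Optional


def format_document_sketch(
    content: str,
    doc_id: str = "",
    meta: Optional[Dict[str, Any]] = None,
    pattern: Optional[str] = None,
    str_replace: Optional[Dict[str, str]] = None,
    idx: Optional[int] = None,
) -> str:
    """Render one document as a string using $-style placeholders."""
    # Replacements touch only the content, not the surrounding pattern text.
    for old, new in (str_replace or {}).items():
        content = content.replace(old, new)
    if pattern is None:
        return content
    values: Dict[str, Any] = dict(meta or {})
    values.update({"content": content, "id": doc_id, "idx": "" if idx is None else idx})
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(pattern).safe_substitute(values)


formatted = format_document_sketch(
    "first", pattern="prefix [$idx] $content", str_replace={"r": "R"}, idx=1
)
print(formatted)
```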

format_answer

def format_answer(answer: Answer,
                  pattern: Optional[str] = None,
                  str_replace: Optional[Dict[str, str]] = None,
                  idx: Optional[int] = None) -> str

Transforms an answer into a single string. Use regex in the pattern parameter to control how the answer is represented. You can use the following placeholders:

  • $answer: The answer text.
  • $idx: The index of the answer in the list.
  • $META_FIELD: The value of the metadata field called 'META_FIELD'.

Example:

assert format_answer(
    answer=Answer(answer="first"),
    pattern="prefix [$idx] $answer",
    str_replace={"r": "R"},
    idx=1,
) == "prefix [1] fiRst"

join_documents_to_string

def join_documents_to_string(
        documents: List[Document],
        delimiter: str = " ",
        pattern: Optional[str] = None,
        str_replace: Optional[Dict[str, str]] = None) -> str

Transforms a list of documents into a single string. The content of this string is the joined result of all original documents separated by the delimiter you specify. Use regex in the pattern parameter to control how the documents are represented. You can use the following placeholders:

  • $content: The content of the document.
  • $idx: The index of the document in the list.
  • $id: The ID of the document.
  • $META_FIELD: The value of the metadata field called 'META_FIELD'.

Example:

assert join_documents_to_string(
    documents=[
        Document(content="first"),
        Document(content="second"),
        Document(content="third")
    ],
    delimiter=" - ",
    pattern="[$idx] $content",
    str_replace={"r": "R"}
) == "[1] fiRst - [2] second - [3] thiRd"

strings_to_answers

def strings_to_answers(
        strings: List[str],
        prompts: Optional[List[Union[str, List[Dict[str, str]]]]] = None,
        documents: Optional[List[Document]] = None,
        pattern: Optional[str] = None,
        reference_pattern: Optional[str] = None,
        reference_mode: Literal["index", "id", "meta"] = "index",
        reference_meta_field: Optional[str] = None) -> List[Answer]

Transforms a list of strings into a list of answers.

Specify reference_pattern to populate the answer's document_ids by extracting document references from the strings.

Arguments:

  • strings: The list of strings to transform.
  • prompts: The prompts used to generate the answers.
  • documents: The documents used to generate the answers.
  • pattern: The regex pattern to use for parsing the answer. Examples: [^\n]+$ will find "this is an answer" in string "this is an argument.\nthis is an answer". Answer: (.*) will find "this is an answer" in string "this is an argument. Answer: this is an answer". If None, the whole string is used as the answer. If not None, the first group of the regex is used as the answer. If there is no group, the whole match is used as the answer.
  • reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] will find "1" in string "this is an answer[1]". If None, no parsing is done and all documents are referenced.
  • reference_mode: The mode used to reference documents. Supported modes are:
      • index: the document references are the one-based index of the document in the list of documents. Example: "this is an answer[1]" will reference the first document in the list of documents.
      • id: the document references are the document IDs. Example: "this is an answer[123]" will reference the document with id "123".
      • meta: the document references are the value of a metadata field of the document. Example: "this is an answer[123]" will reference the document with the value "123" in the metadata field specified by reference_meta_field.
  • reference_meta_field: The name of the metadata field to use for document references in reference_mode "meta".

Returns:

The list of answers.

Examples:

Without reference parsing:
```python
assert strings_to_answers(strings=["first", "second", "third"], prompts=["prompt"], documents=[Document(id="123", content="content")]) == [
        Answer(answer="first", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
        Answer(answer="second", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
        Answer(answer="third", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
    ]
```

With reference parsing:
```python
assert strings_to_answers(strings=["first[1]", "second[2]", "third[1][3]"], prompts=["prompt"],
        documents=[Document(id="123", content="content"), Document(id="456", content="content"), Document(id="789", content="content")],
        reference_pattern=r"\[(\d+)\]",
        reference_mode="index"
    ) == [
        Answer(answer="first", type="generative", document_ids=["123"], meta={"prompt": "prompt"}),
        Answer(answer="second", type="generative", document_ids=["456"], meta={"prompt": "prompt"}),
        Answer(answer="third", type="generative", document_ids=["123", "789"], meta={"prompt": "prompt"}),
    ]
```

string_to_answer

def string_to_answer(string: str,
                     prompt: Optional[Union[str, List[Dict[str, str]]]],
                     documents: Optional[List[Document]],
                     pattern: Optional[str] = None,
                     reference_pattern: Optional[str] = None,
                     reference_mode: Literal["index", "id", "meta"] = "index",
                     reference_meta_field: Optional[str] = None) -> Answer

Transforms a string into an answer.

Specify reference_pattern to populate the answer's document_ids by extracting document references from the string.

Arguments:

  • string: The string to transform.
  • prompt: The prompt used to generate the answer.
  • documents: The documents used to generate the answer.
  • pattern: The regex pattern to use for parsing the answer. Examples: [^\n]+$ will find "this is an answer" in string "this is an argument.\nthis is an answer". Answer: (.*) will find "this is an answer" in string "this is an argument. Answer: this is an answer". If None, the whole string is used as the answer. If not None, the first group of the regex is used as the answer. If there is no group, the whole match is used as the answer.
  • reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] will find "1" in string "this is an answer[1]". If None, no parsing is done and all documents are referenced.
  • reference_mode: The mode used to reference documents. Supported modes are:
      • index: the document references are the one-based index of the document in the list of documents. Example: "this is an answer[1]" will reference the first document in the list of documents.
      • id: the document references are the document IDs. Example: "this is an answer[123]" will reference the document with id "123".
      • meta: the document references are the value of a metadata field of the document. Example: "this is an answer[123]" will reference the document with the value "123" in the metadata field specified by reference_meta_field.
  • reference_meta_field: The name of the metadata field to use for document references in reference_mode "meta".

Returns:

The answer.
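The rules for the answer pattern (whole string when no pattern is given, first group when the regex has one, whole match otherwise) can be sketched with Python's re module. The function name is hypothetical, and returning an empty string when nothing matches is an assumption, since the description above does not cover that case.

```python
import re
from typing import Optional


def extract_answer(string: str, pattern: Optional[str] = None) -> str:
    """Pick the answer text out of a generated string."""
    if pattern is None:
        return string
    match = re.search(pattern, string)
    if match is None:
        return ""  # Assumption: no match yields an empty answer.
    # Prefer the first capture group; fall back to the whole match.
    return match.group(1) if match.groups() else match.group(0)


last_line = extract_answer("this is an argument.\nthis is an answer", r"[^\n]+$")
after_label = extract_answer(
    "this is an argument. Answer: this is an answer", r"Answer: (.*)"
)
print(last_line, "|", after_label)
```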

parse_references

def parse_references(
        string: str,
        reference_pattern: Optional[str] = None,
        candidates: Optional[Dict[str, str]] = None) -> Optional[List[str]]

Parses an answer string for document references and returns the document IDs of the referenced documents.

Arguments:

  • string: The string to parse.
  • reference_pattern: The regex pattern to use for parsing the document references. Example: \[(\d+)\] will find "1" in string "this is an answer[1]". If None, no parsing is done and all candidate document IDs are returned.
  • candidates: A dictionary of candidates to choose from. The keys are the reference strings and the values are the document IDs. If None, no parsing is done and None is returned.

Returns:

A list of document IDs.
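A small sketch of how reference parsing could work with the candidates mapping described above. It follows the documented edge cases (None when there are no candidates, all candidate IDs when there is no pattern); the function name, the de-duplication of repeated references, and the silent skipping of unknown references are assumptions.

```python
import re
from typing import Dict, List, Optional


def parse_references_sketch(
    string: str,
    reference_pattern: Optional[str] = None,
    candidates: Optional[Dict[str, str]] = None,
) -> Optional[List[str]]:
    """Map references found in the string to document IDs."""
    if candidates is None:
        return None
    if reference_pattern is None:
        return list(candidates.values())
    document_ids: List[str] = []
    for reference in re.findall(reference_pattern, string):
        doc_id = candidates.get(reference)
        # Assumption: unknown references are skipped, duplicates kept once.
        if doc_id is not None and doc_id not in document_ids:
            document_ids.append(doc_id)
    return document_ids


# For the "index" mode, the keys are one-based positions in the document list.
doc_ids = ["123", "456", "789"]
index_candidates = {str(i): doc_id for i, doc_id in enumerate(doc_ids, start=1)}
referenced = parse_references_sketch("third[1][3]", r"\[(\d+)\]", index_candidates)
print(referenced)
```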

answers_to_strings

def answers_to_strings(
        answers: List[Answer],
        pattern: Optional[str] = None,
        str_replace: Optional[Dict[str, str]] = None) -> List[str]

Extracts the answer field of answers and returns a list of strings. Use the pattern parameter to control how the answers are represented.

Example:

assert answers_to_strings(
        answers=[
            Answer(answer="first"),
            Answer(answer="second"),
            Answer(answer="third")
        ],
        pattern="[$idx] $answer",
        str_replace={"r": "R"}
    ) == ["[1] fiRst", "[2] second", "[3] thiRd"]

strings_to_documents

def strings_to_documents(
        strings: List[str],
        meta: Union[List[Optional[Dict[str, Any]]],
                    Optional[Dict[str, Any]]] = None,
        id_hash_keys: Optional[List[str]] = None) -> List[Document]

Transforms a list of strings into a list of documents. If you pass the metadata in a single dictionary, all documents get the same metadata. If you pass the metadata as a list, the length of this list must be the same as the length of the list of strings, and each document gets its own metadata. You can specify id_hash_keys only once and it gets assigned to all documents.

Example:

assert strings_to_documents(
        strings=["first", "second", "third"],
        meta=[{"position": i} for i in range(3)],
        id_hash_keys=['content', 'meta']
    ) == [
        Document(content="first", meta={"position": 0}, id_hash_keys=['content', 'meta']),
        Document(content="second", meta={"position": 1}, id_hash_keys=['content', 'meta']),
        Document(content="third", meta={"position": 2}, id_hash_keys=['content', 'meta'])
    ]

documents_to_strings

def documents_to_strings(
        documents: List[Document],
        pattern: Optional[str] = None,
        str_replace: Optional[Dict[str, str]] = None) -> List[str]

Extracts the content field of documents and returns a list of strings. Use regex in the pattern parameter to control how the documents are represented.

Example:

assert documents_to_strings(
        documents=[
            Document(content="first"),
            Document(content="second"),
            Document(content="third")
        ],
        pattern="[$idx] $content",
        str_replace={"r": "R"}
    ) == ["[1] fiRst", "[2] second", "[3] thiRd"]

Shaper

class Shaper(BaseComponent)

Shaper is a component that can invoke arbitrary, registered functions on the invocation context (query, documents, and so on) of a pipeline. It then passes the new or modified variables further down the pipeline.

Using YAML configuration, the Shaper component is initialized with functions to invoke on pipeline invocation context.

For example, in the YAML snippet below:

    components:
    - name: shaper
      type: Shaper
      params:
        func: value_to_list
        inputs:
            value: query
            target_list: documents
        outputs: [questions]

the Shaper component is initialized with a directive to invoke the function value_to_list on the variables query and documents and to store the result in the invocation context variable questions. All other invocation context variables are passed down the pipeline as they are.

You can use multiple Shaper components in a pipeline to modify the invocation context as needed.

Currently, Shaper supports the following functions:

  • rename
  • value_to_list
  • join_lists
  • join_strings
  • format_string
  • join_documents
  • join_documents_and_scores
  • format_document
  • format_answer
  • join_documents_to_string
  • strings_to_answers
  • string_to_answer
  • parse_references
  • answers_to_strings
  • strings_to_documents
  • documents_to_strings

See their descriptions in the code for details about their inputs, outputs, and other parameters.
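To make the idea of invoking a registered function on the invocation context concrete, here is a minimal sketch. The registry, the run_shaper_sketch function, and its argument order are hypothetical; it handles only string-valued inputs entries, and the real component does more than this.

```python
from typing import Any, Callable, Dict, List, Optional


def value_to_list(value: Any, target_list: List[Any]) -> List[Any]:
    return [value] * len(target_list)


# Hypothetical registry mapping function names to callables.
REGISTRY: Dict[str, Callable[..., Any]] = {"value_to_list": value_to_list}


def run_shaper_sketch(
    func: str,
    outputs: List[str],
    invocation_context: Dict[str, Any],
    inputs: Optional[Dict[str, str]] = None,
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Call a registered function and store its results in the context."""
    # Fixed params act as fallbacks; values from the context override them.
    kwargs: Dict[str, Any] = dict(params or {})
    for arg_name, context_key in (inputs or {}).items():
        if context_key in invocation_context:
            kwargs[arg_name] = invocation_context[context_key]
    result = REGISTRY[func](**kwargs)
    results = result if isinstance(result, tuple) else (result,)
    if len(results) != len(outputs):
        raise ValueError("outputs must match the number of returned values")
    updated = dict(invocation_context)  # other variables pass through unchanged
    updated.update(zip(outputs, results))
    return updated


context = {"query": "q", "documents": ["d1", "d2", "d3"]}
shaped = run_shaper_sketch(
    "value_to_list",
    outputs=["questions"],
    invocation_context=context,
    inputs={"value": "query", "target_list": "documents"},
)
print(shaped["questions"])
```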

Shaper.__init__

def __init__(func: str,
             outputs: List[str],
             inputs: Optional[Dict[str, Union[List[str], str]]] = None,
             params: Optional[Dict[str, Any]] = None,
             publish_outputs: Union[bool, List[str]] = True)

Initializes the Shaper component.

Some examples:

- name: shaper
  type: Shaper
  params:
    func: value_to_list
    inputs:
      value: query
      target_list: documents
    outputs:
      - questions

This node takes the content of query and creates a list that contains the value of query len(documents) times. This list is stored in the invocation context under the key questions.

- name: shaper
  type: Shaper
  params:
    func: join_documents
    inputs:
      documents: documents
    params:
      delimiter: ' - '
    outputs:
      - documents

This node overwrites the content of documents in the invocation context with a list containing a single Document whose content is the concatenation of all the original Documents. So if documents contained [Document("A"), Document("B"), Document("C")], this shaper overwrites it with [Document("A - B - C")]

- name: shaper
  type: Shaper
  params:
    func: join_strings
    params:
      strings: ['a', 'b', 'c']
      delimiter: ' . '
    outputs:
      - single_string

- name: shaper
  type: Shaper
  params:
    func: strings_to_documents
    inputs:
      strings: single_string
    params:
      meta:
        name: 'my_file.txt'
    outputs:
      - single_document

These two nodes, executed one after the other, first add a key in the invocation context called single_string that contains a . b . c, and then create another key called single_document that contains [Document(content="a . b . c", meta={'name': 'my_file.txt'})].

Arguments:

  • func: The function to apply.
  • inputs: Maps the function's input kwargs to the key-value pairs in the invocation context. For example, value_to_list expects the value and target_list parameters, so inputs might contain: {'value': 'query', 'target_list': 'documents'}. It doesn't need to contain all keyword args, see params.
  • params: Maps the function's input kwargs to some fixed values. For example, value_to_list expects value and target_list parameters, so params might contain {'value': 'A', 'target_list': [1, 1, 1, 1]} and the node's output is ["A", "A", "A", "A"]. It doesn't need to contain all keyword args, see inputs. You can use params to provide fallback values for arguments of run that you're not sure exist. So if you need query to exist, you can provide a fallback value in the params, which will be used only if query is not passed to this node by the pipeline.
  • outputs: The key to store the outputs in the invocation context. The length of the outputs must match the number of outputs produced by the function invoked.
  • publish_outputs: Controls whether to publish the outputs to the pipeline's output. Set it to True (the default) to publish all outputs or to False to publish none. For example, if outputs = ["documents"], the result for publish_outputs = True looks like
    {
        "invocation_context": {
            "documents": [...]
        },
        "documents": [...]
    }

For publish_outputs = False, the result looks like

    {
        "invocation_context": {
            "documents": [...]
        },
    }

If you want to have finer-grained control, pass a list of the outputs you want to publish.
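The three settings of publish_outputs can be summarized in a short sketch. The helper name and the dictionary layout mirror the result shapes shown above, but the code itself is a hypothetical illustration, not the component's implementation.

```python
from typing import Any, Dict, List, Union


def publish_sketch(
    invocation_context: Dict[str, Any],
    outputs: List[str],
    publish_outputs: Union[bool, List[str]] = True,
) -> Dict[str, Any]:
    """Build a node result, optionally copying outputs to the top level."""
    result: Dict[str, Any] = {"invocation_context": invocation_context}
    if publish_outputs is True:
        to_publish = outputs
    elif publish_outputs is False:
        to_publish = []
    else:
        # A list selects which of the node's outputs are published.
        to_publish = [name for name in outputs if name in publish_outputs]
    for name in to_publish:
        result[name] = invocation_context[name]
    return result


ctx = {"documents": ["d"], "questions": ["q"]}
node_outputs = ["documents", "questions"]
print(sorted(publish_sketch(ctx, node_outputs, publish_outputs=["documents"])))
```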