
Ready-Made Pipelines

Haystack comes with a number of predefined pipelines that fit most standard search patterns, allowing you to build a QA system in no time. Here's a list of the out-of-the-box pipelines Haystack offers.
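All of the ready-made pipelines can be imported from the haystack.pipelines module (shown here for Haystack v1):

from haystack.pipelines import (
    DocumentSearchPipeline,
    ExtractiveQAPipeline,
    FAQPipeline,
)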

πŸ“˜

Pipelines as Agent tools

You can use the following ready-made pipelines as tools for your agent:

  • WebQAPipeline
  • ExtractiveQAPipeline
  • DocumentSearchPipeline
  • GenerativeQAPipeline
  • SearchSummarizationPipeline
  • FAQPipeline
  • TranslationWrapperPipeline
  • QuestionGenerationPipeline
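For example, you can wrap a ready-made pipeline in a Tool and register it with an Agent. This is a minimal sketch: the tool name, description, models, and the document_search_pipeline variable are assumptions, not fixed values.

from haystack.agents import Agent, Tool
from haystack.nodes import PromptNode

# The Agent needs its own PromptNode to decide which tool to call
agent_prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    stop_words=["Observation:"],
)
agent = Agent(prompt_node=agent_prompt_node)

# Wrap any of the pipelines listed above as a tool
agent.add_tool(
    Tool(
        name="DocumentSearch",
        pipeline_or_node=document_search_pipeline,
        description="Useful for finding documents that are relevant to a question",
        output_variable="documents",
    )
)

result = agent.run("What is Hagrid's dog's name?")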

ExtractiveQAPipeline

Extractive QA is the task of finding the answer to a question in a set of documents by selecting a segment of text. The ExtractiveQAPipeline combines the Retriever and the Reader such that:

  • The Retriever combs through a database and returns only the Documents it thinks are the most relevant to the query.
  • The Reader accepts the Documents the Retriever returns and selects a text span as the answer to the query.
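If you don't have a Reader and a Retriever yet, a minimal sketch could look like this, assuming an existing document_store that supports BM25 (for example, InMemoryDocumentStore or ElasticsearchDocumentStore):

from haystack.nodes import BM25Retriever, FARMReader

retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

You can then pass both components into the pipeline's constructor:
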
pipeline = ExtractiveQAPipeline(reader, retriever)

query = "What is Hagrid's dog's name?"
result = pipeline.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}})

The Underlying Pipeline Structure

Here's what the pipeline looks like under the hood:

pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

result = pipeline.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}})

The output of the pipeline is a Python dictionary with a list of Answer objects stored under the answers key. These provide additional information such as the context from which the Answer was extracted and the model’s confidence in the accuracy of the extracted Answer.

You can use the print_answers() function to cleanly print the output of the pipeline.

from haystack.utils import print_answers

print_answers(result, details="all", max_text_len=100)
[   <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9946763813495636, 'context': "s Nymeria after a legendary warrior queen. She travels...", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'King Robert', 'type': 'extractive', 'score': 0.9251320660114288, 'context': 'ordered by the Lord of Light. Melisandre later reveals to Gendry that...', 'offsets_in_document': [{'start': 1808, 'end': 1819}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.8103329539299011, 'context': " girl disguised as a boy all along and is surprised to learn she is Arya...", 'offsets_in_document': [{'start': 920, 'end': 923}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_id': '7b67b0e27571c2b2025a34b4db18ad49', 'meta': {'name': '349_List_of_Game_of_Thrones_characters.txt'}}>,
    ...
]

Another option is to convert the Answers to dictionaries before printing.

[x.to_dict() for x in result["answers"]]

>>> [{'answer': 'Eddard',
  'context': 's Nymeria after a legendary warrior queen. She travels with her '
             "father, Eddard, to King's Landing when he is made Hand of the "
             'King. Before she leaves,',
  'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f',
  'meta': {'name': '43_Arya_Stark.txt'},
  'offsets_in_context': [{'end': 78, 'start': 72}],
  'offsets_in_document': [{'end': 153, 'start': 147}],
  'score': 0.9946763813495636,
  'type': 'extractive'},
  ...]

To learn how to use this pipeline, check out our tutorials Build Your First QA System or Build a QA System Without Elasticsearch.

DocumentSearchPipeline

We typically pass the output of the Retriever to another component, such as the Reader or the Generator. However, we can also use the Retriever on its own for semantic document search, to find the Documents most relevant to our query.

DocumentSearchPipeline wraps the Retriever into a pipeline. Note that this wrapper does not endow the Retrievers with additional functionality but instead allows them to be used consistently with other Haystack pipeline objects and with the same familiar syntax. To create this pipeline, pass the Retriever into the pipeline’s constructor:

pipeline = DocumentSearchPipeline(retriever)
query = "Tell me something about that time when they play chess."
result = pipeline.run(query, params={"Retriever": {"top_k": 2}})

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])

result = pipeline.run(query, params={"Retriever": {"top_k": 2}})

The pipeline returns a list of Document objects under the documents key.

You can use the print_documents() function to cleanly print the output of the pipeline.

from haystack.utils import print_documents

print_documents(result, max_text_len=100, print_name=True, print_meta=True)
Query: Arya Stark father

{   'content': '\n'
               '===On the Kingsroad===\n'
               'City Watchmen search the caravan for Gendry but are turned '
               'away by Yoren. Ge...',
    'meta': {'name': '224_The_Night_Lands.txt'},
    'name': '224_The_Night_Lands.txt'}
...

Another option is to convert the Documents to dictionaries before printing.

[x.to_dict() for x in result["documents"]]
>>> [{'content': '\n'
             '===On the Kingsroad===\n'
             'City Watchmen search the caravan for Gendry but are turned away '
             'by Yoren. Gendry tells Arya Stark that he knows she is a girl, '
             'and she reveals she is actually Arya Stark after learning that '
             'her father met Gendry before he was executed.',
  'content_type': 'text',
  'embedding': None,
  'id': 'a4d2cc51d351b785c6effddd3345bb39',
  'meta': {'name': '224_The_Night_Lands.txt'},
  'score': 0.7827358902378247},
  ...]

GenerativeQAPipeline

Unlike extractive QA, which produces an answer by extracting a text span from a collection of passages, generative QA works by producing free text answers that need not correspond to a span of any Document. Because the Answers are not constrained by text spans, the Generator is able to create Answers that are more appropriately worded compared to those extracted by the Reader. Therefore, it makes sense to employ a generative QA system if you expect answers to extend over multiple text spans, or if you expect answers to not be contained verbatim in the documents.

GenerativeQAPipeline combines the Retriever with the PromptNode. To create an Answer, the PromptNode uses the internal factual knowledge stored in the language model’s parameters and the external knowledge provided by the Retriever’s output.
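The prompt_node used below could be set up like this. This is a minimal sketch: the model and prompt template are assumptions, and any supported generative model works here.

from haystack.nodes import PromptNode

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template="deepset/question-answering",
)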

You can build a GenerativeQAPipeline by simply placing the individual components inside the pipeline’s constructor:

pipeline = GenerativeQAPipeline(prompt_node=prompt_node, retriever=retriever)

result = pipeline.run(query="Who opened the Chamber of Secrets?", params={"Retriever": {"top_k": 10}, "generator": {"top_k": 1}})

The Underlying Pipeline Structure

Here's what the pipeline looks like under the hood:

pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

result = pipeline.run(query="Who opened the Chamber of Secrets?", params={"Retriever": {"top_k": 10}, "PromptNode": {"top_k": 1}})

The output of the pipeline is a Python dictionary with a list of dictionaries stored under the answers key. These provide additional information, such as the context the Answer was generated from and the model’s confidence in the accuracy of the generated Answer.

You can use the print_answers() function to cleanly print the output of the pipeline.

from haystack.utils import print_answers

print_answers(result, details="all", max_text_len=100)
{
    'answer': ' Cersei lannister',
    'query': "Who is Tyrion's sister?",
    'meta': {   'content': [   '\n'
                               '==Lyrics==\n'
                               'The title of the song is a line spoken by '
                               'the character Cersei Lannister in the HBO '
                               '...'],
                'doc_ids': [   '3280fffdf5e01837a118d0b8b12130d0',
                               '71a783f2734f7e88ed548076e4889bb7',
                               '71a783f2734f7e88ed548076e4889bb7'],
                'doc_scores': [   0.6617550197363464,
                                  0.6361380356314655,
                                  0.6007305510447117],
                'titles': [   '401_Power_Is_Power.txt',
                              '145_Elio_M._GarcΓ­a_Jr._and_Linda_Antonsson.txt',
                              '145_Elio_M._GarcΓ­a_Jr._and_Linda_Antonsson.txt']}
}

For more examples of using GenerativeQAPipeline, check out our tutorials where we implement generative QA systems with RAG and LFQA.

WebQAPipeline

This is another generative question answering pipeline. It differs from GenerativeQAPipeline in that it answers questions based on Documents retrieved from a web search engine.

This pipeline combines the WebRetriever, which retrieves Documents from the web, with the PromptNode, which generates the answer based on the web documents, and a Shaper, which makes sure the PromptNode can ingest the documents.

The TopPSampler in the pipeline calculates the similarity score for documents it receives from the WebRetriever and adds this score to each document's metadata. The PromptNode then uses both the contents and the scores of the documents to generate an answer.
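Here's a minimal usage sketch. The search provider, API keys, and prompt template are assumptions; adapt them to your setup.

from haystack.nodes import PromptNode, WebRetriever
from haystack.pipelines import WebQAPipeline

# WebRetriever needs an API key for a web search provider (e.g. SerperDev)
web_retriever = WebRetriever(api_key="YOUR_SEARCH_API_KEY", top_search_results=5)
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template="deepset/question-answering-with-document-scores",
)

pipeline = WebQAPipeline(retriever=web_retriever, prompt_node=prompt_node)
result = pipeline.run(query="Why do airplanes leave contrails in the sky?")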

The Underlying Pipeline Structure

Here's what this pipeline's code looks like:


pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=sampler, name="Sampler", inputs=["Retriever"])
pipeline.add_node(component=shaper, name="Shaper", inputs=["Sampler"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Shaper"])
pipeline.metrics_filter = {"Retriever": ["recall_single_hit"]}

result = pipeline.run(query="Why do airplanes leave contrails in the sky?", params={"Retriever": {"top_k": 3}, "Sampler": {"top_p": 0.8}})

SearchSummarizationPipeline

The Summarizer helps make sense of the Retriever’s output by creating a summary of the retrieved documents. This is useful for performing a quick sanity check and confirming the quality of candidate documents suggested by the Retriever, without having to inspect each document individually. Depending on whether you set generate_single_summary to True or False, the output is either a single summary of all documents or one summary per document.
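If you don't have a Summarizer yet, a minimal sketch could look like this; the Pegasus model is only an assumption, and any summarization model works:

from haystack.nodes import TransformersSummarizer

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")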

SearchSummarizationPipeline combines the Retriever with the Summarizer. Below is an example implementation.

pipeline = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever, generate_single_summary=True)

result = pipeline.run(query="Describe Luna Lovegood.", params={"Retriever": {"top_k": 5}})

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=document_merger, name="Document Merger", inputs=["Retriever"])
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Document Merger"])

result = pipeline.run(query="Describe Luna Lovegood.", params={"Retriever": {"top_k": 5}})

Under the documents key of the output, there's a list of Document objects. See DocumentSearchPipeline for how to best print the output.

result['documents']
>>> [{'text': "Luna Lovegood is the only known member of the Lovegood family whose first name is not of Greek origin, rather it is of Latin origin. Her nickname, 'Loony,' refers to the moon and its ties with insanity, as it is short for 'lunatic' she is the goddess of the moon, hunting, the wilderness and the gift of taming wild animals.",
...}]

TextIndexingPipeline

To use text files in a Haystack pipeline, convert them to the Document type and index them into a DocumentStore. TextIndexingPipeline takes your text files as input, preprocesses them, and writes them into the DocumentStore you specify.

TextIndexingPipeline combines the TextConverter with the PreProcessor and the DocumentStore.
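A minimal sketch of these components; the preprocessing settings here are assumptions, not required values:

from haystack.nodes import PreProcessor, TextConverter

text_converter = TextConverter()
preprocessor = PreProcessor(split_by="word", split_length=200, split_overlap=0)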

pipeline = TextIndexingPipeline(document_store, text_converter, preprocessor)

result = pipeline.run(file_path="my_text_file.txt")

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

result = pipeline.run(file_path="my_text_file.txt")

TranslationWrapperPipeline

Translator components bring the power of machine translation into your QA systems. Say your knowledge base is in English but the majority of your user base speaks German. With a TranslationWrapperPipeline, you can chain together:

  • The Translator, which translates a query from the source language into a target language (for example, German into English).
  • A search pipeline, such as ExtractiveQAPipeline or DocumentSearchPipeline, which executes the translated query against a knowledge base.
  • Another Translator, which translates the search pipeline's results from the target language back into the source language (for example, English into German).
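The de_en_translator and en_de_translator used below could be created like this, assuming the Helsinki-NLP OPUS-MT models:

from haystack.nodes import TransformersTranslator

de_en_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-de-en")
en_de_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")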

After wrapping your search pipeline between two translation nodes, you can query it like you normally would, that is, by calling the run() method with a query in the desired language. Here’s an example implementation:

pipeline = TranslationWrapperPipeline(input_translator=de_en_translator,
                                      output_translator=en_de_translator,
                                      pipeline=extractive_qa_pipeline)

query = "Was lΓ€sst den dreikΓΆpfigen Hund weiterschlafen?" # What keeps the three-headed dog asleep?

result = pipeline.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}})

For information on the output of this pipeline, refer to the documentation of the pipeline being wrapped.

FAQPipeline

FAQPipeline wraps the Retriever into a pipeline and allows it to be used for question answering with FAQ data. Compared to other types of question answering, FAQ-style QA is significantly faster. However, it can only answer FAQ-type questions because this type of QA matches queries against questions that already exist in your FAQ documents.

For this task, we recommend using the EmbeddingRetriever with a sentence similarity model, such as sentence-transformers/all-MiniLM-L6-v2.
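As a minimal sketch, such a Retriever could be set up like this, assuming an existing document_store whose Documents hold the FAQ questions in their content field:

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    scale_score=False,  # keep raw similarity scores for FAQ matching
)
document_store.update_embeddings(retriever)

Here’s an example of a FAQPipeline in action: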

pipeline = FAQPipeline(retriever=retriever)
query = "How to reduce stigma around Covid-19?"
result = pipeline.run(query=query, params={"Retriever": {"top_k": 1}})

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=Docs2Answers(), name="Docs2Answers", inputs=["Retriever"])

result = pipeline.run(query=query, params={"Retriever": {"top_k": 1}})

The pipeline output is a list of Answer objects under the answers key. You can see the Document objects from which the pipeline extracted the answers by looking at the documents key of the pipeline output.

result["answer"]

result["documents"]

Check out the tutorial Utilizing existing FAQs for Question Answering for more information on FAQPipeline.

QuestionGenerationPipeline

The most basic version of a question generator pipeline takes a document as input and outputs generated questions that the Document can answer.

text1 = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."

question_generation_pipeline = QuestionGenerationPipeline(question_generator)
result = question_generation_pipeline.run(documents=[document])

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=question_generator, name="QuestionGenerator", inputs=["Query"])

result = pipeline.run(documents=[document])

You can access the generated questions as follows.

result["generated_questions"]["questions"]

Output:

[' Who created Python?',
 ' When was Python first released?',
 " What is Python's design philosophy?"]

QuestionAnswerGenerationPipeline

This pipeline takes a Document as input, generates questions on it, and attempts to answer these questions using a Reader model.

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
result = qag_pipeline.run(documents=[document])
print(result)

The Underlying Pipeline Structure

Here's the pipeline under the hood:

pipeline = Pipeline()

pipeline.add_node(component=question_generator, name="QuestionGenerator", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["QuestionGenerator"])

result = pipeline.run(documents=[document])

Output:

{
 ...
 'query_doc_list': [{'docs': [{'text': "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first ...", ...}],
                     'queries': ' Who created Python?'},
                    ...],
 'results': [{'answers': [<Answer: answer='Guido van Rossum', score=0.9950061142444611, context='eted, high-level, general-purpose programming lang...'>, ...],
              'no_ans_gap': 15.335145950317383,
              'query': ' Who created Python?'},
              ...
             ],
 ...
 }

MostSimilarDocumentsPipeline

This pipeline is used to find the most similar documents to a given document in your DocumentStore.

You will need to first make sure that your indexed documents have attached embeddings. You can generate and store their embeddings using the DocumentStore.update_embeddings() method.
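For example, with an EmbeddingRetriever (or any other embedding-based Retriever) already set up:

document_store.update_embeddings(retriever=retriever)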

from haystack.pipelines import MostSimilarDocumentsPipeline

msd_pipeline = MostSimilarDocumentsPipeline(document_store)
result = msd_pipeline.run(document_ids=[doc_id1, doc_id2, ...])

The output will be a list of Document objects.