ExtractiveReader
Use this component in extractive question answering pipelines based on a query and a list of documents.
Most common position in a pipeline | In query pipelines, after a component that returns a list of documents, such as a Retriever |
Mandatory init variables | "token": The Hugging Face API token. Can be set with HF_API_TOKEN or HF_TOKEN env var. |
Mandatory run variables | "documents": A list of documents; "query": A query string |
Output variables | "answers": A list of ExtractedAnswer objects |
API reference | Readers |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/readers/extractive.py |
Overview
ExtractiveReader locates and extracts answers to a given query from the document text. It's used in extractive QA systems where you want to know exactly where the answer is located within the document. It's usually coupled with a Retriever that precedes it, but you can also use it with other components that fetch documents.
Readers assign a probability to each answer. This score ranges from 0 to 1 and indicates how well the answer matches the query. A probability close to 1 means the model has high confidence in the answer's relevance. The Reader sorts the answers by probability, highest first. You can limit the number of answers the Reader returns with the optional top_k parameter.
You can use the probability to set the quality expectations for your system. To do that, use the score_threshold parameter of the Reader to set a minimum probability threshold for answers. For example, setting score_threshold to 0.7 means only answers with a probability higher than 0.7 will be returned.
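The ranking, the top_k limit, and the minimum-probability cutoff described above can be pictured with a small, library-independent sketch. The Answer class, the select_answers function, and the min_score name are hypothetical stand-ins invented for illustration. They are not part of Haystack, and whether an answer scoring exactly at the cutoff is kept is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    """Hypothetical stand-in for an extracted answer (not a Haystack class)."""
    text: str
    score: float  # probability between 0 and 1


def select_answers(answers: List[Answer], top_k: int, min_score: Optional[float] = None) -> List[Answer]:
    """Sort by probability (highest first), apply the cutoff, then keep at most top_k."""
    ranked = sorted(answers, key=lambda a: a.score, reverse=True)
    if min_score is not None:
        # Assumption: only answers strictly above the cutoff survive.
        ranked = [a for a in ranked if a.score > min_score]
    return ranked[:top_k]


# Made-up candidates and scores, for illustration only.
candidates = [
    Answer("Berlin", 0.12),
    Answer("Paris", 0.93),
    Answer("France", 0.71),
    Answer("capital", 0.40),
]
selected = select_answers(candidates, top_k=3, min_score=0.7)
print([(a.text, a.score) for a in selected])
```

With a cutoff of 0.7, only two of the four candidates remain even though top_k allows three, which is why a strict threshold can return fewer answers than requested.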
By default, the Reader accounts for the case where no answer to the query is found in the document text (no_answer=True). In this case, it returns an additional ExtractedAnswer with no text and the probability that none of the top_k answers is correct. For example, if top_k=4, the system returns four answers and an additional empty one. Each answer has a probability assigned. If the empty answer has a probability of 0.5, that is the probability that none of the returned answers is correct. To receive only the actual top_k answers, set the no_answer parameter to False when initializing the component.
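One way to read the probability attached to the empty answer: if the returned answers are treated as independent, the chance that all of them are wrong is the product of their individual chances of being wrong. The sketch below only illustrates that arithmetic. The no_answer_probability function is a hypothetical helper, and the independence assumption is ours rather than a statement about the component's exact computation.

```python
import math
from typing import List


def no_answer_probability(scores: List[float]) -> float:
    """Probability that every returned answer is wrong, assuming independent answers."""
    return math.prod(1.0 - s for s in scores)


# Two answers, each with a made-up probability of 0.5 of being correct.
top_k_scores = [0.5, 0.5]
p_none = no_answer_probability(top_k_scores)
print(p_none)  # 0.25
```

Under this reading, a single highly confident answer pushes the empty answer's probability toward 0, while several weak answers leave it close to 1.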
Models
Here are the models that we recommend for use with ExtractiveReader:
Model URL | Description | Language |
---|---|---|
deepset/roberta-base-squad2-distilled (default) | A distilled model, relatively fast and with good performance. | English |
deepset/roberta-large-squad2 | A large model with good performance. Slower than the distilled one. | English |
deepset/tinyroberta-squad2 | A distilled version of the roberta-large-squad2 model, very fast. | English |
deepset/xlm-roberta-base-squad2 | A base multilingual model with good speed and performance. | Multilingual |
You can also view other question answering models on Hugging Face.
Usage
On its own
Below is an example that uses the ExtractiveReader outside of a pipeline. The Reader gets the query and the documents at runtime. It should return two answers and an additional third answer with no text and the probability that the top_k answers are incorrect.
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [Document(content="Paris is the capital of France."), Document(content="Berlin is the capital of Germany.")]

reader = ExtractiveReader()
reader.warm_up()  # load the model before the first run

result = reader.run(query="What is the capital of France?", documents=docs, top_k=2)
print(result["answers"])
In a pipeline
Below is an example of a pipeline that retrieves documents from an InMemoryDocumentStore based on keyword search (using InMemoryBM25Retriever). It then uses the ExtractiveReader to extract the answer to our query from the top retrieved documents.
With the ExtractiveReader’s top_k set to 2, an additional, third answer with no text and the probability that the other top_k answers are incorrect is also returned.
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.readers import ExtractiveReader

# Index a few documents in an in-memory store
docs = [
    Document(content="Paris is the capital of France."),
    Document(content="Berlin is the capital of Germany."),
    Document(content="Rome is the capital of Italy."),
    Document(content="Madrid is the capital of Spain."),
]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = InMemoryBM25Retriever(document_store=document_store)
reader = ExtractiveReader()
reader.warm_up()

# Retriever output feeds the Reader's documents input
extractive_qa_pipeline = Pipeline()
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")
extractive_qa_pipeline.connect("retriever.documents", "reader.documents")

query = "What is the capital of France?"
extractive_qa_pipeline.run(
    data={
        "retriever": {"query": query, "top_k": 3},
        "reader": {"query": query, "top_k": 2},
    }
)
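The pipeline returns a dictionary keyed by component name, so the Reader's answers sit under "reader" and then "answers". Below is a minimal sketch of reading them, assuming each answer exposes data (the extracted text, None for the no-answer entry) and score. The format_answers helper is hypothetical, and the SimpleNamespace objects with made-up scores only stand in for real ExtractedAnswer instances so that the snippet runs without loading a model.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def format_answers(result: Dict[str, Any], component: str = "reader") -> List[str]:
    """Turn the answers of one pipeline component into printable lines."""
    lines = []
    for answer in result[component]["answers"]:
        # The no-answer entry carries no text, so substitute a placeholder.
        text = answer.data if answer.data is not None else "(no answer)"
        lines.append(f"{answer.score:.2f}  {text}")
    return lines


# Stand-in for the dictionary a pipeline run would return (scores are invented).
fake_result = {
    "reader": {
        "answers": [
            SimpleNamespace(data="Paris", score=0.95),
            SimpleNamespace(data="Berlin", score=0.03),
            SimpleNamespace(data=None, score=0.05),
        ]
    }
}
for line in format_answers(fake_result):
    print(line)
```

In a real run, you would pass the return value of extractive_qa_pipeline.run(...) to the helper instead of fake_result.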