Retriever
The Retriever performs document retrieval by sweeping through a DocumentStore and returning a set of candidate Documents that are relevant to the query. See what Retrievers are available and how to choose the best one for your use case.
In a query pipeline, the Retriever takes a query as input and checks it against the Documents contained in the DocumentStore. It scores each Document for its relevance to the query and returns the top candidates.
We have tested Haystack Retrievers on various modalities, including:
- Text
- Tables
- Images
The Retriever is tightly coupled with the DocumentStore. You must specify a DocumentStore when initializing the Retriever.
When used in combination with a Reader, the Retriever can quickly sift out irrelevant Documents, saving the Reader from doing more work than it needs to and speeding up the querying process.
In indexing pipelines, vector-based Retrievers take Documents as input, and for each Document, they calculate its embedding. This embedding is stored as part of the Document in the DocumentStore. If you're using a keyword-based Retriever in your indexing pipeline, no embeddings are calculated. The Retriever creates a keyword-based index that it uses for quickly looking Documents up.
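To make the keyword-based case more concrete, here is a minimal sketch of an inverted index, the general kind of structure that lets a keyword lookup avoid scanning every Document. It is a toy illustration only, not how any particular DocumentStore builds its index; `build_inverted_index` and the sample data are hypothetical names introduced for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample data.
docs = {
    "d1": "International climate conferences",
    "d2": "Football league results",
    "d3": "Climate policy after Paris",
}
index = build_inverted_index(docs)
climate_hits = sorted(index["climate"])
print(climate_hits)
```

Looking up a term is then a single dictionary access, which is why no embeddings are needed at indexing time for keyword-based retrieval.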
Position in a Pipeline | At the beginning of a query pipeline. After a PreProcessor and before the DocumentStore in an indexing pipeline. |
Input | In query pipelines: Query. In indexing pipelines: Document. |
Output | Documents (in both indexing and query pipelines) |
Classes | BM25Retriever, DensePassageRetriever, TableTextRetriever, FilterRetriever, EmbeddingRetriever, TfidfRetriever, MultiModalRetriever, WebRetriever, LinkContentFetcher |
Choosing the Right Retriever
If you're unsure which Retriever to use, see the sections below explaining each Retriever type. Our starting recommendations are to use an EmbeddingRetriever if you can use GPU acceleration. If you can't use GPU, we recommend the BM25Retriever.
Usage
To initialize a Retriever, pass a DocumentStore as its argument:
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store)
To run a Retriever on its own, use the retrieve() method. It returns a list of Document objects:
candidate_documents = retriever.retrieve(
query="international climate conferences",
top_k=10,
filters={"year": ["2015", "2016", "2017"]}
)
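As a rough mental model of what top_k and filters do in the call above, the following sketch keeps only Documents whose metadata value is in the allowed list and then returns the highest-scoring ones. It assumes a list-valued filter means "value is one of these"; `filter_and_rank`, the word-overlap scoring, and the sample data are hypothetical and are not Haystack code.

```python
def filter_and_rank(docs, score, filters, top_k):
    """Keep docs whose metadata matches every filter, then return the top_k by score."""
    def matches(meta):
        return all(meta.get(field) in allowed for field, allowed in filters.items())

    candidates = [doc for doc in docs if matches(doc["meta"])]
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical sample data: plain dicts stand in for Document objects.
docs = [
    {"content": "Paris climate conference", "meta": {"year": "2015"}},
    {"content": "Kyoto climate conference", "meta": {"year": "1997"}},
    {"content": "Marrakech climate conference talks", "meta": {"year": "2016"}},
    {"content": "Football results", "meta": {"year": "2017"}},
]
query_terms = {"climate", "conference"}


def overlap(doc):
    # Toy relevance score: number of query terms that appear in the content.
    return len(query_terms & set(doc["content"].lower().split()))


result = filter_and_rank(docs, overlap, {"year": ["2015", "2016", "2017"]}, top_k=2)
years = [doc["meta"]["year"] for doc in result]
```

The 1997 Document is dropped by the filter even though it matches the query terms, and the 2017 Document is dropped by top_k because it scores lowest.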
Here's an example of how to run a Retriever within a ready-made query pipeline:
from haystack.pipelines import DocumentSearchPipeline
pipeline = DocumentSearchPipeline(retriever=retriever)
result = pipeline.run(
query="international climate conferences",
params={
"Retriever": {
"top_k": 10,
"filters": {"year": ["2015", "2016", "2017"]}
}
}
)
This is how you can use a Retriever in an indexing pipeline:
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, PreProcessor, TextConverter, PDFConverter, FileTypeClassifier
from haystack.pipelines import Pipeline
document_store = InMemoryDocumentStore()
embedding_retriever = EmbeddingRetriever(document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1", model_format="sentence_transformers", top_k=20)
file_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFConverter()
preprocessor = PreProcessor(split_by="word", split_length=250, split_overlap=30, split_sentence_boundary=True, language="en")
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=file_classifier, name="FileTypeClassifier", inputs=["File"])
indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
indexing_pipeline.add_node(component=pdf_converter, name="PDFConverter", inputs=["FileTypeClassifier.output_2"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter", "PDFConverter"])
indexing_pipeline.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["PreProcessor"])
indexing_pipeline.add_node(component=document_store, name="InMemoryDocumentStore", inputs=["EmbeddingRetriever"])
DocumentStore Compatibility
Note that not all Retrievers can be paired with every DocumentStore. Here are the combinations which are supported:
Retriever | InMemory | Elasticsearch | OpenSearch | SQL | Milvus | Weaviate | Pinecone | Qdrant | FAISS |
---|---|---|---|---|---|---|---|---|---|
BM25 | Y | Y | Y | N | N | Y | N | N | N |
TF-IDF | Y | Y | Y | Y | N | Y | Y | N | N |
Embedding | Y | Y | Y | N | Y | Y | Y | Y | Y |
Multihop | Y | Y | Y | N | Y | Y | Y | Y | Y |
DPR | Y | Y | Y | N | Y | Y | Y | Y | Y |
Filter | Y | Y | Y | Y | Y | Y | Y | Y | Y |
MultiModal | Y | Y | Y | Y | Y | Y | Y | Y | Y |
WebRetriever | Y | Y | Y | Y | Y | Y | Y | Y | Y |
See Optimization for suggestions on how to choose top-k values.
Text Retrieval
BM25 (Recommended)
Use BM25 if you are looking for a retrieval method that doesn't need a neural network for indexing. BM25 is a variant of TF-IDF. It improves upon its predecessor in two main aspects:
- It saturates tf after a set number of occurrences of the given term in the document.
- It normalises by document length so that short documents are favoured over long documents if they have the same amount of word overlap with the query.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
from haystack.pipelines import ExtractiveQAPipeline
document_store = ElasticsearchDocumentStore()
...
retriever = BM25Retriever(document_store)
...
p = ExtractiveQAPipeline(reader, retriever)
For more information about the algorithm, see BM25 algorithm.
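The two improvements over TF-IDF described above can be seen in a small, self-contained sketch of BM25 scoring. This is one common textbook variant written for illustration, not the implementation used by Elasticsearch or Haystack; the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the sample documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents get a larger penalty.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Saturation: this ratio approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical sample documents, tokenized by whitespace.
docs = [
    "climate conference".split(),
    "climate conference held in a large city with many delegates".split(),
    "climate climate climate climate conference".split(),
]
scores = bm25_scores("climate".split(), docs)
```

With this data, the short first document outscores the longer second one even though both contain the query term once, and the third document, which repeats the term four times, scores well under four times the first.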
Embedding Retrieval (Recommended)
In Haystack, you have the option of using a transformer model to encode documents and queries. Haystack loads models directly from Hugging Face. If you're new to NLP, choosing the right model may be a difficult task. To make it easier, we suggest searching for a model on Hugging Face:
- Go to Hugging Face and click Models in the top menu.
- From the Tasks on the left, select Sentence Similarity and filter the models by Most Downloads. You get a list of most popular models. It's best to start with one of them.
To use a private model hosted on Hugging Face, enter your Hugging Face access token in the use_auth_token parameter. For more information about models, see Language Models.
One style of model that is suited to this kind of retrieval is Sentence Transformers. These models are trained as Siamese networks with triplet loss so that they learn to embed similar sentences close to each other in a shared embedding space.
Some models have been fine-tuned on massive information retrieval datasets and can be used to retrieve documents based on a short query (for example, multi-qa-mpnet-base-dot-v1). There are others that are more suited to semantic similarity tasks where you are trying to find the most similar documents to a given document (for example, all-mpnet-base-v2). There are even models that are multilingual (for example, paraphrase-multilingual-mpnet-base-v2). For a good overview of different models with their evaluation metrics, see Pretrained Models in the Sentence Transformers documentation.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import ExtractiveQAPipeline
document_store = ElasticsearchDocumentStore(
similarity="dot_product",
embedding_dim=768
)
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
model_format="sentence_transformers"
)
document_store.update_embeddings(retriever)
...
p = ExtractiveQAPipeline(reader, retriever)
You can also use OpenAI or Cohere embeddings with the EmbeddingRetriever. You need to provide your API key and specify an embedding_model when initializing the EmbeddingRetriever.
from haystack.nodes import EmbeddingRetriever
# OpenAI EmbeddingRetriever
retriever = EmbeddingRetriever(
document_store=document_store,
batch_size=8,
embedding_model="text-embedding-ada-002",
api_key="<your_openai_api_key_goes_here>",
max_seq_len=1536
)
# Cohere EmbeddingRetriever
retriever = EmbeddingRetriever(
document_store=document_store,
batch_size=8,
embedding_model="embed-english-v2.0",
api_key="<your_cohere_api_key_goes_here>",
max_seq_len=1024
)
When working with OpenAI embeddings, supply text-embedding-ada-002 as the embedding_model.
The OpenAI Embeddings API is subject to rate limits. However, we have added a built-in exponential back-off algorithm that saves you from needing to implement any rate-limit handling.
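For readers unfamiliar with the technique, the sketch below shows the general idea of exponential back-off: retry a failing call, doubling the wait each time. It is not Haystack's internal implementation; `call_with_backoff`, `flaky_embed`, the use of `RuntimeError` as a stand-in for a rate-limit error, and the delay values are all hypothetical.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying with exponentially growing delays when it fails."""
    for attempt in range(max_retries):
        try:
            return func()
        except RuntimeError:  # stand-in for a rate-limit error
            if attempt == max_retries - 1:
                raise
            # The delay doubles on each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Hypothetical embedding call that is rate limited on its first two attempts.
attempts = {"count": 0}


def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit exceeded")
    return [0.1, 0.2, 0.3]


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = call_with_backoff(flaky_embed, sleep=delays.append)
```

Here the call succeeds on the third attempt after waits of roughly one and two seconds.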
With Cohere, you can set embedding_model to one of the supported embed- models listed on Cohere's models documentation. Haystack currently supports Cohere models up to and including v2.0.
Multihop Embedding Retriever
MultihopEmbeddingRetriever is an extension of EmbeddingRetriever that works iteratively. In the first iteration, it retrieves a set of Documents that best match the query. It then concatenates and embeds the query and the top-ranked Documents from the first iteration to provide context for the query. Then, it uses this embedding as input in the second iteration to retrieve Documents that best match the query and its context. You can set the number of iterations you want it to go through. By default, it's set to two iterations, but you can experiment with it to find out what works best for your case. This type of retrieval is useful for fact checking, where a chain of evidence pieces leads to the final answer.
MultihopEmbeddingRetriever uses one encoder for the query and the Documents.
For more information, see Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval.
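The iterative process described above can be sketched with a toy encoder. This is a simplified illustration and not the MultihopEmbeddingRetriever implementation: the bag-of-words `embed` function stands in for a neural encoder, excluding already-retrieved Documents from later hops is an assumption of this sketch, and all function names and sample texts are hypothetical.

```python
from collections import Counter


def embed(text):
    # Toy stand-in for a neural encoder: a bag-of-words count vector.
    return Counter(text.lower().split())


def dot(u, v):
    return sum(u[t] * v[t] for t in u if t in v)


def retrieve(query_text, docs, top_k=1, exclude=()):
    """Return the top_k docs by dot product, skipping docs already found."""
    q = embed(query_text)
    candidates = [d for d in docs if d not in exclude]
    return sorted(candidates, key=lambda d: dot(q, embed(d)), reverse=True)[:top_k]


def multihop_retrieve(query, docs, num_iterations=2, top_k=1):
    found, context = [], query
    for _ in range(num_iterations):
        hits = retrieve(context, docs, top_k=top_k, exclude=found)
        found.extend(hits)
        # Concatenate the query with the retrieved text to form the next-hop query.
        context = " ".join([query] + found)
    return found


docs = [
    "Marie Curie was born in Warsaw",
    "Warsaw is the capital of Poland",
    "Paris is the capital of France",
]
result = multihop_retrieve("where was Marie Curie born", docs)
```

The second Document shares no words with the original query. It is only found in the second hop, once "Warsaw" from the first hit has been added to the query context.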
Dense Passage Retrieval
Dense Passage Retrieval is a retrieval method that calculates relevance using dense representations. Key features:
- One BERT base model to encode documents
- One BERT base model to encode queries
- Ranking of Documents done by dot product similarity between query and document embeddings
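The ranking step in the last bullet can be sketched in a few lines. The hand-written vectors below stand in for the outputs of the query and document encoders; `rank_by_dot_product` and the sample embeddings are hypothetical and only illustrate how dot product similarity orders candidates.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_emb, doc_embs, top_k=2):
    """Return the ids of the top_k documents with the highest dot product."""
    scored = sorted(
        doc_embs.items(),
        key=lambda item: dot(query_emb, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


# Hypothetical embeddings; a real setup would get these from the two encoders.
doc_embs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.2],
    "doc_c": [0.5, 0.5, 0.1],
}
query_emb = [1.0, 0.2, 0.0]
top = rank_by_dot_product(query_emb, doc_embs)
```

A vector-optimized store performs the same comparison, but with index structures that avoid scoring every stored embedding.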
Indexing using DPR is comparatively expensive in terms of required computation since all documents in the database need to be processed through the transformer. In order to keep query times low, you should store these embeddings in a vector-optimized database such as FAISS or Milvus.
In Haystack, you can download the pre-trained encoders needed to start using DPR. DPR needs two models, one for the query and one for the documents, and both must be trained on the same data. The easiest way to start is to go to Hugging Face and search for dpr. You'll get a list of DPR models sorted by Most Downloads, which means that the models at the top of the list are the most popular ones. Choose a ctx_encoder and a question_encoder model.
To use a private model hosted on Hugging Face, enter your Hugging Face access token in the use_auth_token parameter. For more information about models, see Language Models.
Tip
When using DPR, it is recommended that you use the dot product similarity function since that is how it is trained. To do so, provide similarity='dot_product' when initializing the DocumentStore, as in the code example below.
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever
from haystack.pipelines import ExtractiveQAPipeline
document_store = FAISSDocumentStore(similarity="dot_product")
...
retriever = DensePassageRetriever(
document_store=document_store,
query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)
...
pipeline = ExtractiveQAPipeline(reader, retriever)
Training DPR
Haystack supports training of your own DPR model. Check out the Training a Dense Passage Retrieval model tutorial to see how this is done.
TF-IDF
TF-IDF is a commonly used baseline for information retrieval that exploits two key intuitions:
- Documents that have more lexical overlap with the query are more likely to be relevant.
- Words that occur in fewer documents are more significant than words that occur in many documents.
Given a query, a TF-IDF score is computed for each document as follows:
score = tf * idf
Where:
- tf is how many times words in the query occur in that document.
- idf is the inverse of the fraction of documents containing the word.
In practice, both terms are usually log normalized.
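As a worked illustration of the formula, here is a minimal sketch that scores tokenized documents with log-normalized tf and idf. The exact normalization differs between libraries, so treat the choice of `1 + log(tf)` and `log(N / df)` as an assumption of this example; `tfidf_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Sum tf * idf over the query terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Log-normalized term frequency times log-normalized inverse document frequency.
            score += (1 + math.log(tf[term])) * math.log(n_docs / df[term])
        scores.append(score)
    return scores


docs = [
    "the climate conference in paris".split(),
    "the weather in paris".split(),
    "the conference on the climate and the climate agenda".split(),
]
scores = tfidf_scores("the climate".split(), docs)
```

Because "the" occurs in every document, its idf is zero and it contributes nothing. The ranking is driven entirely by "climate", which favours the third document, where the term appears twice.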
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline
document_store = InMemoryDocumentStore()
...
retriever = TfidfRetriever(document_store)
...
p = ExtractiveQAPipeline(reader, retriever)
We recommend looking at BM25 retrieval as it is an improved successor to TF-IDF.
Table Retrieval
The TableTextRetriever is designed to perform document retrieval on both text and tabular documents. It is a tri-encoder model with a separate encoder for the query, text passage, and table.
To learn more about how to use this component in Haystack, have a look at our Table Question Answering guide.
MultiModal Retrieval
Use the MultiModalRetriever to embed and search for data of different modalities such as text, table, and image. The MultiModalRetriever can handle any modality as long as there is a SentenceTransformers model that supports it. For example, you can perform the following types of search:
- Text to image
- Image similarity
- Text to table
- Table similarity
To prepare your data for the MultiModalRetriever, cast it into Document objects. For example, to iterate over a directory of images and turn them into a list of Documents, run:
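import os
from haystack import Document

# Each Document points to an image file; content_type marks it as an image.
docs = [
    Document(content=f"./examples/images/{filename}", content_type="image")
    for filename in os.listdir("./examples/images")
]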
To initialize a MultiModalRetriever that performs text to image retrieval using the [sentence-transformers/clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model, run:
from haystack.nodes.retriever.multimodal import MultiModalRetriever

retriever_mm = MultiModalRetriever(
    document_store=document_store,
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    query_type="text",
    document_embedding_models={
        "image": "sentence-transformers/clip-ViT-B-32"
    }
)
document_store.update_embeddings(retriever_mm)
To initialize a Retriever for text to table retrieval using the [deepset/all-mpnet-base-v2-table](https://huggingface.co/deepset/all-mpnet-base-v2-table) model, run:
retriever_mm = MultiModalRetriever(
    document_store=document_store,
    query_embedding_model="deepset/all-mpnet-base-v2-table",
    document_embedding_models={
        "table": "deepset/all-mpnet-base-v2-table"
    }
)
document_store.update_embeddings(retriever_mm)
For more information about the class, see MultiModalRetriever API.
WebRetriever
WebRetriever retrieves results from the Internet and converts them into Haystack Document objects. It uses the WebSearch component to do that. It can work in three modes: snippets, raw_documents, and preprocessed_documents.
In the snippets mode, WebRetriever retrieves only the snippets of the results, not the whole web pages. By a snippet, we mean the text that appears right after the page title in search results.
In the raw_documents mode, WebRetriever retrieves the whole pages, strips the HTML tags from them, and turns them into Documents.
In the preprocessed_documents mode, WebRetriever retrieves whole pages and preprocesses them, which includes:
- Cleaning the HTML tags.
- Placing the resulting raw text into Document objects.
- Splitting the Documents according to the PreProcessor settings. (You can pass PreProcessor as a WebRetriever parameter.)
To save time and resources, you can choose to store the results WebRetriever got from the web in a DocumentStore. If you do so, then during the next query, WebRetriever first checks if the documents it needs are already in the DocumentStore. It searches the internet only if they're not there. This caching works with all DocumentStores.
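The cache-first behaviour described above follows a common pattern that can be sketched as follows. This is an illustration of the pattern only: keying the cache by the exact query string is an assumption of this sketch (the text does not say how WebRetriever decides that documents are already stored), and `cached_web_retrieve`, `fake_search`, and the dictionary cache are hypothetical stand-ins.

```python
def cached_web_retrieve(query, cache, search_web):
    """Return cached results for the query, searching the web only on a miss."""
    if query in cache:
        return cache[query]
    docs = search_web(query)
    cache[query] = docs
    return docs


# Hypothetical stand-ins: a dict plays the DocumentStore, fake_search the web search.
calls = []


def fake_search(query):
    calls.append(query)
    return [f"result for {query}"]


cache = {}
first = cached_web_retrieve("climate conferences", cache, fake_search)
second = cached_web_retrieve("climate conferences", cache, fake_search)
```

The second call is served from the cache, so the search function runs only once.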
You can configure a PreProcessor to use with WebRetriever to decide how you want the Documents to be processed.
LinkContentFetcher
Note: This is a component in development. We plan on making changes to the current design and introducing new features in the near future.
LinkContentFetcher retrieves content from specified web pages and converts it into Documents. It is a standalone component that can be used in a pipeline. It currently supports only HTML page types.
If the LinkContentFetcher does not retrieve any content or the HTTP request is blocked, the search engine snippet is returned (if available).