Vector-Based vs Keyword-Based Retrievers
Learn what types of Retrievers are available in Haystack, how to choose the best one for your use case, and how to combine different approaches for hybrid retrieval.
Broadly speaking, there are two categories of retrieval methods: vector-based (dense) and keyword-based (sparse).
Sparse methods, like TF-IDF and BM25, operate by looking for keywords shared between the Document and the query (see the sketch after this list). These methods:
- Are simple but effective.
- Don’t need to be trained.
- Work on any language.
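For illustration, here is a minimal BM25 sketch, assuming a recent Haystack 1.x release in which InMemoryDocumentStore supports BM25 out of the box:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever
from haystack.schema import Document

# No training step: the inverted index is built as Documents are written.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    Document(content="Berlin is the capital of Germany."),
    Document(content="Paris is the capital of France."),
])

retriever = BM25Retriever(document_store=document_store)
results = retriever.retrieve(query="capital of Germany", top_k=1)
print(results[0].content)  # "Berlin is the capital of Germany."
```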
More recently, dense approaches such as Dense Passage Retrieval (DPR) have shown even better performance than their sparse counterparts. These methods use deep neural networks to embed both Documents and queries into a shared embedding space; the top candidates are the nearest-neighbor Documents to the query in that space (see the sketch after this list). They are:
- Powerful but computationally more expensive, especially during indexing.
- Trained using labeled datasets.
- Language specific.
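By way of contrast, here is a minimal DPR sketch under the same Haystack 1.x assumption; the two facebook/dpr-* checkpoints are the publicly released DPR encoders:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import DensePassageRetriever
from haystack.schema import Document

document_store = InMemoryDocumentStore(embedding_dim=768)
document_store.write_documents([
    Document(content="Berlin is the capital of Germany."),
    Document(content="Paris is the capital of France."),
])

# DPR uses two trained encoders: one for queries, one for passages (Documents).
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# The expensive indexing step: embed every Document and store the vectors.
document_store.update_embeddings(retriever)

results = retriever.retrieve(query="What is the capital of Germany?", top_k=1)
```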
Hybrid Retrieval
You can combine dense and sparse approaches, resulting in hybrid retrieval. Read all about how to do this in our blog article.
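As a rough sketch of what hybrid retrieval can look like in a Haystack 1.x Pipeline (reusing a sparse and a dense retriever set up against the same Document store, as in the earlier sketches, and joining their results with reciprocal rank fusion):

```python
from haystack import Pipeline
from haystack.nodes import JoinDocuments

# sparse_retriever (BM25) and dense_retriever (e.g., DPR) are assumed to be
# set up as in the sketches above.
pipeline = Pipeline()
pipeline.add_node(component=sparse_retriever, name="SparseRetriever", inputs=["Query"])
pipeline.add_node(component=dense_retriever, name="DenseRetriever", inputs=["Query"])
pipeline.add_node(
    component=JoinDocuments(join_mode="reciprocal_rank_fusion"),
    name="JoinResults",
    inputs=["SparseRetriever", "DenseRetriever"],
)

results = pipeline.run(
    query="What is the capital of Germany?",
    params={"SparseRetriever": {"top_k": 5}, "DenseRetriever": {"top_k": 5}},
)
```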
Qualitative Differences
Between these two types, there are also some qualitative differences. For example, sparse methods treat text as a bag of words, meaning that they do not take word order and syntax into account, while the latest generation of dense methods use transformer-based encoders that are designed to be sensitive to these factors.
Also, dense methods are very capable of building strong semantic representations of text, but they struggle when encountering out-of-vocabulary words such as new names. By contrast, sparse methods don't need to learn representations of words; they only care about whether a word is present or absent in the text. As such, they handle out-of-vocabulary words with no problem.
Indexing
Dense methods perform indexing by processing all the Documents through a neural network and storing the resulting vectors. This is a much more expensive operation than creating the inverted index used in sparse methods and requires significant computational power and time.
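To make the asymmetry concrete, the sketch below (continuing from the DPR example above, with its document_store and retriever) times the dense indexing step. The sparse inverted index, by contrast, is built incrementally while Documents are written and needs no such pass:

```python
import time

# Dense indexing: every Document passes through the passage encoder.
start = time.time()
document_store.update_embeddings(retriever)
print(f"Embedded {document_store.get_document_count()} Documents "
      f"in {time.time() - start:.1f} seconds")
```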
Terminology
The terms dense and sparse refer to the representations that the algorithms build for each Document and query. Sparse methods characterize texts using vectors with one dimension corresponding to each word in the vocabulary. Dimensions are zero if the word is absent and non-zero if it is present. Since most Documents contain only a small subset of the full vocabulary, these vectors are considered sparse since non-zero values are few and far between.
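A toy illustration (plain Python, not a Haystack API; the tiny vocabulary is hypothetical) of what such a sparse vector looks like:

```python
# One dimension per vocabulary word; realistic vocabularies have tens of
# thousands of entries, so almost all dimensions of any one text are zero.
vocabulary = ["berlin", "capital", "france", "germany", "is", "of", "paris", "the"]

def sparse_vector(text: str) -> list[int]:
    tokens = text.lower().replace(".", "").split()
    return [tokens.count(word) for word in vocabulary]

print(sparse_vector("Paris is the capital of France."))
# [0, 1, 1, 0, 1, 1, 1, 1]
```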
Dense methods, by contrast, pass text through neural network encoders and represent it as a vector of a fixed, predefined size (usually 768). Though individual dimensions do not map to any particular vocabulary word or linguistic feature, each dimension encodes some information about the text. These vectors rarely contain zeros, hence their relative density.
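For comparison, a minimal dense sketch using the sentence-transformers library (an assumption here; the model named below is one common 768-dimensional encoder, not the only option):

```python
from sentence_transformers import SentenceTransformer

# A transformer encoder maps text of any length to a fixed-size dense vector.
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
embedding = model.encode("Paris is the capital of France.")

print(embedding.shape)         # (768,)
print((embedding == 0).sum())  # typically 0: nearly every dimension is non-zero
```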