Broadly speaking, there are two categories of retrieval methods: vector-based (dense) and keyword-based (sparse).
Sparse methods, like TF-IDF and BM25, operate by looking for shared keywords between the document and the query. These methods:
- Are simple but effective.
- Don’t need to be trained.
- Work on any language.
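To make the sparse approach concrete, here is a minimal, self-contained sketch of BM25 scoring over a toy corpus. The corpus, query, and parameter defaults (`k1=1.5`, `b=0.75`) are illustrative choices, not part of any particular library:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with BM25."""
    docs = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    N = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "quantum computing uses qubits",
]
scores = bm25_scores("cat on a mat", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)  # index of top document
```

Notice that no training is involved: the score depends only on term counts, which is why these methods work out of the box on any language that can be tokenized.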
More recently, dense approaches such as Dense Passage Retrieval (DPR) have shown even better performance than their sparse counterparts. These methods use deep neural networks to embed both Documents and the query into a shared embedding space, and the top candidates are the Documents whose embeddings are nearest neighbors of the query embedding. They are:
- Powerful but computationally more expensive, especially during indexing.
- Trained using labeled datasets.
- Language specific.
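The retrieval step itself is nearest-neighbor search over embedding vectors. Producing the embeddings requires a trained encoder (for example a DPR or sentence-transformer model), so the tiny 4-dimensional vectors below are toy placeholders standing in for real ~768-dimensional ones; the search logic is what the sketch illustrates:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents nearest to the query embedding."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings; in practice these come from running each Document
# through the trained encoder at indexing time.
doc_vecs = [
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.0, 0.2, 0.9, 0.4],
]
query_vec = [0.8, 0.2, 0.1, 0.1]
nearest = top_k(query_vec, doc_vecs, k=2)
```

At production scale, exact search like this is replaced by approximate nearest-neighbor indexes, but the principle is the same.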
You can also combine dense and sparse approaches, resulting in hybrid retrieval. Read all about how to do this in our blog article.
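One common way to combine the two result lists (not necessarily the one used in the blog article) is reciprocal rank fusion, which merges rankings without needing to reconcile the two scoring scales. A minimal sketch, with hypothetical document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document ids into one.

    `k` dampens the influence of the very top ranks; 60 is a
    commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Sort document ids by their fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["doc3", "doc1", "doc7"]  # e.g. from BM25
dense_ranking = ["doc1", "doc7", "doc2"]   # e.g. from DPR
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that appear high in both lists (here `doc1`) rise to the top of the fused ranking.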
Between these two types, there are also some qualitative differences. For example, sparse methods treat text as a bag-of-words, meaning that they do not take word order or syntax into account, while the latest generation of dense methods use transformer-based encoders, which are designed to be sensitive to these factors.
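The bag-of-words limitation is easy to demonstrate: two sentences with opposite meanings but the same words produce identical sparse representations, so a sparse retriever cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as a multiset of its tokens, discarding order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
# The sentences mean different things, yet a == b here, so any scoring
# function built on these counts assigns them the same score.
```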
Also, dense methods build strong semantic representations of text, but they struggle when encountering out-of-vocabulary words such as new names. By contrast, sparse methods don’t need to learn representations of words; they only care about whether a word is present or absent in the text. As such, they handle out-of-vocabulary words without a problem.
Dense methods perform indexing by processing all the Documents through a neural network and storing the resulting vectors. This is a much more expensive operation than creating the inverted index used in sparse methods and requires significant computational power and time.
The terms dense and sparse refer to the representations that the algorithms build for each Document and query. Sparse methods characterize texts using vectors with one dimension corresponding to each word in the vocabulary. Dimensions are zero if the word is absent and non-zero if it is present. Since most Documents contain only a small subset of the full vocabulary, these vectors are considered sparse since non-zero values are few and far between.
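The sparse representation described above can be sketched in a few lines, using a tiny illustrative vocabulary (real vocabularies have tens of thousands of entries, which is why so many dimensions end up zero):

```python
# Hypothetical miniature vocabulary; one vector dimension per word.
vocabulary = ["cat", "dog", "mat", "park", "quantum", "qubit", "sat", "the"]

def sparse_vector(text):
    """One dimension per vocabulary word: the term count, zero if absent."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vec = sparse_vector("the cat sat on the mat")
# Most dimensions stay zero because the text uses few vocabulary words.
nonzero = sum(1 for x in vec if x != 0)
```

With a realistic vocabulary size, the fraction of non-zero dimensions per Document becomes tiny, which is exactly what "sparse" refers to.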
Dense methods, by contrast, pass text as input into neural network encoders and represent it as a vector of a fixed, manually chosen size (usually 768). Though individual dimensions are not mapped to any corresponding vocabulary word or linguistic feature, each dimension encodes some information about the text. There are rarely zeros in these vectors, hence their relative density.