Transformer-Based Information Retrieval Models: A Short Survey
These notes and tables are largely based on Pretrained Transformers for Text Ranking: BERT and Beyond.
There might be mistakes in this article. Seek the original source (see above) for confirmation if something doesn't make sense. Email me if you find any mistakes so I can fix them.
In short, there are 3 different types of Transformer IR models:
- Sparse Methods
- Dense Methods
- Hybrid Methods: combine sparse and dense
We will discuss each and compare them.
1. Sparse Methods
Sparse retrieval methods work as follows:
- TF-IDF, Bag-of-Words (BoWs) exact-matching, or BM25 are used to retrieve k documents
- The k documents are then reranked using a more sophisticated model, such as a neural network
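As a rough illustration of the first stage, here is a minimal sketch using the rank_bm25 package; the toy corpus, whitespace tokenization, and the value of k are all placeholders.

# A minimal sketch of the first retrieval stage using the rank_bm25 package.
# The toy corpus, whitespace tokenization, and k are illustrative placeholders.
from rank_bm25 import BM25Okapi

corpus = [
    "transformers are effective for text ranking",
    "bm25 is a strong sparse baseline for retrieval",
    "dense retrieval uses learned embeddings",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "sparse retrieval baseline"
k = 2
first_docs = bm25.get_top_n(query.split(), corpus, n=k)  # top-k candidates
# first_docs would then be passed to a neural reranker (the second stage).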
The second step (reranking) is of special interest, and we shall explain some of the models used for it. We will use the following notation:
- Let docs = all documents
- Let first-docs = documents retrieved in the first stage by BM25 (“first” refers to first stage)
- Given a sentence $S$ consisting of words $w_1, \ldots, w_n$, $\mathrm{BERT}(S)[w_i]$ refers to the embedding created for $w_i$ when running $S$ through BERT. That is, $\mathrm{BERT}(S) = \big(\mathrm{BERT}(S)[\mathrm{CLS}],\ \mathrm{BERT}(S)[w_1], \ldots, \mathrm{BERT}(S)[w_n]\big)$, where $\mathrm{BERT}(S)[\mathrm{CLS}]$ denotes the embedding of the special CLS token.
1.1. monoBERT
We start with a pre-trained checkpoint of BERT.
Given a query q, BM25 is used to generate first-docs. Then:
For d in first-docs:
    S = [CLS] q [SEP] d [SEP]
    emb = BERT(S)[CLS]
    score = FullyConnectedLayer(emb)
That is, we put the query and each document into a template, pass it to BERT, and use the embedding BERT generates for the CLS token. CLS is a special token that BERT prepends to every input; its embedding is treated as a summary representation for classification tasks. Note that emb is a vector. Hence, we use a fully-connected layer to estimate the probability P(d is relevant to q); above, this is referred to as the score.
The fully connected layer and BERT are trained end-to-end using the cross-entropy loss. In this way we fine-tune the pre-trained BERT model we started with.
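Below is a minimal PyTorch sketch of this scoring step using the Hugging Face transformers library. The checkpoint name and the single-output classification head are illustrative; in practice BERT and the head are fine-tuned jointly on relevance labels before the scores are meaningful.

# A minimal monoBERT-style scoring sketch (illustrative, not the original code).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 1)  # the fully-connected layer

def mono_bert_score(query: str, doc: str) -> float:
    # Encodes "[CLS] query [SEP] doc [SEP]" and scores the [CLS] embedding.
    inputs = tokenizer(query, doc, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        cls_emb = bert(**inputs).last_hidden_state[:, 0, :]   # embedding of [CLS]
    return torch.sigmoid(classifier(cls_emb)).item()          # P(d is relevant to q)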
CAL Pseudocode
In the setting of CAL (Continuous Active Learning), the query is refined at each time step based on the previously identified relevant documents. So, at test time, our pseudocode is:
# refine_query can have multiple implementations; one such implementation is shown below
def refine_query(q, d):
    q = q + d                  # append the relevant document's text to the query
    return trim(q, 256)        # trim the expanded query to 256 tokens

def get_doc(q):
    best_d, best_score = None, -infinity
    for d in get_docs_logistic_regression(docs):
        S = [CLS] q [SEP] d [SEP]
        emb = BERT(S)[CLS]
        score = FullyConnectedLayer(emb)
        if score > best_score:
            best_d, best_score = d, score
    return best_d              # the document with the highest score

query = get_query_from_user()
d = get_doc(query)
while d is not None:           # stop when no document is returned
    show d to user
    if user_clicks_rel(d):
        query = refine_query(query, d)
    remove d from docs
    d = get_doc(query)
This pseudocode generalizes to the other methods discussed below by simply changing the get_doc subroutine.
1.2. duoBERT
Similar to monoBERT, but it measures the probability $p_{i,j}$ that document $d_i$ is more relevant to the query $q$ than document $d_j$. As such, the algorithm is slightly different:
For d_i, d_j in first-docs (i ≠ j):
    S = [CLS] q [SEP] d_i [SEP] d_j [SEP]
    emb = BERT(S)[CLS]
    p_i_j = FullyConnectedLayer(emb)
Then the score $s_i$ for each document $d_i$ is calculated by aggregating all of its pairwise probabilities $p_{i,j}$ ($j \neq i$), using either a sum/min/max function, e.g. $s_i = \sum_{j \neq i} p_{i,j}$, or by counting the number of occurrences where $p_{i,j} > 0.5$.
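As a small illustration, here is a sketch of this aggregation step in Python, assuming the pairwise probabilities p[i][j] = P(d_i is more relevant than d_j) have already been computed as above; the method names are just labels.

# A sketch of the duoBERT aggregation step over pairwise probabilities p[i][j].
def aggregate_scores(p, method="sum"):
    n = len(p)
    scores = []
    for i in range(n):
        others = [p[i][j] for j in range(n) if j != i]
        if method == "sum":
            scores.append(sum(others))
        elif method == "min":
            scores.append(min(others))
        elif method == "max":
            scores.append(max(others))
        elif method == "binary":
            # count pairwise "wins", i.e. occurrences where p_{i,j} > 0.5
            scores.append(sum(1 for x in others if x > 0.5))
    return scores  # documents are then ranked by descending score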
Note: For efficiency, duoBERT is usually used as a second reranker after monoBERT. That is, we usually rank with BM25, rerank with monoBERT, and then use duoBERT to rerank the top k documents from monoBERT. duoBERT usually gives a modest improvement over simply using monoBERT.
1.3. monoT5
Similar to monoBERT but T5 is used instead of BERT.
For d in first-docs:
    S = query: q doc: d relevant:
    out = T5(S)
    score = softmax(out[true], out[false])[true]   # probability assigned to the token "true"
monoT5 performs better than monoBERT and duoBERT.
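Here is a minimal sketch of monoT5-style scoring with the Hugging Face transformers library. The checkpoint name is a placeholder (in practice a T5 model fine-tuned for ranking would be loaded), and the way the "true"/"false" token ids are looked up is illustrative.

# A minimal monoT5-style scoring sketch (illustrative, not the original code).
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def mono_t5_score(query: str, doc: str) -> float:
    prompt = f"query: {query} doc: {doc} relevant:"
    inputs = tokenizer(prompt, truncation=True, max_length=512, return_tensors="pt")
    # Decode a single step and compare the logits of the tokens "true" and "false".
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tokenizer("true", add_special_tokens=False).input_ids[0]
    false_id = tokenizer("false", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()  # P("true")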

1.4. duoT5
Similar to duoBERT but uses T5 instead of BERT:
For d_i, d_j in first-docs (i ≠ j):
    S = query: q doc1: d_i doc2: d_j relevant:
    out = T5(S)
    score = softmax(out[true], out[false])[true]
Note that here "relevant" refers to d_i being more relevant to the query than d_j.
2. Dense Methods
Dense methods use representation learning. In particular, they work as follows:
- Create a dense vector embedding (i.e. a vector with mostly non-zero values) of the query and of the document. Namely, an encoder $\eta_q$ is used to embed the queries and an encoder $\eta_d$ is used to embed the documents.
- Use an efficient similarity metric (such as the inner product or cosine similarity) to measure document-query relevance. We use $\phi$ to denote this similarity metric.
For our setting, both $\eta_q$ and $\eta_d$ are transformer-based neural networks. In practice, $\eta_q$ and $\eta_d$ can be the same model.
Also, in theory $\phi$ can be any function, such as another transformer-based neural network. But this defeats the purpose of dense retrieval, since 1) neural networks are not fast to compute and 2) it would be equivalent to monoBERT.
In practice, $\phi$ is a fast-to-compute function, such as the inner product.
Hence, we compute the score for document $d$ as follows:
$$s(q, d) = \phi\big(\eta_q(q), \eta_d(d)\big)$$
Ranking is done based on this score.
It is also common to write $\eta_q = \eta_q(q)$ and $\eta_d = \eta_d(d)$, so that the score is simply $\phi(\eta_q, \eta_d)$.
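For concreteness, here is a sketch of query-time scoring once the document embeddings have been precomputed, with the inner product as $\phi$; the encoders themselves are abstracted away.

# A sketch of query-time dense scoring with precomputed document embeddings.
import numpy as np

def rank_documents(query_emb, doc_embs, k=10):
    # query_emb: (dim,) vector from eta_q; doc_embs: (num_docs, dim) matrix
    # computed offline with eta_d and stored.
    scores = doc_embs @ query_emb        # phi = inner product, one score per document
    top_k = np.argsort(-scores)[:k]      # indices of the k highest-scoring documents
    return top_k, scores[top_k]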
2.1. DPR
DPR (Dense Passage Retrieval) is one of the most well-known dense retrieval models.
DPR calculates the score as follows: $s(q, d) = \mathrm{BERT}_q(q)[\mathrm{CLS}] \cdot \mathrm{BERT}_d(d)[\mathrm{CLS}]$, i.e. the inner product of the CLS embeddings produced by a query encoder and a document encoder.
Note that the two BERT models here do NOT share weights, so they are fine-tuned separately.
The model is trained using a dataset where each datapoint consists of a query, a positive document, and $n$ negative documents. That is, each datapoint is of the form $\langle q, d^+, d^-_1, \ldots, d^-_n \rangle$. In practice, choosing the negative samples for each query has a large impact on the effectiveness of the model. Although simple methods such as BM25 can be used to pick them, they do not yield the best results. Other approaches have been proposed, such as using ANN (Approximate Nearest Neighbour) search to find non-relevant documents that are ranked highly by the current model. This creates a learning loop, so the model can learn from its mistakes.
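A minimal sketch of this training objective, i.e. a softmax/cross-entropy over the positive document and the negatives for one query, assuming the embeddings have already been produced by the two encoders:

# A sketch of a DPR-style training objective for a single query.
import torch
import torch.nn.functional as F

def dpr_loss(q_emb, pos_emb, neg_embs):
    # q_emb: (dim,), pos_emb: (dim,), neg_embs: (n_neg, dim)
    doc_embs = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (1 + n_neg, dim)
    scores = doc_embs @ q_emb                                      # inner-product scores
    target = torch.zeros(1, dtype=torch.long)                      # the positive sits at index 0
    return F.cross_entropy(scores.unsqueeze(0), target)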
2.2. ANCE
ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) calculates the score in the same way: $s(q, d) = \mathrm{RoBERTa}(q)[\mathrm{CLS}] \cdot \mathrm{RoBERTa}(d)[\mathrm{CLS}]$.
Unlike DPR, ANCE uses RoBERTa. Also, in contrast to DPR, the two RoBERTa models in ANCE share weights; that is, they are the same model.
Datapoints for training ANCE are similar to those of DPR, and the negative examples are chosen using an ANN approach, as sketched below.
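Here is a rough sketch of ANN-based hard-negative mining in this spirit, using faiss as the ANN library; the data layout and parameters are illustrative.

# A sketch of ANN-based hard-negative mining in the spirit of ANCE, using faiss.
# doc_embs and query_embs come from the current encoder checkpoint (float32 numpy
# arrays); positives[i] is the set of gold document ids for query i.
import faiss

def mine_hard_negatives(query_embs, doc_embs, positives, k=200, n_neg=8):
    index = faiss.IndexFlatIP(doc_embs.shape[1])        # (exact) inner-product index
    index.add(doc_embs)
    _, retrieved = index.search(query_embs, k)
    negatives = []
    for i, doc_ids in enumerate(retrieved):
        # High-ranking documents that are NOT labeled relevant become hard negatives.
        negatives.append([d for d in doc_ids if d not in positives[i]][:n_neg])
    return negatives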
Sparse vs. Dense
Dense retrieval methods such as the ones explained above underperform (with respect to metrics such as MRR) the multi-stage sparse pipelines discussed previously (e.g. using BM25 to get a first-stage ranking and monoBERT/monoT5 to rerank the documents). However, at test time, these dense methods are much faster, since the document embeddings can be precomputed in parallel and stored ahead of time. As such, dense methods only need to embed the query and compute the efficient similarity metric between that embedding and each of the stored document embeddings.
PreTTR
PreTTR (Precomputing Transformer Term Representations) is similar to a monoBERT model, but the query-to-document and document-to-query attention is eliminated in all layers except the last one. That is, $\eta_q$ and $\eta_d$ consist of all layers of monoBERT except the last one (these term representations can be precomputed, hence the name PreTTR). Moreover, the $\phi$ function is the last layer of the monoBERT model.
Poly-Encoders
Poly-encoders generate $m$ different encodings ("codes") of each document. At search time, the query is encoded, and an attention mechanism with respect to the query embedding is used to combine the $m$ document embeddings into a single vector. The inner product of the query embedding and this combined embedding is the score that the model assigns to this query-document pair.
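A small sketch of this scoring step, assuming the $m$ document codes are precomputed and the query has already been encoded into a single vector:

# A sketch of poly-encoder scoring: the document's m codes are combined with
# attention over the query embedding, then compared to the query by inner product.
import torch

def poly_encoder_score(query_emb, doc_codes):
    # query_emb: (dim,) query embedding; doc_codes: (m, dim) precomputed per document.
    attn = torch.softmax(doc_codes @ query_emb, dim=0)   # attention weights over the m codes
    combined = attn @ doc_codes                          # single combined document vector, (dim,)
    return torch.dot(query_emb, combined).item()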
ME-BERT
ME-BERT (Multi-Vector Encoding from BERT) with hyperparameter $m$ scores document $d$ as follows:
$$s(q, d) = \max_{1 \le i \le m} \ \mathrm{BERT}(q)[\mathrm{CLS}] \cdot \mathrm{BERT}(d)_i$$
That is, it considers the first $m$ vectors output by BERT when embedding the document: the first vector $\mathrm{BERT}(d)_1$ corresponds to the CLS token, the second corresponds to the first token in the document, and in general the $i$-th vector corresponds to the $(i-1)$-th token/word of the document. The query is embedded with BERT separately. The inner product of each of these $m$ vectors with the embedding of the query is considered, and the maximum is returned as the score. This multi-vector encoding naturally increases the effectiveness of the model compared to only using the embedding of the CLS token.
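A small sketch of this max-inner-product scoring, assuming the document's token embeddings (with the CLS vector first) and the query embedding are already available:

# A sketch of ME-BERT scoring with hyperparameter m.
import torch

def me_bert_score(query_emb, doc_token_embs, m=8):
    # doc_token_embs: (seq_len, dim) BERT outputs for the document, CLS vector first;
    # query_emb: (dim,) embedding of the query (e.g. its CLS vector).
    doc_vecs = doc_token_embs[:m]                    # keep only the first m vectors
    return torch.max(doc_vecs @ query_emb).item()    # maximum inner product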
ColBERT
Given a query $q$, ColBERT generates a matrix $E_q$ of size $|q| \times m$, where $m$ is the (reduced) embedding dimension, as follows:
- Let $S = [\mathrm{CLS}]\ q\ [\mathrm{SEP}]$
- $E = \mathrm{BERT}(S)$, the matrix of token embeddings
- $E_q = \mathrm{Linear}(E)$, where the linear layer is applied to each row of $E$. This linear layer serves as a dimension reducer.
- $E_d$ is computed similarly for each document $d$
- The score is computed as follows (a small sketch is given after this list): $s(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} E_{q_i} \cdot E_{d_j}$
- A clever nearest-neighbour indexing method is used to generate a first set of candidate documents, and the top documents are then retrieved from this smaller set
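A small sketch of the MaxSim scoring step, assuming $E_q$ and $E_d$ have already been computed as described above:

# A sketch of ColBERT's MaxSim scoring. E_q: (|q|, m) and E_d: (|d|, m) are the
# reduced-dimension token embedding matrices described above.
import torch

def colbert_score(E_q, E_d):
    sim = E_q @ E_d.T                              # (|q|, |d|) token-to-token similarities
    return sim.max(dim=1).values.sum().item()      # best document token per query token, summed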
ColBERT performs only slightly worse than monoBERT but is two orders of magnitude faster; however, it is still an order of magnitude slower than BM25.
So what's the catch? The index is orders of magnitude larger than that of BM25, and to get the speed advantage over monoBERT, the index needs to fit in memory so it can be searched efficiently.

Acknowledgements
These notes and tables are largely based on Pretrained Transformers for Text Ranking: BERT and Beyond.