
Transformer-Based Information Retrieval Models: A Short Survey

These notes and tables are largely based on Pretrained Transformers for Text Ranking: BERT and Beyond.

There might be mistakes in this article; consult the original source (see above) if something doesn't make sense. Email me if you find any mistakes so I can fix them.

In short, there are 3 different types of Transformer IR models:

  1. Sparse Methods
  1. Dense Methods
  1. Hybrid Methods: combine sparse and dense methods

We will discuss each and compare them.

1. Sparse Methods

Sparse retrieval methods work as follows:

  1. TF-IDF, Bag-of-Words (BoW) exact matching, or BM25 is used to retrieve k documents
  1. The k documents are then reranked using a more sophisticated model, such as a neural network

The second step (reranking) is of special interest. We shall now explain some of the models that are used for this purpose.

1.1. monoBERT

We start with a pre-trained checkpoint of BERT.

Given a query q, BM25 is used to generate first-docs (the first-stage list of candidate documents). Then:

For d in first-docs:
	S = [CLS] q [SEP] d [SEP]
	emb = BERT(S)[CLS]
	score = FullyConnectedLayer(emb)

That is, we put the query and each document into a template, pass it to BERT, and use the embedding BERT generates for the [CLS] token. [CLS] is a special token that BERT uses to perform classification tasks. Note that emb is a vector, so we use a fully-connected layer on top of it to estimate the probability P(d is relevant to q); this is the score referred to above.

The fully connected layer and BERT are trained end-to-end using the cross-entropy loss. In this way we fine-tune the pre-trained BERT model we started with.

🖋️
Note: BERT has a maximum input length of 512 tokens, so if the concatenated input is longer than that, we have to shorten the document (e.g. by truncating it).
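
To make this concrete, here is a minimal sketch of monoBERT-style scoring with the HuggingFace transformers library. The class name MonoBERTScorer, the single-score output head, and the bert-base-uncased checkpoint are my illustrative assumptions, not the exact setup from the book (which trains a classification head with cross-entropy on top of BERT).

# Sketch of monoBERT-style scoring. Assumes: pip install torch transformers
import torch
from transformers import BertTokenizer, BertModel

class MonoBERTScorer(torch.nn.Module):
	def __init__(self, name="bert-base-uncased"):
		super().__init__()
		self.tokenizer = BertTokenizer.from_pretrained(name)
		self.bert = BertModel.from_pretrained(name)
		# fully-connected layer on top of the [CLS] embedding -> relevance score
		self.fc = torch.nn.Linear(self.bert.config.hidden_size, 1)

	def forward(self, query, doc):
		# builds [CLS] query [SEP] doc [SEP] and truncates the input to 512 tokens
		enc = self.tokenizer(query, doc, truncation=True, max_length=512, return_tensors="pt")
		cls_emb = self.bert(**enc).last_hidden_state[:, 0, :]  # embedding of the [CLS] token
		return self.fc(cls_emb).squeeze(-1)                    # relevance score for (query, doc)

scorer = MonoBERTScorer()
print(scorer("what is dense retrieval", "Dense retrieval embeds queries and documents as vectors."))

With an off-the-shelf checkpoint these scores are meaningless; the fully-connected layer and BERT would first be fine-tuned end-to-end as described above.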

CAL Pseudocode

In the setting of CAL (Continuous Active Learning), the query is updated at each timestep based on the previously identified relevant documents. So, at test time, our pseudocode becomes:

# refine_query can have multiple implementations; one such implementation is included below
def refine_query(q, d):
	q = q + d              # append the relevant document to the query
	return trim(q, 256)    # keep the query within 256 tokens


def get_doc(q):
	best_d, best_score = None, -infinity
	for d in get_docs_logistic_regression(docs):
		S = [CLS] q [SEP] d [SEP]
		emb = BERT(S)[CLS]
		score = FullyConnectedLayer(emb)
		if score > best_score:
			best_d, best_score = d, score
	return best_d          # None if no candidate documents are left

q = get_query_from_user()
d = get_doc(q)
while d is not None:
	show d to user
	if user_clicks_rel(d):
		q = refine_query(q, d)
	remove d from docs     # do not show the same document again
	d = get_doc(q)

This pseudo-code generalizes to other methods discussed below by simply changing the get_doc subroutine.

1.2. duoBERT

Similar to monoBERT, but it estimates the probability of document $d_i$ being more relevant to $q$ than $d_j$. As such, the algorithm is slightly different:

For each pair d_i, d_j in first-docs:
	S = [CLS] q [SEP] d_i [SEP] d_j [SEP]
	emb = BERT(S)[CLS]
	p_i_j  = FullyConnectedLayer(emb)

Then the score for each document $d_i$ is calculated based on all $p_{i,j}$ using either a sum/min/max function, e.g. $s_i = \max_{j}(p_{i,j})$, or by counting the number of occurrences where $p_{i,j} > 0.5$.

Note: For efficiency purposes, duoBERT is usually used as a second reranker after monoBERT. That is, we typically rank with BM25, rerank with monoBERT, and then use duoBERT to rerank the top k documents from monoBERT. duoBERT usually gives a modest improvement over using monoBERT alone.
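
As a small illustration of the aggregation step (not code from the book), suppose p is a dictionary mapping a pair (i, j) to the probability p_{i,j} produced above; the per-document scores could then be computed as follows.

# Sketch: aggregate duoBERT pairwise probabilities p[(i, j)] = P(d_i is more relevant than d_j)
def aggregate(p, doc_ids, method="sum"):
	scores = {}
	for i in doc_ids:
		probs = [p[(i, j)] for j in doc_ids if j != i]
		if method == "sum":
			scores[i] = sum(probs)
		elif method == "max":
			scores[i] = max(probs)
		elif method == "min":
			scores[i] = min(probs)
		else:  # "binary": count the pairs that d_i "wins", i.e. p_{i,j} > 0.5
			scores[i] = sum(pr > 0.5 for pr in probs)
	return scores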

1.3. monoT5

Similar to monoBERT but T5 is used instead of BERT.

For d in first-docs:
	S = query: q doc: d relevant:
	out = T5(S)
	score = softmax(out[true], out[false])[true]   # probability of the token "true"

monoT5 performs better than monoBERT and duoBERT.
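
For concreteness, here is a sketch of how the "true"/"false" comparison can be computed with the HuggingFace T5 implementation. The vanilla t5-base checkpoint, the lowercase prompt, and the way the first decoder step's logits are read out are my assumptions; in practice one would load a fine-tuned monoT5 checkpoint.

# Sketch of monoT5-style scoring. Assumes: pip install torch transformers sentencepiece
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# token ids of the words "true" and "false" (no special tokens added)
TRUE_ID = tokenizer("true", add_special_tokens=False).input_ids[0]
FALSE_ID = tokenizer("false", add_special_tokens=False).input_ids[0]

def mono_t5_score(query, doc):
	enc = tokenizer(f"query: {query} doc: {doc} relevant:",
	                truncation=True, max_length=512, return_tensors="pt")
	start = torch.tensor([[model.config.decoder_start_token_id]])
	with torch.no_grad():
		logits = model(**enc, decoder_input_ids=start).logits[0, 0]  # logits of the first output token
	probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)        # compare only "true" vs. "false"
	return probs[0].item()                                           # P("true") = estimated relevance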

1.4. duoT5

Similar to duoBERT but uses T5 instead of BERT:

For each pair d_i, d_j in first-docs:
	S = query: q doc1: d_i doc2: d_j relevant:
	out = T5(S)
	score = softmax(out[true], out[false])[true]   # probability of the token "true"

Note that here "relevant" refers to $d_i$ being more relevant than $d_j$.

2. Dense Methods

Dense methods use representation learning. In particular, they work as follows:

  1. Create a dense vector embedding (i.e. a vector with mostly non-zero values) of the query and of the document. Namely, $\eta_q$ is used to embed queries and $\eta_d$ is used to embed documents.
  1. Use an efficient similarity metric (such as the inner product or cosine similarity) to measure document-query relevance. We use $\phi$ to denote this similarity metric.

For our setting, both $\eta_q$ and $\eta_d$ are transformer-based neural networks. In practice, $\eta_q$ and $\eta_d$ can be the same model.

Also, in theory $\phi$ can be any function, such as another transformer-based neural network. But this defeats the purpose of dense retrieval, since (1) neural networks are not fast to compute and (2) it would essentially be equivalent to monoBERT.

In practice, $\phi$ is a fast-to-compute function, such as the inner product.

Hence, we compute the score for document $d_i$ as follows: $s_i = \phi(\eta_q(q), \eta_d(d_i)) = \langle \eta_q(q), \eta_d(d_i) \rangle$

Ranking is done based on this score.

It is also common to write $\eta_i = \eta_d(d_i)$ and $\eta_q = \eta_q(q)$.
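
The whole dense-retrieval framework fits in a few lines; in the sketch below, eta_q and eta_d are stand-ins for whatever encoders are used (returning numpy vectors), and the inner product plays the role of $\phi$.

# Sketch of dense retrieval: s_i = <eta_q(q), eta_d(d_i)>, ranked by score
import numpy as np

def rank(query, docs, eta_q, eta_d, top_k=10):
	doc_embs = np.stack([eta_d(d) for d in docs])  # (N, D); can be precomputed offline
	q_emb = eta_q(query)                           # (D,); computed at query time
	scores = doc_embs @ q_emb                      # inner products, i.e. phi
	order = np.argsort(-scores)[:top_k]
	return [(docs[i], float(scores[i])) for i in order]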

2.1. DPR

DPR (Dense Passage Retrieval) is one of the most well-known dense retrieval models.

DPR calculates the score as follows: $s_i = \langle BERT(d_i)[CLS],\, BERT(q)[CLS] \rangle$

Note that the two BERT models here do NOT share weights, so they are fine-tuned separately.

The model is trained using a dataset where each datapoint consists of a query, a positive document, and $n$ negative documents. That is, each datapoint has the form $[q_i, d_i^+, d_{i,1}^-, \ldots, d_{i,n}^-]$. In practice, choosing the negative samples for each query has a large impact on the effectiveness of the model. Although simple methods such as BM25 can be used to select negatives, they do not yield the best results. Other approaches have been proposed, such as using ANN (Approximate Nearest Neighbour) search to find non-relevant documents that are ranked high by the model; this creates a learning loop so the model can learn from its own mistakes.
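
A common way to train a bi-encoder on datapoints of this form is a softmax (negative log-likelihood) loss over the positive and the $n$ negatives. The sketch below shows one such formulation in PyTorch; treat it as an illustration rather than the exact DPR training code.

# Sketch: loss for one query with one positive and n negative document embeddings
import torch
import torch.nn.functional as F

def bi_encoder_loss(q_emb, pos_emb, neg_embs):
	# q_emb: (D,), pos_emb: (D,), neg_embs: (n, D)
	cands = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)  # (n + 1, D), positive first
	scores = cands @ q_emb                                      # inner-product scores
	target = torch.tensor([0])                                  # index 0 is the positive document
	return F.cross_entropy(scores.unsqueeze(0), target)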

2.2. ANCE

ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) is another well-known dense retrieval model.

ANCE calculates the score as follows: $s_i = \langle RoBERTa(d_i)[CLS],\, RoBERTa(q)[CLS] \rangle$

Unlike DPR, ANCE uses RoBERTa instead of BERT. Also, in contrast to DPR, the two RoBERTa models in ANCE share weights; that is, they are the same model.

The datapoints used to train ANCE are similar to those of DPR, and the negative examples are chosen using an ANN approach.

🖋️
DPR and ANCE are bi-encoder models since they use two encoders, one to encode the query and one to encode the document (technically, ANCE uses only one encoder, but when encoding the query the model is not aware of the document, and vice versa).
🖋️
Models such as monoBERT are called cross-encoders since the query and document interact with each other while they are being encoded; that is, the encoder takes both the query and the document as input to generate an encoding (recall: [CLS] q [SEP] d [SEP]).

Sparse vs. Dense

Dense retrieval methods, such as the ones explained above, underperform (with respect to metrics such as MRR) compared to the multi-stage sparse methods discussed previously (e.g. using BM25 to get a first-stage ranking and then reranking the documents with monoBERT/monoT5). However, at query time, dense methods are much faster, since the document embeddings can be precomputed in parallel and stored ahead of time. As such, dense methods only need to embed the query and compute the cheap similarity metric between the query embedding and each of the stored document embeddings.
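
In practice, the precomputed document embeddings are stored in a vector index and searched with (approximate) nearest-neighbour methods. The sketch below uses FAISS with an exact inner-product index; the exact setup is my own illustration.

# Sketch: store document embeddings and search them at query time.
# Assumes: pip install faiss-cpu numpy; doc_embs is an (N, D) numpy array.
import faiss
import numpy as np

def build_index(doc_embs):
	index = faiss.IndexFlatIP(doc_embs.shape[1])   # exact inner-product search
	index.add(doc_embs.astype("float32"))
	return index

def search(index, q_emb, k=10):
	scores, ids = index.search(q_emb.astype("float32").reshape(1, -1), k)
	return list(zip(ids[0].tolist(), scores[0].tolist()))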

PreTTR

PreTTR (Precomputing Transformer Term Representations) is similar to a monoBERT model, but the query-to-document and document-to-query attention is eliminated in all layers except the last one. That is, $\eta_d$ consists of all layers of monoBERT except the last one (these can be precomputed, hence the name PreTTR). Moreover, the $\phi$ function is the last layer of the monoBERT model.

Poly-Encoders

Poly-encoders generate $m$ different encodings of each document. At search time, the query is encoded, and an attention mechanism with respect to the query embedding is used to combine the $m$ document embeddings into a single vector. The inner product of the query embedding and this combined embedding is the score that the model assigns to this query-document pair.
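
A sketch of this scoring step (my own simplification, assuming the $m$ document vectors are precomputed and the query is a single vector):

# Sketch: poly-encoder scoring from m document vectors and one query vector
import torch

def poly_encoder_score(doc_vecs, q_vec):
	# doc_vecs: (m, D) encodings of the document, q_vec: (D,) encoding of the query
	attn = torch.softmax(doc_vecs @ q_vec, dim=0)  # attention weights w.r.t. the query
	combined = attn @ doc_vecs                     # (D,) query-dependent document vector
	return torch.dot(combined, q_vec).item()       # final relevance score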

ME-BERT

ME-BERT (Multi Vector Encoding from BERT).

ME-BERT with hyperparameter $m$ scores document $d = [CLS, w_1, \ldots, w_t]$ as follows: $\text{score} = \max_{0 \le j < m} \langle BERT(d)[j],\, BERT(q)[CLS] \rangle$.

It considers the first $m$ vectors output by BERT when embedding the document. The first vector corresponds to the [CLS] token, the second corresponds to the first token in the document, and in general the $i$-th vector corresponds to the $(i-1)$-th token/word of the document. The query is also embedded with BERT separately. Then the inner product of each of these $m$ vectors with the embedding of the query is computed, and the maximum is returned as the score. This multi-vector encoding increases the effectiveness of the model compared to only using the embedding of the [CLS] token.
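
In code, the scoring step is just a maximum over the first $m$ document vectors (a sketch, assuming the query is represented by its [CLS] vector):

# Sketch: ME-BERT scoring with the first m output vectors of the document
import torch

def me_bert_score(doc_token_embs, q_emb, m):
	# doc_token_embs: (t + 1, D) BERT outputs for [CLS], w_1, ..., w_t; q_emb: (D,)
	return torch.max(doc_token_embs[:m] @ q_emb).item()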

ColBERT

Given $d = [w_1, \ldots, w_t]$, $\eta(d)$ generates a $t \times D$ matrix, where $D$ is the embedding dimension: each row is the contextualized BERT embedding of one document token. The query is encoded into a matrix of per-token embeddings in the same way, and the score is computed with a "late interaction": for each query token embedding, take the maximum inner product with the document token embeddings, and sum these maxima over the query tokens.
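
This late-interaction ("MaxSim") score can be sketched as follows, with one embedding per token for both the query and the document:

# Sketch: ColBERT-style late interaction (sum over query tokens of the max similarity)
import torch

def colbert_score(q_embs, d_embs):
	# q_embs: (|q|, D) query token embeddings, d_embs: (t, D) document token embeddings
	sim = q_embs @ d_embs.T                  # (|q|, t) token-level similarities
	return sim.max(dim=1).values.sum().item()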

ColBERT performs only slightly worse than monoBERT but is two orders of magnitude faster; however, it is still an order of magnitude slower than BM25.

So what's the catch? The index is orders of magnitude larger than that of BM25, and to get the speed advantage over monoBERT, the index needs to fit in memory so it can be searched efficiently.

Acknowledgements

These notes and tables are largely based on Pretrained Transformers for Text Ranking: BERT and Beyond.