# Pre-training Methods in Information Retrieval

Fan Y, Xie X, Cai Y, et al. Pre-training Methods in Information Retrieval[J]. arXiv preprint arXiv:2111.13853, 2021.

## Notes

### Intro & Background

• What is IR?
  • Retrieving information relevant to the user's request from a large collection of candidate answers.
  • There may be multiple potentially relevant results, so a "relevance score" has to be defined.
• What is pre-training? (omitted)
• Viewing the IR problem from different perspectives:
  • Core Problem View: the core problem is computing the similarity between a query q and a document d
  • Framework View: the retrieval process returns the top-k most relevant results
  • System View: given a query q, output a sorted list of documents…

### Retrieval Component

Sparse retrieval works at the token level; classic algorithms include TF-IDF and BM25. Both were covered in the first section of https://c7w.tech/elasticsearch/, quoted here:

• TF-IDF

TF is the normalized term frequency, and IDF is the inverse document frequency. Given a document collection $D$, with $d_i \in D, 1 \le i \le n$:

TF only describes a term's frequency within a single document. But consider a term like "we": it may occur in every document of the collection $D$, often with high frequency. Such terms have little power to discriminate between documents, so IDF is introduced to down-weight them:

TF measures how often a term occurs within one document, while IDF discounts terms that are common across the whole collection. A document can therefore be represented by the vector of TF-IDF values of its terms, and relevance between documents can then be computed with measures such as cosine similarity.
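The formulas themselves did not survive the copy from the blog; for reference, the standard definitions (my reconstruction, the quoted blog may use a slightly different smoothing):

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{|D|}{\left|\{\, d_i \in D : t \in d_i \,\}\right|}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```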

• BM25

BM25 is a classic algorithm in information retrieval for scoring the similarity between a query and a document. It combines three factors:

1. the relevance between each query term $q_i$ and the document $d$
2. the similarity between the term $q_i$ and the query itself
3. the weight of each term

The general form of the BM25 score:

• $W_i$: the weight of query term $q_i$

• $R(q_i, d)$: the relevance of term $q_i$ to document $d$
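The formula images are gone from these notes; the textbook form (my reconstruction, ignoring the query-side frequency factor) is:

```latex
\mathrm{Score}(Q, d) = \sum_{i} W_i \cdot R(q_i, d), \qquad
W_i = \mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5},

R(q_i, d) = \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left( 1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}} \right)}
```

where $N$ is the number of documents, $n(q_i)$ the document frequency of $q_i$, $f_i$ the frequency of $q_i$ in $d$, and $k_1, b$ are hyperparameters.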

BM25's design rests on an important observation: the relationship between term frequency and relevance is non-linear. Each term's contribution to a document's relevance score saturates; once a term's occurrence count passes a certain threshold, its influence stops growing linearly, and that threshold depends on the document itself.
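The saturation is easy to see in code. A minimal BM25 I wrote for illustration (not the Elasticsearch implementation; `k1` and `b` at their usual defaults):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 scores of tokenized `docs` against a tokenized `query`;
    k1 controls term-frequency saturation, b controls length normalization."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}  # doc frequency
    scores = []
    for d in docs:
        s = 0.0
        for t in query:
            f = d.count(t)  # term frequency in this document
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # saturating tf factor: the contribution flattens as f grows
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["the", "cat", "sat"],
        ["the", "dog", "barked", "at", "the", "cat"],
        ["quantum", "retrieval"]]
print(bm25_scores(["cat", "dog"], docs))
```

The second document matches both query terms and scores highest; the third matches neither and scores exactly zero.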

#### Sparse Retrieval

• Term re-weighting: measure term weights with contextual semantics.
• Document Expansion: expanding documents or queries.
• Re-weighting + expansion
• Sparse Representation learning

#### Dense Retrieval

• Use pretrained models as encoders, then fine-tune them accordingly.
• Use specific tasks to pretrain for IR
• Fine-tuning: distillation; using informative negative samples

### Re-ranker Component

• Representation-focused: $relevance = f(PLM(Q), PLM(D))$
• Interaction-focused: $relevance = f(PLM(Q, D))$

### Other Component

Query Understanding:

• Query expansion
• Query rewriting
• Query suggestion$^*$
• Search Clarification
• Personalized Search

Document Summarization

• Generic Document Summarization
• Snippet Generation
• Keyphrase Extraction

# Latent Retrieval for Weakly Supervised Open Domain Question Answering

Lee K, Chang M W, Toutanova K. Latent retrieval for weakly supervised open domain question answering[J]. arXiv preprint arXiv:1906.00300, 2019.

## Background Information

• What is open-domain QA? Abbreviated ODQA; it means answering questions based on a collection of texts covering a wide range of topics.

Definition: Formally speaking, giving an answer based on a document collection covering a wide range of topics is called open-domain question answering (ODQA).

Challenges: The ODQA task combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer span from those articles).

Architecture: There are several approaches to the architecture of an ODQA system. A modular ODQA system consists of two components, the first one (the ranker) should be able to find the relevant articles in a database (e.g., Wikipedia), whereas the second one (the reader) extracts an answer from a single article or a small collection of articles retrieved by the ranker. In addition to the strictly two-component ODQA systems, there are hybrid systems that are based on several rankers where the last ranker in the pipeline is combined with an answer extraction module usually via reinforcement learning.

• 什么是 Latent Variable?

In statistics, latent variables (from Latin: present participle of lateo (“lie hidden”), as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

## Motivation

(A review of existing work: ① DrQA (2017): trains on question-answer-evidence pairs, and at test time grabs an IR system to produce the evidence. ② TriviaQA, SearchQA, Quasar: weakly supervised, likewise relying on an IR system to produce evidence.)

## Approach / Feature

In this work, we introduce the first Open Retrieval Question Answering system (ORQA). ORQA learns to retrieve evidence from an open corpus, and is supervised only by question-answer string pairs.

The key insight of this work is that end-to-end learning is possible if we pre-train the retriever with an unsupervised Inverse Cloze Task (ICT).

What is ICT? In ICT, a sentence is treated as a pseudo question, and its context is treated as pseudo evidence. Given a pseudo-question, ICT requires selecting the corresponding pseudo-evidence out of the candidates in a batch.
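A sketch of the example construction (my illustration, not the paper's code):

```python
import random

def ict_examples(passages, rng=random.Random(0)):
    """Build one Inverse Cloze Task example per passage: a random sentence
    becomes the pseudo-query, the remaining sentences become the
    pseudo-evidence. Within a batch, each pseudo-query's positive is its
    own evidence; the other passages' evidence blocks act as negatives."""
    examples = []
    for sentences in passages:
        i = rng.randrange(len(sentences))
        pseudo_query = sentences[i]                          # held-out sentence
        pseudo_evidence = sentences[:i] + sentences[i + 1:]  # its context
        examples.append((pseudo_query, " ".join(pseudo_evidence)))
    return examples

batch = ict_examples([
    ["Zebras have black-and-white stripes.", "They live in Africa.", "They graze on grass."],
    ["Python is a programming language.", "It was created by Guido van Rossum."],
])
for q, ev in batch:
    print(q, "=>", ev)
```

No labels are needed: the (pseudo-query, pseudo-evidence) pairing itself is the supervision signal.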

An important aspect of ORQA is its expressivity—it is capable of retrieving any text in an open corpus, rather than being limited to the closed set returned by a blackbox IR system.

## Experiment

### Architecture

• Retriever component

### Training

• Inverse Cloze Task: a pre-training method

Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets.

### Evaluation

Evaluation was carried out on the following datasets:

• Natural Questions
• WebQuestions
• CuratedTrec
• TriviaQA
• SQuAD

## Conclusion

We presented ORQA, the first open domain question answering system where the retriever and reader are jointly learned end-to-end using only question-answer pairs and without any IR system.

This is made possible by pre-training the retriever using an Inverse Cloze Task (ICT).

Experiments show that learning to retrieve is crucial when the questions reflect an information need, i.e. the question writers do not already know the answer.

# Domain-matched pre-training tasks for dense retrieval

## Motivation

IR is an exception where pre-training has not yet produced convincing results, but with the right setup this barrier can be overcome.

So what is a right setup?

It’s been generally accepted that the more similar the end task is to the pre-training task, the larger the gains. We hypothesise that previously proposed pre-training tasks might still be too distant from the target task, which limits useful transfer.

## Approach

We therefore investigate pre-training tasks for retrieval which are as closely matched to the target task and domain as possible. To this end, we propose using two corpora for retrieval pre-training:

1) 65M synthetically generated question-answer pairs.
2) A corpus of 220 million post-comment pairs from Reddit, which we use for dialogue retrieval tasks.

Finally, the paper shows that:

1. pre-training leads to strong gains in both cases
2. domain similarity and task similarity both matter
3. retrieval benefits from larger models

## Dense Retrieval

### Bi-encoder architecture

Query encoder $E_Q$, passage encoder $E_p$, both output a fixed $d$-dim representation for each query / passage.

Passages are pre-processed offline, and their representations are indexed using a fast vector similarity search library such as FAISS.

Then, when a query $q$ arrives, we use $E_Q(q)$ as its representation and use the index to fetch the top-k closest passages.
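FAISS provides this lookup at scale with approximate indexes; as a toy stand-in, brute-force inner-product top-k in numpy (my sketch, with the encoders replaced by fixed random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
passage_vecs = rng.normal(size=(1000, 64)).astype(np.float32)  # E_p outputs, indexed offline
query_vec = passage_vecs[42] + 0.01 * rng.normal(size=64)      # E_Q(q), close to passage 42

def top_k(query_vec, passage_vecs, k=5):
    """Exact maximum-inner-product search: the brute-force equivalent of
    a flat (non-approximate) inner-product index lookup."""
    scores = passage_vecs @ query_vec
    idx = np.argsort(-scores)[:k]   # indices of the k highest scores
    return idx, scores[idx]

idx, scores = top_k(query_vec, passage_vecs)
print(idx[0])
```

Passage 42 comes back first because its vector nearly matches the query vector; a real system swaps the brute-force scan for an approximate index.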

### Training

Given a query, a relevant (+) passage, and a list of non-relevant (−) passages, the network is trained to minimize the negative log-likelihood of picking the positive passage, where the probability assigned to each passage is proportional to $e^{sim(query, passage)}$.

We do training in two steps:

• use a single BM25 negative per query
• use hard negatives obtained using the first round model
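The objective above can be sketched in numpy (my illustration; in-batch negatives only, with passage $i$ the positive for query $i$):

```python
import numpy as np

def retrieval_nll(query_vecs, passage_vecs):
    """Mean negative log-likelihood of picking each query's positive
    passage, where row i's positive is passage i and every other passage
    in the batch acts as a negative; P(passage | query) ∝ exp(sim)."""
    sim = query_vecs @ passage_vecs.T              # (B, B) dot-product scores
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
loss_aligned = retrieval_nll(q, q + 0.01 * rng.normal(size=(8, 64)))  # positives near queries
loss_random = retrieval_nll(q, rng.normal(size=(8, 64)))              # unrelated passages
print(loss_aligned, loss_random)
```

When positives sit close to their queries the loss is near zero; with unrelated passages it hovers around $\log B$.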

## Experimental setup

• Pre-training corpora
  • PAQ
  • Reddit
• Passage retrieval
  • MSMARCO
  • Natural Questions
  • KILT
• Dialogue retrieval (to show the generality of the conclusions)
  • ConvAI2
  • Ubuntu v2
  • DSTC7

# Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

## Motivation

However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential.

• iterative negative mining
• multi-vector representations

In this paper, we identify and address two underlying problems of dense retrievers:

i) fragility to training data noise

ii) requiring large batches to robustly learn the embedding space.

Then the paper hypothesizes why RocketQA (denoising + large batch size) succeeded:

• Denoising -> removes mislabelled samples
• Large batch size -> the CLS vectors are not well trained out of the box; large training batches simply help the LM learn to form the full embedding space

## Approach

We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. (?)

On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss (?) to warm up the passage embedding space.

The paper then devises a way to realize the same two goals without those two heavy approaches:

• Noise resistance -> use Condenser pre-training architecture
• Introduce a corpus-level contrastive learning objective: at each training step sample text pairs; train the model such that the CLS embeddings of text pairs from the same doc are close and those from different documents are far apart.

-> Combining the two gives the proposed coCondenser pre-training method.
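The data side of the corpus-level objective can be sketched like this (my illustration, not the paper's code): sample two spans per document; the CLS embeddings of the two spans from the same document form the positive pair, spans from other documents in the batch are negatives.

```python
import random

def sample_span_pairs(corpus, span_len=64, rng=random.Random(0)):
    """For each document (a token list), sample two random spans of
    length span_len. In coCondenser-style pre-training, the CLS
    embeddings of the two spans from the same document are pulled
    together; spans from other documents are pushed apart."""
    pairs = []
    for doc in corpus:
        n = max(len(doc) - span_len, 1)  # number of valid start offsets
        a, b = rng.randrange(n), rng.randrange(n)
        pairs.append((doc[a:a + span_len], doc[b:b + span_len]))
    return pairs

pairs = sample_span_pairs([list(range(200)), list(range(300, 500))])
print(len(pairs), len(pairs[0][0]))
```

The contrastive loss itself then has the same softmax cross-entropy shape as the retrieval NLL, applied to span CLS embeddings instead of query/passage embeddings.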

## Experiment Method

### Architecture

• Based on Condenser
• Adds a contrastive loss to the loss function
  • Universal
  • Corpus-aware
• Corpora
  • Wikipedia
  • MS-MARCO

# Sparse, Dense, and Attentional Representations for Text Retrieval

Luan Y, Eisenstein J, Toutanova K, et al. Sparse, dense, and attentional representations for text retrieval[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 329-345.

## Motivation

(First, a comparison of dense retrieval with traditional sparse retrieval.) Dual encoders perform retrieval by encoding documents and queries into dense low-dim vectors, scoring each document by its inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks.

• Sparse Retrieval: more recent work has adopted a two-stage retrieval and ranking pipeline, where a large number of documents are retrieved using sparse high dimensional query/document representations, and are further reranked with learned neural models
• Dense Retrieval: A promising alternative is to perform first-stage retrieval using learned dense low-dimensional encodings of documents and queries. The dual encoder model scores each document by the inner product between its encoding and that of the query.

### Analyzing dual encoder fidelity

That is: how much can we compress the input while maintaining the ability to mimic the performance of bag-of-words retrieval?

Section 2 proves the following: fidelity is important for the sub-problem of detecting precise term overlap, and is a tractable proxy for capacity. Using the theory of dimensionality reduction, the authors relate fidelity to the normalized margin between the gold retrieval result and its competitors, and show that this margin is in turn related to the length of documents in the collection. (I did not go through the proof in detail.)

## Approach / Feature

Building on these insights, we propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and explore sparse-dense hybrids to capitalize on the precision of sparse retrieval. These models outperform strong alternatives in large-scale retrieval.

### Multi-vector Encodings

The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important.

### Hybrid

A natural approach to balancing between the fidelity of sparse representations and the generalization of learned dense ones is to build a hybrid.

To do this, we linearly combine a sparse and dense system’s scores using a single trainable weight λ, tuned on a development set.
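A toy sketch of that tuning step (my own illustration: grid search for λ with top-1 accuracy as the dev metric):

```python
import numpy as np

def tune_lambda(sparse, dense, gold, grid=np.linspace(0, 2, 21)):
    """Grid-search the single interpolation weight λ on a dev set.
    `sparse` and `dense` are (num_queries, num_docs) score matrices
    (e.g. BM25 and dual-encoder scores); `gold` holds the index of the
    relevant document per query. Hybrid score = sparse + λ * dense."""
    def accuracy(lam):
        top1 = (sparse + lam * dense).argmax(axis=1)
        return (top1 == gold).mean()
    return max(grid, key=accuracy)

# Toy dev set: 2 queries x 2 docs; sparse scores alone rank the wrong doc first.
sparse = np.array([[1.0, 0.9], [0.2, 0.8]])
dense  = np.array([[0.0, 0.5], [0.9, 0.1]])
gold   = np.array([1, 0])
lam = tune_lambda(sparse, dense, gold)
print(lam)
```

On this toy set, λ = 0.8 is the smallest grid value at which the hybrid ranks the gold document first for both queries.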

## Experiment

• Retrieval for Open-domain QA

• Large Scale Supervised IR

## Conclusion

We have used both theoretical and empirical techniques to characterize the fidelity of fixed-length dual encoders, focusing on the role of document length.

Based on these observations, we propose hybrid models that yield strong performance while maintaining scalability.

# Condenser: a pretraining architecture for dense retrieval

## Motivation

However, dense encoders require a lot of data and sophisticated techniques to effectively train and suffer in low data situations.

Reasons?

This paper finds a key reason is that standard LMs’ internal attention structure is not ready-to-use for dense encoders, which needs to aggregate text information into the dense representation.

Attention patterns, therefore, determine how effectively the CLS token can aggregate information.

In other words, the CLS token remains dormant in many middle layers and reactivates only in the last round of attention.

## Approach

We propose to pre-train towards dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation.

## Experiment

### Architecture

• Pre-train

$L_{mlm} = \sum_{i \in \text{masked}} \operatorname{CrossEntropy}(W h_i^{cd}, x_i)$

$L_{mlm}^c = \sum_{i \in \text{masked}} \operatorname{CrossEntropy}(W h_i^{late}, x_i)$

• Fine tune

At fine-tuning time, the Condenser head is simply dropped, leaving an ordinary Transformer model.

### Fine tuning

1. Sentence similarity
   • Semantic Textual Similarity Benchmark
   • Wikipedia Section Distinction
2. Retrieval for open QA
   • NQ
   • TriviaQA
3. Retrieval for web search
   • MS MARCO

# PRE-TRAINING TASKS FOR EMBEDDING-BASED LARGE-SCALE RETRIEVAL

Chang W C, Yu F X, Chang Y W, et al. Pre-training tasks for embedding-based large-scale retrieval[J]. arXiv preprint arXiv:2002.03932, 2020.

## Motivation

Unlike the scoring phase witnessing significant advances recently due to the BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied.

Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and cannot be optimized for different downstream tasks of interest.

## Feature

In this paper, we conduct a comprehensive study on the embedding-based retrieval models. (Namely Dense Retrieval!)

We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.

We contribute the following insight:

• The two-tower Transformer models (Retrieval Stage + Reranking stage) with proper pre-training can significantly outperform the widely used BM-25 algorithm;
• Paragraph-level pre-training tasks such as Inverse Cloze Task (ICT), Body First Selection (BFS), and Wiki Link Prediction (WLP) hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains (marginal: small and not important)
• The two-tower models with deep Transformer encoders benefit more from paragraph-level pre-training than their shallow bag-of-words counterparts

# From doc2query to docTTTTTquery

Nogueira R, Lin J, Epistemic A I. From doc2query to docTTTTTquery[J]. Online preprint, 2019, 6.

## Motivation

Nogueira et al. [7] used a simple sequence-to-sequence transformer [9] for document expansion. We replace the transformer with T5 [8] and observe large effectiveness gains.

# Document Expansion by Query Prediction

## Motivation

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related or representative of the documents’ content

## Feature

Following this observation, we propose a simple method that predicts which queries will be issued for a given document and then expands it with those predictions with a vanilla sequence-to-sequence model, trained using datasets consisting of pairs of query and relevant documents.

• Method [Doc2Query]: For each document, the task is to predict a set of queries for which that document will be relevant.
• Given a dataset of (query, relevant document) pairs, we use a sequence-to-sequence transformer model (Vaswani et al., 2017) that takes as an input the document terms and produces a query.
• The document and target query are segmented using BPE (Sennrich et al., 2015) after being tokenized with the Moses tokenizer.
• Once the model is trained, we predict 10 queries using top-k random sampling and append them to each document in the corpus.

## Experiment

Evaluation was carried out on:

• MS MARCO
• TREC-CAR

# ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

## Motivation

While remarkably effective, the ranking models based on these LMs increase computational cost by orders of magnitude over prior approaches, particularly as they must feed each query–document pair through a massive neural network to compute a single relevance score

## Feature

To tackle this, we present ColBERT, a novel ranking model that adapts deep LMs (in particular, BERT) for efficient retrieval.

ColBERT introduces a late interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity.

Under late interaction, 𝑞 and 𝑑 are separately encoded into two sets of contextual embeddings, and relevance is evaluated using cheap and pruning-friendly computations between both sets—that is, fast computations that enable ranking without exhaustively evaluating every possible candidate.
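Concretely, the interaction step is the MaxSim operator; a numpy sketch of it (my illustration; real ColBERT normalizes the embeddings, here plain dot products):

```python
import numpy as np

def late_interaction_score(q_emb, d_emb):
    """ColBERT-style MaxSim: for each query token embedding, take its
    maximum dot product with any document token embedding, then sum
    over query tokens. q_emb: (n, dim); d_emb: (m, dim)."""
    sim = q_emb @ d_emb.T          # (n, m) token-level similarity matrix
    return sim.max(axis=1).sum()   # max over doc tokens, sum over query tokens

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))
doc_a = np.vstack([q[:2], rng.normal(size=(6, 64))])  # shares 2 "tokens" with q
doc_b = rng.normal(size=(8, 64))
print(late_interaction_score(q, doc_a), late_interaction_score(q, doc_b))
```

Because document token embeddings never see the query, they can all be precomputed and indexed, which is exactly what makes the interaction "late".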

# Complement Lexical Retrieval Model with Semantic Residual Embeddings

## Feature

This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding model.

## Approach

CLEAR consists of a lexical retrieval model and an embedding retrieval model. Each one's weakness is the other's strength: lexical retrieval performs exact token matching but cannot handle vocabulary mismatch, while embedding retrieval supports semantic matching but loses granular (lexical-level) information.

To ensure that the two types of models work together and fix each other’s weakness, we propose a residual-based learning framework that teaches the neural embeddings to be complementary to the lexical retrieval.

BM25:

### Embedding Retrieval Model

BERT: shared weight

### Residual Based Learning

To make the best use of the embedding model, we must avoid the embedding model “relearning” signals already captured by the lexical model. Instead, we focus its capacity on semantic level matching missing in the lexical model.

where $[x]^+ = \max\{0, x\}$

• Error-based Negative Sampling

Sample negative examples from those documents mistakenly retrieved by lexical retrieval.

Given a positive query-document pair, we uniformly sample irrelevant examples from the top N documents returned by lexical retrieval with probability p. With such negative samples, the embedding model learns to differentiate relevant documents from confusing ones that are lexically similar to the query but semantically irrelevant.

• Residual-based Margin

Intuitively, different query-document pairs require different levels of extra semantic information for matching on top of exact matching signals.

Our negative sampling strategy does not tell the neural model the degree of error made by the lexical retrieval that it needs to fix.
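The loss equation itself is missing from these notes. As I understand the paper (treat this as my reconstruction, not a quote), it is a hinge loss whose margin shrinks when lexical retrieval already separates the pair well:

```latex
\mathcal{L}(q, d^+, d^-) =
\left[\, m_{q,d^+,d^-} - s_{\text{emb}}(q, d^+) + s_{\text{emb}}(q, d^-) \,\right]^+,
\qquad
m_{q,d^+,d^-} = \xi - \lambda \left( s_{\text{lex}}(q, d^+) - s_{\text{lex}}(q, d^-) \right)
```

So the embedding model is pushed hardest exactly on the pairs where the lexical model erred, which is what "residual" refers to.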

# Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring

## Motivation

Of Cross-encoders and Bi-encoders, the former often performs better but is too slow for practical use.

## Feature

In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features.

We introduce the Poly-encoder, an architecture with an additional learnt attention mechanism that represents more global features from which to perform self-attention, resulting in performance gains over Bi-encoders and large speed gains over Cross-Encoders

## Poly-Encoder

A given candidate label is represented by one vector as in the Bi-encoder, which allows for caching candidates for fast inference time, while the input context is jointly encoded with the candidate, as in the Cross-encoder, allowing the extraction of more information.

The Poly-encoder uses two separate transformers for the context and label like a Bi-encoder, and the candidate is encoded into a single vector $y_{candi}$.

As such, the Poly-encoder method can be implemented using a precomputed cache of encoded responses. However, the input context, which is typically much longer than a candidate, is represented with m vectors ($y^1_{ctxt}, \cdots, y^{m}_{ctxt}$) instead of just one as in the Bi-encoder, where m will influence the inference speed.

To obtain these m global features that represent the input, we learn m context codes $(c_1, \cdots, c_m)$, where $c_i$ extracts representation $y^i_{ctxt}$ by attending over all the outputs of the previous layer:

The m context codes are randomly initialized, and learnt during finetuning. Finally, given our m global context features, we attend over them using $y_{candi}$ as the query:

The final score for that candidate label is then $y_{ctxt} \cdot y_{candi}$, as in a Bi-encoder. Since $m < N$, where $N$ is the number of tokens, and the context-candidate attention is performed only at the top layer, this is far faster than the Cross-encoder's full self-attention.
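The whole scoring path fits in a few lines of numpy (my sketch; in the real model the tokens come from a transformer and the codes are learned, here everything is random):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def poly_encoder_score(ctxt_tokens, codes, y_cand):
    """Poly-encoder scoring sketch.
    ctxt_tokens: (N, d) context token outputs; codes: (m, d) learned
    context codes; y_cand: (d,) candidate embedding.
    1) each code attends over the N token outputs -> m global features;
    2) the candidate attends over the m features -> one context vector;
    3) the final score is a dot product, as in a Bi-encoder."""
    attn = softmax(codes @ ctxt_tokens.T)   # (m, N) attention over tokens
    y_ctxt_m = attn @ ctxt_tokens           # (m, d) global context features
    w = softmax(y_cand @ y_ctxt_m.T)        # (m,) candidate-side attention
    y_ctxt = w @ y_ctxt_m                   # (d,) final context vector
    return float(y_ctxt @ y_cand)

rng = np.random.default_rng(0)
base = rng.normal(size=64)                       # "topic" of the context
tokens = base + 0.1 * rng.normal(size=(20, 64))  # context token outputs
codes = rng.normal(size=(4, 64))                 # m = 4 context codes
score_pos = poly_encoder_score(tokens, codes, base)                 # on-topic
score_neg = poly_encoder_score(tokens, codes, rng.normal(size=64))  # off-topic
print(score_pos, score_neg)
```

Only step 2 depends on the candidate, so candidate vectors can still be precomputed and cached, which is the whole point of the architecture.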

# Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval

## Motivation

However, this simple structure may cause serious information loss during document encoding, since the encoding is query-agnostic.

As it is very common that a document with hundreds of tokens contains several distinct topics, some important semantic information might be easily missed or biased by each other without knowing the query.

## Feature

To address this problem, we design a method to mimic the queries on each of the documents by an iterative clustering process and represent the documents by multiple pseudo queries.

To alleviate the query agnostic problem, we propose a novel approach that mimics multiple potential queries corresponding to the input document and we call them “pseudo query embeddings”.

Ideally, each of the pseudo query embeddings corresponds to a semantic salient (most important or noticeable) fragment in the document which is similar to a semantic cluster of the document.

Thus, we implement the process by a clustering algorithm (i.e., K-means in this work) and regard the cluster centroids as the pseudo query embeddings.

• This is a novel approach to represent the document with multiple pseudo query embeddings which are generated by a clustering process.

## Review: Aggregator

Independent Aggregator

$q_\star$ and $d_\star$ are the direct output of the BERT layer. A pooler is needed to extract the inputs for the scoring function. For example, $e_q = q_\star[CLS]$ in Karpukhin et al.

Although it might be efficient to compute, compressing the $m$ or $n$ embeddings into a single embedding may lose information.

Late Interaction Aggregator

As shown in Figure 1 (c), the model preserves all of the document token embeddings $\{d_i\}_{i=1}^m$ in the cache until a new query is given.

It then computes token-wise matching scores using all of the document and query embeddings. The final matching score is generated by pooling the $m \times n$ scores.

However, the time complexity of the score computation grows from constant $O(1)$ to quadratic $O(mn)$.

Semi-interactive Aggregator

compresses the document token embeddings to a constant number $k$ much smaller than the document length $m$ ($k \ll m$).

Their Method

Firstly, following the semi-interactive aggregator, we feed the document tokens into BERT and use the last-layer hidden states as the document token embeddings $\{d_i\}_{i=1}^m$. Next, we perform the K-means algorithm on these token embeddings.

The K-means algorithm mainly contains two iterative steps: an assignment step and an update step. These two steps are performed alternately until the convergence condition is satisfied.

The assignment step can be expressed by the following equation.

Update:
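The two equations did not survive the copy; what they refer to are the standard K-means steps (my reconstruction, with the centroids $c_j$ becoming the pseudo query embeddings):

```latex
\text{Assignment: } s_i = \arg\min_{j} \left\| d_i - c_j \right\|_2^2
\qquad
\text{Update: } c_j = \frac{\sum_{i :\, s_i = j} d_i}{\left|\{\, i : s_i = j \,\}\right|}
```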

## Experiment

Evaluation: MS MARCO; open-domain QA (the baselines in this field are the same few datasets over and over)

# Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

## Motivation

While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models.

We investigate three methods for extracting T5 sentence embeddings: two utilize only the T5 encoder and one uses the full T5 encoder-decoder model.

## Feature

We explore three ways of turning a pre-trained T5 encoder-decoder model into a sentence embedding model: (i) using the first token representation of the encoder; (ii) averaging all token representations from the encoder; (iii) using the first token representation from the decoder.

## Conclusion

• encoder-only models have strong transfer performance, while encoder-decoder models perform better on textual similarity tasks
• We also demonstrate the effectiveness of scaling up the model size, which greatly improves sentence embedding quality

# Multi-task Retrieval for Knowledge-Intensive Tasks

We target a retriever that can perform well on a wide variety of problems, without task-specific fine-tuning.

• What is a knowledge-intensive task? Is it a collection of tasks? (?)

KILT (Knowledge Intensive Language Tasks) is a new unified benchmark to help AI researchers build models that are better able to leverage real-world knowledge to accomplish a broad range of tasks.

## Motivation

Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data.

First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low data regimes such as few- and zero-shot settings.

Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text.

## Feature

By jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but also can lead to better performance on the downstream knowledge-intensive tasks when plugged into an existing system.

## Experiment

• The universal retriever performing comparably to task-specific models
• Plugged the universal retriever into a larger pipeline and achieved better results
• Evaluated the model’s performance in the zero-shot and few-shot settings.
• our proposed approach performs comparably to BM25 in the zero shot setting, and quickly overtakes it even with minimal in-domain training
• In Section 4.5 we evaluated a number of more complex variants of the model involving task specialisation, but failed to see clear performance improvements. Finally, in Section 4.6 we saw how a simple iterative approach to data augmentation can lead to better performance.

// I have to present a paper at next week's group meeting, ugh. I can't very well talk about these papers from 2021 and earlier, so next week, on top of the software-engineering project init, it looks like I'll have to dig up a few more papers.

// I haven't read all that many papers, and I've barely ever run BERT hands-on. Once the Challenge Cup is over I really need to start writing code, otherwise this all stays too theoretical and hand-wavy, and I don't feel my coding ability improving at all (x

// Even the small homework for the computer networks course requires reading the TCP/IP papers. No way, no way. Networks, I *