This is mainly a summary of the papers I read in 2022-02.

The main topic covered is Dense Retrieval within Information Retrieval.

Pre-training Methods in Information Retrieval

Fan Y, Xie X, Cai Y, et al. Pre-training Methods in Information Retrieval[J]. arXiv preprint arXiv:2111.13853, 2021.

Summary

This is a literature survey on the application of pre-training in IR.

The paper first introduces what IR is, and then surveys how pre-training methods are applied in IR, covering the Retrieval Component, the Re-ranking Component, and Other Components.

It also introduces pre-training tasks designed specifically for IR.

The paper then describes resources useful for applying pre-training methods in IR, including datasets, benchmarks, and leaderboards.

Finally, the authors lay out the current challenges of combining pre-training with IR and sketch possible directions for future work.

Notes

Intro & Background

  • What is IR?
    • Retrieving information relevant to a user's request from a large-scale corpus of candidate answers
    • There may be multiple potentially relevant results, so a "relevance score" has to be defined
  • What is pre-training? (omitted)
  • Viewing the IR problem from different perspectives
    • Core Problem View: the core problem is computing the similarity between a Query q and a Document d
    • Framework View: the retrieval process returns the top-k most relevant results
    • System View: Given a query q, output a sorted list of documents…


Retrieval Component

Here is some supplementary background on Sparse Retrieval and Dense Retrieval.

Both Sparse Retrieval and Dense Retrieval use some procedure to turn documents into a particular kind of representation and then build an index over that representation; the details are introduced below.

Sparse Retrieval works on token-level segmentation; the classic algorithms are TF-IDF and BM25. How these two algorithms work was covered in the first section of https://c7w.tech/elasticsearch/, which is quoted directly here:

  • TF-IDF

TF is the normalized term frequency, and IDF is the inverse document frequency. Given a document collection $D$, we have $d_i \in D, 1 \le i \le n$.

The collection contains $m$ words in total, after removing some very common words as stop words; we have $w_i \in W, 1 \le i \le m$.

TF is defined as follows, i.e., the frequency with which a word occurs in a document:
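One standard form of this definition, filling in the formula the quoted post refers to ($n_{i,j}$, the count of word $w_i$ in document $d_j$, is notation introduced here for illustration):

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$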

TF only describes a word's frequency within a document. But consider a word like "我们" ("we"): it may appear in every document of the collection $D$, and with fairly high frequency. Such words are not good at discriminating between documents, so IDF is introduced to reduce the influence of these generic words:
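A common form of IDF (again filling in the missing formula; the $+1$ in the denominator is a smoothing choice to avoid division by zero):

$$\mathrm{IDF}_i = \log \frac{n}{1 + |\{ j : w_i \in d_j \}|}$$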

Combining these two parts, we obtain TF-IDF:
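That is, in its standard form:

$$\text{TF-IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$$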

TF measures how often a word appears in a document, while IDF discounts generic words. A document can therefore be represented by the vector of the TF-IDF values of its words, and the relevance between documents can then be computed with methods such as cosine similarity.

  • BM25

BM25 is a classic algorithm in information retrieval for computing a similarity score between a query and a document.

Unlike TF-IDF, the BM25 formula consists of three main parts:

  1. the relevance between each word $q_i$ in the query and the document $d$
  2. the similarity between the word $q_i$ and the query itself
  3. the weight of each word

The general form of the BM25 score:
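The usual form of this score, matching the three parts listed above:

$$\mathrm{Score}(Q, d) = \sum_{q_i \in Q} W_i \cdot R(q_i, d)$$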

where $Q$ denotes the query, $q_i \in Q$, and $d$ denotes the document.

The individual parts are expanded below:

  • $W_i$
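Filling in the missing formula with the standard smoothed-IDF weight used by BM25:

$$W_i = \log \frac{N - df_i + 0.5}{df_i + 0.5}$$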

where $N$ is the total number of documents, and $df_i$ is the number of documents containing $q_i$.

Following the idea of IDF, for a given $q_i$, the more documents contain $q_i$, the less important (or less discriminative) $q_i$ is and the smaller its IDF; IDF can therefore be used to weight $q_i$'s contribution to the query-document relevance.

  • $R(q_i, d)$

BM25 is designed around an important observation: the relationship between term frequency and relevance is non-linear. That is, each word's contribution to a document's relevance score is capped; once a word's occurrence count reaches a certain threshold, its influence no longer grows linearly, and this threshold depends on the document itself.
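The standard form of this part, consistent with the two-factor decomposition discussed right below:

$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2}$$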

We can view the above formula in two parts, where $f_i$ is the number of times $q_i$ occurs in $d$, $qf_i$ is the frequency of $q_i$ in the query, and $k_1, k_2, K$ are constants.

The latter part, $\dfrac {qf_i \cdot(k_2+1)}{qf_i+k_2}$, controls the similarity between $q_i$ and the query.

The former part computes the similarity between $q_i$ and $d$, where $K = k_1 \cdot (1-b+b\cdot \dfrac {|d|}{AVG_n(|d|)})$, and the parameter $b$ adjusts how much document length influences relevance.

In practice we can simply take $k_1 = 2, k_2 = 0, b = 0.75$.

In any case, the upcoming usage mostly boils down to calling libraries, and the parameters can be tuned through configuration files.

Writing this, I realize that the Elasticsearch 8.0 tutorial I promised earlier still hasn't been started… next time, for sure.

In other words, each Document $d$ is first segmented at the token level and a score is computed for each token, building an inverted index from tokens to documents. Then, whenever a Query $q$ arrives, $q$ is segmented directly, the corresponding scores are looked up in the inverted index and summed to obtain a similarity score for each document, and the documents are then sorted.

The data structure used here is exactly this inverted index.
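A toy sketch of this pipeline, with whitespace tokenization and raw term counts standing in for real TF-IDF/BM25 weights (all names here are made up for illustration):

```python
from collections import defaultdict

# Toy sparse retrieval: build a token -> {doc_id: weight} inverted index,
# then score a query by summing the weights of its tokens per document.
docs = {
    "d1": "dense retrieval maps documents to vectors",
    "d2": "bm25 scores documents by token overlap",
}

inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        inverted_index[token][doc_id] = inverted_index[token].get(doc_id, 0) + 1

def search(query, top_k=10):
    scores = defaultdict(float)
    for token in query.split():
        for doc_id, weight in inverted_index.get(token, {}).items():
            scores[doc_id] += weight      # sum per-token contributions
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(search("bm25 retrieval"))
```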

Dense Retrieval is different. As the word "Dense" in its name suggests, each Document $d$ is passed through an encoder, say BERT $\phi$, and $\phi(d)[cls]$ is taken as its representation.

When a Query arrives, the query $q$ is passed through the same BERT $\phi$ to obtain $\phi(q)$, and we want the top-k set of documents $d$ maximizing $Sim(\phi(q), \phi(d))$.

Data structures for organizing vectors in Euclidean space can be used here, e.g. FAISS, which is implemented by partitioning the Euclidean space. Feels a bit like a search tree? I should read the related tutorials in detail later. The main purpose of this structure is to find the k vectors (in the same linear space) closest to a given vector.
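A minimal sketch of this setup with FAISS, using random vectors as stand-ins for the BERT [CLS] embeddings $\phi(d)$ and $\phi(q)$:

```python
import numpy as np
import faiss  # vector similarity search library mentioned above

dim, n_docs = 768, 10000
doc_vecs = np.random.randn(n_docs, dim).astype("float32")   # offline phi(d) for each document

index = faiss.IndexFlatIP(dim)   # exact inner-product search; IVF/HNSW indexes scale better
index.add(doc_vecs)              # index the pre-computed document representations

query_vec = np.random.randn(1, dim).astype("float32")       # phi(q) at query time
scores, doc_ids = index.search(query_vec, 5)                 # top-5 most similar documents
print(doc_ids[0], scores[0])
```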

Also, why not use a single BERT $\psi$ to compute $\psi(q + \text{‘[sep]’} + d)$? Because then the cost for every query $q$ is inevitably linear in the number of documents, and every computation requires a full BERT forward pass, which is extremely expensive. So it is ruled out for first-stage retrieval on time-efficiency grounds; we will use it later in the Re-ranker, because by then we already have the top-k candidates, and there $k$ is far smaller than the collection size.

Below are possible ways the two retrieval paradigms apply pre-training methods:

Sparse Retrieval

  • Term re-weighting: measure term weights with contextual semantics.
  • Document Expansion: expanding documents or queries.
  • Re-weighting + expansion
  • Sparse Representation learning

Dense Retrieval

  • Use pretrained models as encoders, then fine-tune them accordingly.
  • Use specific tasks to pretrain for IR
  • Fine-tuning: distillation; using informative negative samples (hard negatives)

Hybrid Retrieval

Re-ranker Component

  • Representation focused $relevance = f(PLM(Q), PLM(D))$
  • Interaction focused $relevance=f(PLM(Q,D))$

The second of these is exactly the single-BERT $\psi$ approach mentioned above.

Other Component

Query Understanding:

  • Query expansion
  • Query rewriting
  • Query suggestion$^*$
  • Search Clarification
  • Personalized Search

Document Summarization

  • Generic Document Summarization
  • Snippet Generation
  • Keyphrase Extraction

Latent Retrieval for Weakly Supervised Open Domain Question Answering

Lee K, Chang M W, Toutanova K. Latent retrieval for weakly supervised open domain question answering[J]. arXiv preprint arXiv:1906.00300, 2019.

Background Information

  • What is Open-Domain QA? Abbreviated ODQA, it means answering questions based on a collection of texts covering a wide range of topics.

Definition: Formally speaking, giving an answer based on a document collection covering a wide range of topics is called open-domain question answering (ODQA).

Challenges: The ODQA task combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer span from those articles).

Architecture: There are several approaches to the architecture of an ODQA system. A modular ODQA system consists of two components, the first one (the ranker) should be able to find the relevant articles in a database (e.g., Wikipedia), whereas the second one (the reader) extracts an answer from a single article or a small collection of articles retrieved by the ranker. In addition to the strictly two-component ODQA systems, there are hybrid systems that are based on several rankers where the last ranker in the pipeline is combined with an answer extraction module usually via reinforcement learning.

  • What is a latent variable?

In statistics, latent variables (from Latin: present participle of lateo (“lie hidden”), as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

Motivation

Existing approaches either supervise the model with evidence for the corresponding answer, or build in an IR system. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system.

(Review of existing work: ① DrQA (2017) trains on question-answer-evidence pairs and pulls in an IR system at test time to produce evidence. ② TriviaQA, SearchQA, Quasar: weakly supervised, likewise relying on IR to produce evidence.)


However, QA differs from IR: IR cares more about lexical and term-level matching, whereas QA puts more emphasis on understanding the question and the answer.

Approach / Feature

In this work, we introduce the first Open Retrieval Question Answering system (ORQA). ORQA learns to retrieve evidence from an open corpus, and is supervised only by question-answer string pairs.

The key insight of this work is that end-to-end learning is possible if we pre-train the retriever with an unsupervised Inverse Cloze Task (ICT).

What is ICT? In ICT, a sentence is treated as a pseudo question, and its context is treated as pseudo evidence. Given a pseudo-question, ICT requires selecting the corresponding pseudo-evidence out of the candidates in a batch.

An important aspect of ORQA is its expressivity—it is capable of retrieving any text in an open corpus, rather than being limited to the closed set returned by a blackbox IR system.

Experiment

Architecture

Still a modified BERT. $Score = S_{retr} + S_{read}$

  • Retriever component
  • Reader component

The problem, however, is that the corpus (Wikipedia) is too large and too noisy, so a naive approach cannot be trained. Hence ICT is proposed.

Training

  • Inverse Cloze Task: a pre-training method

Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets.

First consider why the traditional question-evidence approach works: the evidence contains the information the question asks for, just mixed with extra information the question does not need. So the question-context approach essentially takes contexts that are semantically close to the question as evidence.

This motivates the Inverse Cloze Task. Recall that a Cloze task predicts masked text based on its context. (So what is the ICT task?) ICT, conversely, takes a sentence and predicts its context.

Here q is a random sentence, b is the context that q came from, and BATCH−{b} are randomly sampled negative examples.
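A toy sketch of how such ICT training pairs can be built (naive period-based sentence splitting; all names are made up for illustration):

```python
import random

# ICT example construction: a random sentence becomes the pseudo-query q,
# the remaining sentences of its passage become the pseudo-evidence b,
# and the other evidences in the batch serve as in-batch negatives.
def make_ict_example(passage: str):
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    i = random.randrange(len(sentences))
    pseudo_query = sentences[i]                                    # q: the held-out sentence
    pseudo_evidence = " ".join(sentences[:i] + sentences[i + 1:])  # b: its surrounding context
    return pseudo_query, pseudo_evidence

passages = ["First sentence. Second sentence. Third sentence.",
            "Another passage. With two sentences."]
batch = [make_ict_example(p) for p in passages]
# For query i, evidence i is the positive; evidences j != i are the negatives.
```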

Evaluation

Evaluation was carried out on the following datasets:

  • Natural Questions
  • WebQuestions
  • CuratedTrec
  • TriviaQA
  • SQuAD

Conclusion

We presented ORQA, the first open domain question answering system where the retriever and reader are jointly learned end-to-end using only question-answer pairs and without any IR system.

This is made possible by pre-training the retriever using an Inverse Cloze Task (ICT).

Experiments show that learning to retrieve is crucial when the questions reflect an information need, i.e. the question writers do not already know the answer.

Domain-matched pre-training tasks for dense retrieval

Motivation

IR is an exception where pre-training has not yet produced convincing results. But with the right setup, this barrier can be overcome.

So what is a right setup?

It’s been generally accepted that the more similar the end task is to the pre-training task, the larger the gains. We hypothesise that previously proposed pre-training tasks might still be too distant from the target task, which limits useful transfer.

Approach

We therefore investigate pre-training tasks for retrieval which are as closely matched to the target task and domain as possible. To this end, we propose using two corpora for retrieval pre-training:

1) 65M synthetically generated question-answer pairs.
2) A corpus of 220 million post-comment pairs from Reddit, which we use for dialogue retrieval tasks.

Finally, they show that:

  1. pre-training leads to strong improvements in both settings
  2. domain similarity and task similarity both matter
  3. retrieval can benefit from larger models

Dense Retrieval

Bi-encoder architecture

Query encoder $E_Q$, passage encoder $E_p$, both output a fixed $d$-dim representation for each query / passage.

Passages are pre-processed offline, and their representations are indexed using a fast vector similarity search library such as FAISS(?)

Then, when a query $q$ arrives, we can use $E_Q(q)$ as its representation and use the index to retrieve the top-k closest passages.

Training

Given a query, a relevant (+) passage and a list of non-relevant (-) passages, the network is trained to minimize the negative log likelihood of picking the positive passage. And the probability assigned to each passage is proportional to $e^{sim(query, passage)}$.
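A minimal PyTorch sketch of this objective with in-batch negatives (passage $i$ is the positive for query $i$, and all other passages in the batch act as negatives; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

# Negative log likelihood of the positive passage, with probabilities
# proportional to exp(sim(query, passage)) over the passages in the batch.
def nll_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    sim = q_emb @ p_emb.T                     # [B, B] dot-product similarity matrix
    targets = torch.arange(q_emb.size(0))     # positive passage index for each query
    return F.cross_entropy(sim, targets)      # = -log softmax(sim)[i, i]

q = torch.randn(4, 768)                       # query embeddings
p = torch.randn(4, 768)                       # passage embeddings (positive i for query i)
print(nll_loss(q, p))
```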

We do training in two steps:

  • use a single BM25 negative per query
  • use hard negatives obtained using the first round model

Experimental setup

Pre-training tasks

  • PAQ
  • Reddit

Evaluation tasks

  • Passage retrieval
    • MSMARCO
    • Natural Questions
    • KILT
  • Dialogue retrieval (to show the generality of conclusions)
    • ConvAI2
    • Ubuntu v2
    • DSTC7

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Motivation

However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential.

  • iterative negative mining
  • multi-vector representations

In this paper, we identify and address two underlying problems of dense retrievers:

i) fragility to training data noise

ii) requiring large batches to robustly learn the embedding space.

Then we try to give a hypothesis about why RocketQA (denoising + large batch size) succeeded.

  • Denoising -> remove mislabelled samples
  • Large batch size -> the CLS vectors are not well trained at first; large training batches simply help the LM learn to form the full embedding space.

Approach

We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. (?)

On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss (?) to warm up the passage embedding space.

The question is then whether we can come up with a way to realize these two goals without relying on those two approaches.

  • Noise resistance -> use Condenser pre-training architecture
  • Introduce a corpus-level contrastive learning objective: at each training step sample text pairs; train the model such that the CLS embeddings of text pairs from the same doc are close and those from different documents are far apart.

-> Combining the two, the coCondenser pre-training method is proposed.

Experiment Method

Architecture

  • Based on Condenser

  • Added contrastive loss to loss function

Memory-efficient Pretraining

  • Gradient Caching

Pre-training

  • Universal
  • Corpus aware

Evaluation

  • Wikipedia
  • MS-MARCO

Sparse, Dense, and Attentional Representations for Text Retrieval

Luan Y, Eisenstein J, Toutanova K, et al. Sparse, dense, and attentional representations for text retrieval[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 329-345.

The theoretical derivations in this paper lean heavily toward mathematical proofs.

Motivation

(First, a comparison of Dense Retrieval with traditional Sparse Retrieval.) Dual encoders perform retrieval by encoding documents and queries into dense low-dim vectors, scoring each document by its inner product with the query. We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks.

Below is the review part:

  • Sparse Retrieval: more recent work has adopted a two-stage retrieval and ranking pipeline, where a large number of documents are retrieved using sparse high dimensional query/document representations, and are further reranked with learned neural models
  • Dense Retrieval: A promising alternative is to perform first-stage retrieval using learned dense low-dimensional encodings of documents and queries. The dual encoder model scores each document by the inner product between its encoding and that of the query.

The intuitions behind the two differ. Sparse Retrieval emphasizes that key terms in the question should overlap with terms in the retrieved document, whereas Dense Retrieval focuses more on semantic similarity.

Analyzing dual encoder fidelity

Here fidelity can be understood as faithfulness: how well the exact terms of the original text are remembered.

That is: how much can we compress the input while maintaining the ability to mimic the performance of bag-of-words retrieval?

Section 2 proves the following: Fidelity is important for the sub-problem of detecting precise term overlap, and is a tractable proxy for capacity. Using the theory of dimensionality reduction, we relate fidelity to the normalized margin between the gold retrieval result and its competitors, and show that this margin is in turn related to the length of documents in the collection. (I did not read the proofs carefully.)

Approach / Feature

Building on these insights, we propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures, and explore sparse-dense hybrids to capitalize on the precision of sparse retrieval. These models outperform strong alternatives in large-scale retrieval.

Multi-vector Encodings

The theoretical analysis suggests that fixed-length vector representations of documents may in general need to be large for long documents, if fidelity with respect to sparse high-dimensional representations is important.


Hybrid

A natural approach to balancing between the fidelity of sparse representations and the generalization of learned dense ones is to build a hybrid.

To do this, we linearly combine a sparse and dense system’s scores using a single trainable weight λ, tuned on a development set.
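Written out, one natural form of this combination (my paraphrase, with $\lambda$ the single trainable weight):

$$s_{hybrid}(q, d) = s_{dense}(q, d) + \lambda \cdot s_{sparse}(q, d)$$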

Experiment

  • Retrieval for Open-domain QA


  • Large Scale Supervised IR


Conclusion

We have used both theoretical and empirical techniques to characterize the fidelity of fixed-length dual encoders, focusing on the role of document length.

Based on these observations, we propose hybrid models that yield strong performance while maintaining scalability.

Condenser: a pretraining architecture for dense retrieval

Motivation

However, dense encoders require a lot of data and sophisticated techniques to effectively train and suffer in low data situations.

Reasons?

This paper finds a key reason is that standard LMs’ internal attention structure is not ready-to-use for dense encoders, which needs to aggregate text information into the dense representation.

Attention patterns, therefore, define how effective CLS can aggregate information.

In other words, the CLS token remains dormant in many middle layers and reactivates only in the last round of attention.

Approach

We propose to pre-train towards dense encoder with a novel Transformer architecture, Condenser, where LM prediction CONditions on DENSE Representation.

Experiment

Architecture


  • Pre-train

The key point is the design of the head. To force more information into CLS, the head takes as input the LATE-layer CLS concatenated with the EARLY-layer outputs of the other tokens; the purpose is mainly to strengthen the representational power of CLS.

To keep the head from corrupting the backbone's encodings, the loss is set to $L = L_{mlm} + L_{mlm}^c$.

$L_{mlm} = \sum_{i \in masked} CrossEntropy(Wh_i^{cd}, x_i)$

$L_{mlm}^c = \sum _ {i \in masked} CrossEntropy(Wh_i^{late}, x_i)$
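A rough PyTorch sketch of this head-plus-two-MLM-losses idea (not the official implementation; the layer count, vocabulary size, and the use of nn.TransformerEncoderLayer for the head are illustrative stand-ins):

```python
import torch
import torch.nn as nn

class CondenserHead(nn.Module):
    def __init__(self, dim=768, vocab=30522, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(dim, vocab)   # projection W shared with the backbone MLM head

    def forward(self, early, late):
        # early / late: [B, L, dim] hidden states from an early / the last backbone layer
        head_in = torch.cat([late[:, :1], early[:, 1:]], dim=1)  # [late CLS ; early token outputs]
        return self.lm_head(self.blocks(head_in))                # MLM logits h^cd

def condenser_loss(head_logits, late_logits, labels):
    # labels: [B, L] with -100 on unmasked positions (ignored by cross entropy)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    l_head = ce(head_logits.transpose(1, 2), labels)   # L_mlm over h^cd (head)
    l_late = ce(late_logits.transpose(1, 2), labels)   # L_mlm^c over h^late (backbone)
    return l_head + l_late
```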

  • Fine tune

At fine-tuning time this head is simply dropped, and the model becomes an ordinary Transformer.

Fine tuning

  1. Sentence Similarity

Semantic Textual Similarity Benchmark

Wikipedia Section Distinction

  2. Retrieval for Open QA
  • NQ
  • TriviaQA
  3. Retrieval for web search
  • MS MARCO

PRE-TRAINING TASKS FOR EMBEDDING-BASED LARGE-SCALE RETRIEVAL

Chang W C, Yu F X, Chang Y W, et al. Pre-training tasks for embedding-based large-scale retrieval[J]. arXiv preprint arXiv:2002.03932, 2020.

Motivation

Unlike the scoring phase witnessing significant advances recently due to the BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied.

Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and can not be optimized for different downstream tasks of interest.

Feature

In this paper, we conduct a comprehensive study on the embedding-based retrieval models. (Namely Dense Retrieval!)

We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.

We contribute the following insights:

  • The two-tower Transformer models (one tower encoding the query, one encoding the document) with proper pre-training can significantly outperform the widely used BM-25 algorithm;
  • Paragraph-level pre-training tasks such as Inverse Cloze Task (ICT), Body First Selection (BFS), and Wiki Link Prediction (WLP) hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains (marginal: small and not important);
  • The two-tower models with deep Transformer encoders benefit more from paragraph-level pre-training than their shallow bag-of-words counterparts.

From doc2query to docTTTTTquery

Nogueira R, Lin J. From doc2query to docTTTTTquery[J]. Online preprint, 2019.

Motivation

Nogueira et al. [7] used a simple sequence-to-sequence transformer [9] for document expansion. We replace the transformer with T5 [8] and observe large effectiveness gains.

Document Expansion by Query Prediction

Motivation

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related or representative of the documents’ content

Feature

Following this observation, we propose a simple method that predicts which queries will be issued for a given document and then expands it with those predictions with a vanilla sequence-to-sequence model, trained using datasets consisting of pairs of query and relevant documents.

  • Method [Doc2Query]: For each document, the task is to predict a set of queries for which that document will be relevant.
    • Given a dataset of (query, relevant document) pairs, we use a sequence-to-sequence transformer model (Vaswani et al., 2017) that takes as an input the document terms and produces a query.
    • The document and target query are segmented using BPE (Sennrich et al., 2015) after being tokenized with the Moses tokenizer.
    • Once the model is trained, we predict 10 queries using top-k random sampling and append them to each document in the corpus.

Then BM25 is used as the retriever, with the expanded documents replacing the original ones.

Experiment

Evaluation was carried out on:

  • MS MARCO
  • TREC-CAR

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Motivation

While remarkably effective, the ranking models based on these LMs increase computational cost by orders of magnitude over prior approaches, particularly as they must feed each query–document pair through a massive neural network to compute a single relevance score

Feature

To tackle this, we present ColBERT, a novel ranking model that adapts deep LMs (in particular, BERT) for efficient retrieval.

ColBERT introduces a late interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet powerful interaction step that models their fine-grained similarity.

Under late interaction, 𝑞 and 𝑑 are separately encoded into two sets of contextual embeddings, and relevance is evaluated using cheap and pruning-friendly computations between both sets—that is, fast computations that enable ranking without exhaustively evaluating every possible candidate.


Complement Lexical Retrieval Model with Semantic Residual Embeddings

Feature

This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model.

Approach

CLEAR consists of a lexical retrieval model and an embedding retrieval model. Between these two models, one's weakness is the other's strength: lexical retrieval performs exact token matching but cannot handle vocabulary mismatch; meanwhile, the embedding retrieval supports semantic matching but loses granular (lexical-level) information.

To ensure that the two types of models work together and fix each other’s weakness, we propose a residual-based learning framework that teaches the neural embeddings to be complementary to the lexical retrieval.

Lexical Retrieval Model

BM25, as introduced in the Sparse Retrieval section above.

Embedding Retrieval Model

BERT, with weights shared between the query encoder and the document encoder.

Residual Based Learning

To make the best use of the embedding model, we must avoid the embedding model “relearning” signals already captured by the lexical model. Instead, we focus its capacity on semantic level matching missing in the lexical model.

The generic loss function:
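Filling in the formula in a standard triplet/hinge form (my reconstruction rather than a quote from the paper; $d^+$ is a relevant document, $d^-$ a non-relevant one, $m$ the margin):

$$\mathcal{L} = \sum \left[ m - s_{emb}(q, d^+) + s_{emb}(q, d^-) \right]^+$$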

where $[x]^+ = \max\{0, x\}$

To make the embedding model complement the lexical retrieval, two techniques are proposed:

  • Error-based Negative Sampling

Sample negative examples from those documents mistakenly retrieved by lexical retrieval.

Given a positive query-document pair, we uniformly sample irrelevant examples from the top N documents returned by lexical retrieval with probability p. With such negative samples, the embedding model learns to differentiate relevant documents from confusing ones that are lexically similar to the query but semantically irrelevant.

  • Residual-based Margin

Intuitively, different query-document pairs require different levels of extra semantic information for matching on top of exact matching signals.

Our negative sampling strategy does not tell the neural model the degree of error made by the lexical retrieval that it needs to fix.

Hence the margin is modified accordingly:

Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring

Motivation

Existing approaches: Cross-encoders and Bi-encoders.

The former often performs better, but is too slow for practical use.

Feature

In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features.

We introduce the Poly-encoder, an architecture with an additional learnt attention mechanism that represents more global features from which to perform self-attention, resulting in performance gains over Bi-encoders and large speed gains over Cross-Encoders

Poly-Encoder

A given candidate label is represented by one vector as in the Bi-encoder, which allows for caching candidates for fast inference time, while the input context is jointly encoded with the candidate, as in the Cross-encoder, allowing the extraction of more information.

The Poly-encoder uses two separate transformers for the context and label like a Bi-encoder, and the candidate is encoded into a single vector $y_{candi}$ .

As such, the Poly-encoder method can be implemented using a precomputed cache of encoded responses. However, the input context, which is typically much longer than a candidate, is represented with m vectors ($y^1_{ctxt}, \cdots, y^{m}_{ctxt}$) instead of just one as in the Bi-encoder, where m will influence the inference speed.

To obtain these m global features that represent the input, we learn m context codes $(c_1, \cdots, c_m)$, where $c_i$ extracts representation $y^i_{ctxt}$ by attending over all the outputs of the previous layer:
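This attention has roughly the following form (reconstructed here, with $h_1, \ldots, h_N$ the outputs of the previous layer):

$$y^i_{ctxt} = \sum_j w^{c_i}_j h_j, \quad (w^{c_i}_1, \ldots, w^{c_i}_N) = \mathrm{softmax}(c_i \cdot h_1, \ldots, c_i \cdot h_N)$$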

The m context codes are randomly initialized, and learnt during finetuning. Finally, given our m global context features, we attend over them using $y_{candi}$ as the query:
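Roughly (again reconstructed):

$$y_{ctxt} = \sum_i w_i \, y^i_{ctxt}, \quad (w_1, \ldots, w_m) = \mathrm{softmax}(y_{candi} \cdot y^1_{ctxt}, \ldots, y_{candi} \cdot y^m_{ctxt})$$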

The final score for that candidate label is then $y_{ctxt} \cdot y_{candi}$ as in a Bi-encoder. As m < N, where N is the number of tokens, and the context-candidate attention is only performed at the top layer, this is far faster than the Cross-encoder’s full self-attention.


But… what about inference time?


Okay. Although the complexity doesn't quite feel right to me, the paper claims it is a few orders of magnitude better than the Cross-encoder. Fine.

Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval

Motivation

However, this simple structure may cause serious information loss during the encoding of documents since the queries are agnostic.

As it is very common that a document with hundreds of tokens contains several distinct topics, some important semantic information might be easily missed or biased by each other without knowing the query.

Feature

To address this problem, we design a method to mimic the queries on each of the documents by an iterative clustering process and represent the documents by multiple pseudo queries.

To alleviate the query agnostic problem, we propose a novel approach that mimics multiple potential queries corresponding to the input document and we call them “pseudo query embeddings”.

Ideally, each of the pseudo query embeddings corresponds to a semantic salient (most important or noticeable) fragment in the document which is similar to a semantic cluster of the document.

Thus, we implement the process by a clustering algorithm (i.e., K-means in this work) and regard the cluster centroids as the pseudo query embeddings.

  • This is a novel approach to represent the document with multiple pseudo query embeddings which are generated by a clustering process.

Review: Aggregator

image-20220207164752721

Independent Aggregator

$q_\star$ and $d_\star$ are the direct output of the BERT layer. A pooler is needed to extract the inputs for the scoring function. For example, $e_q = q_\star[CLS]$ in Karpukhin et al.

Although it might be efficient to compute, compressing m or n embeddings just into 1 embedding may lose information.

Late Interaction Aggregator

As shown in Figure 1 (c), the model preserves all of the document token embeddings $\{d_i\}_{i=1}^m$ in the cache until a new query is given.

It then computes token-wise matching scores using all of the document and query embeddings. The final matching score is generated by pooling the m × n scores.

However, the time complexity of the score computation rises from constant $O(1)$ to quadratic $O(mn)$.

Semi-interactive Aggregator

compresses the document token embeddings to a constant number $k$ much smaller than the document length $m$ ($k \ll m$).

Their Method

Firstly, following the semi-interactive aggregator, we feed the document tokens into BERT and use the last layer hidden states as the document token embeddings $\{d_i\}_{i=1}^m$. Next, we perform the K-means algorithm on these token embeddings.

The K-means algorithm mainly contains two iterative steps: the assignment step and the update step. These two steps are performed alternately until the convergence condition is satisfied.

The assignment step can be expressed by the following equation.
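In standard K-means notation (assigning token embedding $d_i$ to the nearest centroid $c_j^t$ at iteration $t$):

$$s_i^t = \arg\min_j \left\| d_i - c_j^t \right\|^2$$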

Update:
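Again in standard form, each centroid is recomputed as the mean of the embeddings assigned to it:

$$c_j^{t+1} = \frac{1}{|\{ i : s_i^t = j \}|} \sum_{i \,:\, s_i^t = j} d_i$$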

This is really just the K-means clustering algorithm; it's a pile of subscript/superscript notation, nothing particularly fancy.

Then the centroids $c_j^t$ are regarded as the pseudo query embeddings.
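A minimal sketch of this step with scikit-learn's KMeans, run on stand-in token embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Pseudo query embeddings: cluster a document's token embeddings
# (e.g., BERT last-layer hidden states) and keep the centroids.
token_embs = np.random.randn(200, 768)          # m tokens x hidden dim (stand-in for {d_i})
k = 8                                            # number of pseudo queries per document
kmeans = KMeans(n_clusters=k, n_init=10).fit(token_embs)
pseudo_query_embs = kmeans.cluster_centers_      # k x 768, the c_j in the text above

# At query time the query is matched against these k centroids
# instead of all m token embeddings (k << m).
```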

Experiment

Evaluation: MS MARCO; Open QA (the baselines in this area are the same few over and over).

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

Motivation

While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models.

We investigate three methods for extracting T5 sentence embeddings: two utilize only the T5 encoder and one uses the full T5 encoder-decoder model.

Feature

The goal is to feed a sentence into T5 and obtain its representation, packing as much information into it as possible.

We explore three ways of turning a pre-trained T5 encoder-decoder model into a sentence embedding model: (i) using the first token representation of the encoder; (ii) averaging all token representations from the encoder; (iii) using the first token representation from the decoder.

Conclusion

  • encoder-only models have strong transfer performance while encoder-decoder models perform better on textual similarity tasks
  • We also demonstrate the effectiveness of scaling up the model size, which greatly improves sentence embedding quality

If T5 were studied further, could we extract its per-layer representations for analysis, to see which layers contribute to which tasks? But since T5 is designed to be universal in the first place, this may not be particularly valuable…

Multi-task Retrieval for Knowledge-Intensive Tasks

  • What is multi-task retrieval?

we target a retriever that can perform well on a wide variety of problems, without task-specific finetuning

  • What is a knowledge-intensive task? A collection of tasks? (?)

KILT (Knowledge Intensive Language Tasks) is a new unified benchmark to help AI researchers build models that are better able to leverage real-world knowledge to accomplish a broad range of tasks.

Motivation

Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data.

Weaknesses of existing Dense Retrieval:

First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low data regimes such as few- and zero-shot settings.

Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text.

Feature

By jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but also can lead to better performance on the downstream knowledge-intensive tasks when plugged into an existing system.

Experiment

  • The universal retriever performing comparably to task-specific models
  • Plugged the universal retriever into a larger pipeline and achieved better results
  • Evaluated the model’s performance in the zero-shot and few-shot settings.
    • our proposed approach performs comparably to BM25 in the zero shot setting, and quickly overtakes it even with minimal in-domain training
  • In Section 4.5 we evaluated a number of more complex variants of the model involving task specialisation, but failed to see clear performance improvements. Finally, in Section 4.6 we saw how a simple iterative approach to data augmentation can lead to better performance.

// I have to present a paper at next week's group meeting, ugh. I can hardly present these papers from 2021 and earlier, so next week it looks like I'll have to dig up a few more papers while also handling the software engineering init project.

// I haven't read that many papers, and I've barely even run BERT hands-on a few times. Once the Challenge Cup is over I really have to start writing code; otherwise it all stays too theoretical and too hand-wavy, and I don't feel my coding ability improving at all. (x

// Even the small assignment for Computer Network Principles requires reading TCP/IP papers. No way. Computer Network Principles, I sw*