As the title suggests, these are notes on machine learning notes. The goal is mainly to record some important concepts and the thought process behind them.

Link to the original notes these notes are based on:

Link to the corresponding videos:

In addition, this article covers some self-supervised learning models, such as how BERT and GPT work, and records the process of learning the Transformer model.

Machine Learning Notes

Introduction to Machine Learning

History and Basic Concepts

  • Before deep learning existed, filtering rules were set with hand-crafted rules
  • The machine learning process
    • Training
      • Define a set of functions as Model
      • Evaluate the goodness of these functions
      • Pick the best function $f^*$ from the Model
    • Testing
      • Using $f^*$

img

Related Techniques

  • Supervised Learning
    • Tasks
      • Regression
        • The output of the target function $f$ is scalar.
      • Classification
        • Binary classification (Output: yes/no)
        • Multi-class classification
      • Structured Learning
        • The output is well-structured.
    • How to select function set?
      • Non-linear model, the most famous of which is Deep Learning
      • Other non-linear models, like SVM…
  • Semi-supervised Learning
    • Makes use of unlabelled data in addition to labelled data
  • Transfer Learning
    • Even data that is not directly related to the task (e.g. unrelated images) may still help…?
  • Unsupervised Learning
  • Reinforcement Learning
    • We do not tell the machine what the correct answer is; all the machine gets is a score telling it how well or how badly it did

Regression

Find a function $f$ such that, for any given feature $x$, it outputs a scalar value.

Steps

  • Model assumption: choose a model framework (e.g. a linear model)
  • Model evaluation: how to judge how good the candidate models are (loss function)
  • Model optimization: how to pick the best model (gradient descent)

Possible Problems

  • Overfitting
  • Having to customize the learning rate

Improving the Steps

  • Combine multiple linear models using $\delta$ (indicator) functions
  • Give the model more parameters
  • Regularization (a small sketch follows below)
    • $L=\sum_n \big(y^n-(b+\sum_i w_i x_i^n)\big)^2 + \lambda\sum_i w_i^2$
    • Encourages a smoother function that is less sensitive to the input
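
To make the three steps concrete, here is a minimal NumPy sketch of a linear model trained by gradient descent with the L2 regularization term above; the toy data, learning rate, and $\lambda$ are all illustrative assumptions, not values from the course.

```python
import numpy as np

# Toy data: 100 examples with 3 features (illustrative only)
np.random.seed(0)
X = np.random.randn(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + 0.1 * np.random.randn(100)

w, b = np.zeros(3), 0.0
eta, lam = 0.001, 0.1                        # learning rate and regularization strength

for _ in range(1000):
    err = y - (X @ w + b)                    # per-example residuals
    # L = sum_n err_n^2 + lam * sum_i w_i^2
    grad_w = -2 * X.T @ err + 2 * lam * w    # dL/dw
    grad_b = -2 * err.sum()                  # dL/db (the bias is not regularized)
    w -= eta * grad_w
    b -= eta * grad_b
```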

Error Analysis

By analyzing the sources of error, we know where to start when trying to improve the model.

Concept

  • Average Error = error due to “bias” + error due to variance
  • Notation
    • $\hat f$ := the actual function
    • $f^*$:= the best function picked from the model trained from the training data
    • $f^*$ is an estimator of $\hat f$

image-20210908141646887

img

  • Conclusion
    • Simple Model
      • Large Bias
      • Small Variance
    • Complex Model
      • Small Bias
      • Large Variance

image-20210908143937650

How to diagnose?

  • If your model cannot fit the training data, then you have large bias. (Underfitting)
    • Redesign your model.
      • Add more features…
      • A more complex model…
  • If your model can fit the training data, but has a large error on the testing data, then you have large variance. (Overfitting)
    • More data.
      • Effective but not always practical.
    • Regularization. (May do harm to bias)

Cross Validation

  • Divide your training set into a (smaller) training set and a validation set.
  • Use the training set to train your models, and use the validation set to pick the best one.
  • In this way, the average error on the testing set better reflects the real error when the model is applied.
  • What if the validation set itself is biased? N-fold Cross Validation

image-20210908145850518

  • First pick the best model using the validation approach, then train it on the whole training set (see the sketch below).
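
A minimal NumPy sketch of N-fold cross validation; the `train_fn` and `error_fn` callables are hypothetical placeholders standing in for whatever model and error measure you are comparing.

```python
import numpy as np

def n_fold_cv(X, y, train_fn, error_fn, n_folds=3):
    """Rotate through n folds: each fold serves once as the validation set."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train_fn(X[trn], y[trn])           # hypothetical training routine
        errors.append(error_fn(model, X[val], y[val]))
    return np.mean(errors)                         # average validation error

# Pick the model whose average validation error is lowest,
# then retrain that model on the whole training set.
```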

Gradient Descent

Tuning learning rates

Visualize the loss as a function of the number of parameter updates.

image-20210908153835553

Adaptive learning rates

  • Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
    • (E.g.) $\frac 1 t \ Decay$: $\eta^{(t)} = \frac {\eta^{(0)}} {\sqrt {t+1}}$
    • But one learning rate cannot fit all parameters. That is to say, we need to give different parameters different learning rates.
  • Adagrad

    • Divide the learning rate of each parameter by the root mean square of its previous derivatives.
    • Vanilla gradient descent: $w^{(t+1)} \leftarrow w^{(t)} - \eta^{(t)}g^{(t)}$, $t\ge0$.
      • $g^{(t)}$ is the partial derivative $\partial L/\partial w$ evaluated at $w^{(t)}$
    • Adagrad: $w^{(t+1)} \leftarrow w^{(t)} - \frac {\eta^{(t)}}{\sigma^{(t)}}g^{(t)}$, $t\ge0$.
      • $\sigma^{(t)} = \sqrt{\frac 1 {t+1} \sum_{i=0}^t[(g^{(i)})^2]}$
    • If we use $\frac 1 t\ Decay$ and $Adagrad$ together, the update simplifies to:
      • $w^{(t+1)} = w^{(t)}-\frac{\eta^{(0)}}{\sqrt {\sum_{i=0}^t[(g^{(i)})^2]}}g^{(t)}$
  • The best step size is $\frac{|\text{first derivative}|}{\text{second derivative}}$
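
A minimal NumPy sketch of the Adagrad update above on a toy quadratic loss; the loss function and the initial learning rate are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w, eta0 = 0.0, 1.0
sum_sq_grad = 0.0                        # accumulates (g^(i))^2 over all past steps

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    # w^(t+1) = w^(t) - eta0 / sqrt(sum_i (g^(i))^2) * g^(t)
    w -= eta0 / np.sqrt(sum_sq_grad) * g

print(w)   # moves toward the minimum at w = 3
```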

Stochastic Gradient Descent

Stochastic gradient descent makes your training faster.

  • Update the parameters after seeing each single example.

Feature Scaling

image-20210908161121857

  • If the ranges of two input features are very different, it is recommended to scale them so that different input features have the same range (a sketch follows below).
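
A minimal NumPy sketch of the usual approach (standardize each dimension to zero mean and unit variance); the data matrix is a made-up example.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])             # feature 2 has a much larger range

mean = X.mean(axis=0)                    # per-dimension mean
std = X.std(axis=0)                      # per-dimension standard deviation
X_scaled = (X - mean) / std              # every dimension now has mean 0, std 1
```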

image-20210908161952204

Possible Problems

img

Classification (Probabilistic Models)

Regression Models vs. Probabilistic Models

  • Using a regression model for classification has its drawbacks.
  • Ideal Alternatives
    • img

Generative Model

img

  • How do we reformulate the problem?
    • Drawing a ball from one of two boxes: what is the probability that the drawn ball came from box 1?
    • Analogously, drawing an $x$ from one of two classes: what is the probability that $x$ came from class 1?
    • This can be rephrased as: given an arbitrary $x$, which class does it belong to (the class with the larger posterior probability)?
      • If $P(C_1|x) \ge 0.5$, then output $C_1$.
      • Else output $C_2$.
  • Prior
    • Compute $P(C_1)$ and $P(C_2)$: $P(C_1) = N(C_1)/N(\text{All})$
  • Probability from Class?
  • How do we find $\mu$ and $\Sigma$? Maximum likelihood estimation.
    • Likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$:
    • $L(\mu, \Sigma) = \prod_{i=1}^n f_{\mu, \Sigma}(x^{(i)})$
    • Assume that $\mu^*, \Sigma^*$ are the parameters of the Gaussian distribution that maximize the likelihood.
    • The solution is:
      • $\mu^* = \frac 1 n \sum_{i=1}^nx^{(i)}$
      • $\Sigma^* = \frac 1 n \sum_{i=1}^n (x^{(i)}-\mu^*)(x^{(i)}-\mu^*)^T$
  • Modifying the Model

    • Use the same covariance matrix for both classes
    • $L(\mu^1, \mu^2, \Sigma)$
      • Where $\Sigma = \frac {N(C_1)} {N(All)}\Sigma^1 + \frac {N(C_2)} {N(All)}\Sigma^2$
    • After some derivation, the model can be written as $P_{w,b}(C_1|x) = \sigma(z)$, where $z=w \cdot x + b$ (a sketch follows below)
  • If we assume that all features are generated independently of each other, the classifier is called a Naive Bayes Classifier
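
A minimal NumPy sketch of the generative model above: estimate the priors, the class means, and the shared covariance matrix by maximum likelihood, then use the derived closed form $P(C_1|x)=\sigma(w\cdot x+b)$. The closed-form expressions for $w$ and $b$ follow the derivation in the course; any toy data you feed in is your own assumption.

```python
import numpy as np

def fit_generative(X1, X2):
    """X1, X2: training examples of class 1 and class 2 (rows are feature vectors)."""
    n1, n2 = len(X1), len(X2)
    p1 = n1 / (n1 + n2)                              # prior P(C1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # maximum-likelihood means
    cov1 = (X1 - mu1).T @ (X1 - mu1) / n1
    cov2 = (X2 - mu2).T @ (X2 - mu2) / n2
    cov = p1 * cov1 + (1 - p1) * cov2                # shared covariance matrix
    inv = np.linalg.inv(cov)
    # Closed form: P(C1|x) = sigmoid(w.x + b)
    w = inv @ (mu1 - mu2)
    b = (-0.5 * mu1 @ inv @ mu1 + 0.5 * mu2 @ inv @ mu2 + np.log(n1 / n2))
    return w, b

def posterior_c1(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))        # sigma(w.x + b)
```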

Logistic Regression

  • Review
    • Step 1: Function Set
      • We want to find $P_{w,b}(C_1|x)$​.
        • If $P_{w,b}(C_1|x) \ge 0.5$ then output $C_1$, else output $C_2$.
      • $f_{w,b}(x):=P_{w,b}(C_1|x) = \sigma(z)$, where $z=w\cdot x + b$​
    • Step 2: Goodness of the function
      • image-20210909203309727
      • img
      • img
      • Cross Entropy
    • Step 3: Find the best
      • image-20210909205827591
      • The partial derivatives have the same form as those in linear regression.
      • Logistic regression is called a discriminative method (a sketch follows below).
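
A minimal NumPy sketch of the three steps for logistic regression: the sigmoid model, the cross-entropy loss, and the gradient update $-\sum_n(\hat y^n - f(x^n))\,x^n$, which has the same form as linear regression. The toy data and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: y = 1 means class C1, y = 0 means class C2
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b, eta, eps = np.zeros(2), 0.0, 0.1, 1e-12
for _ in range(1000):
    f = sigmoid(X @ w + b)                                         # Step 1: f_{w,b}(x) = P(C1|x)
    loss = -np.sum(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))  # Step 2: cross entropy
    grad_w = -(y - f) @ X                                          # Step 3: same form as linear regression
    grad_b = -np.sum(y - f)
    w -= eta * grad_w
    b -= eta * grad_b
```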

Discriminative v.s. Generative

  • Same model. $P(C_1|x) = \sigma(w\cdot x + b)$
    • Logistic Regression: Directly find $w$ and $b$
    • Generative Model: Find $\mu^1$, $\mu^2$, $\Sigma^{-1}$
    • But we won’t obtain the same set of $w$ and $b$.

Multi-class Classification

Softmax

img

  • Definition of the target

img
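
A minimal sketch of the softmax function used for multi-class classification; the three example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()              # outputs lie in (0, 1) and sum to 1

print(softmax(np.array([3.0, 1.0, -3.0])))   # roughly [0.88, 0.12, 0.00]
```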

Limitation of logistic regression

Feature Transformation

img

  • Middle Layer!
    • Many logistic regression units can be cascaded together to perform feature transformation.
    • img

Deep Learning

Step 1: Neural Network

  • Fully Connect Feedforward Network
    • Why call it deep? Deep = Many Hidden Layers
    • Essence: feature transformation through the hidden layers

Step 2: Loss Function

  • Cross Entropy

Step 3: Find the best function

  • Gradient Descent
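
The three steps map directly onto code. Below is a minimal PyTorch sketch (the network width, learning rate, and random toy data are all illustrative assumptions, not values from the course).

```python
import torch
import torch.nn as nn

# Step 1: define the function set (a fully connected feedforward network)
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),      # hidden layers perform feature transformation
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 3),                  # 3-class output (logits)
)

# Step 2: loss function = cross entropy
criterion = nn.CrossEntropyLoss()

# Step 3: find the best function by gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 10)                # toy batch of 64 examples
y = torch.randint(0, 3, (64,))         # toy labels
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```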

Why deep?

  • More parameters, better performance
  • Universality Theorem

    • Any continuous function $f:R^N \rightarrow R^M$​ can be realized by a network with one hidden layer with enough neurons.
    • So why Deep Learning, not Fat Learning?

CNN (Convolutional Neural Network)

Why CNN for image

  • Some patterns are much smaller than the whole image. That is to say, a neuron only needs to be connected to a small region of the image, not the whole image.
  • But the same pattern may appear in different regions in different images.
    • We could let these neurons share their parameters…
  • Subsampling the pixels will not change the object.

The whole CNN architecture

Image -> (Convolution -> Max Pooling)$^{+}$ -> Flatten -(as input)-> Fully Connected Feedforward Network
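
A minimal PyTorch sketch of this pipeline for a 3-channel 32x32 image; the number of filters, kernel sizes, and output classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # Convolution: 16 filters over the RGB channels
    nn.ReLU(),
    nn.MaxPool2d(2),                               # Max Pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 16x16 -> 8x8
    nn.Flatten(),                                  # Flatten: 32 * 8 * 8 = 2048 values
    nn.Linear(32 * 8 * 8, 10),                     # fully connected feedforward part
)

out = cnn(torch.randn(1, 3, 32, 32))               # logits for 10 classes
```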

Convolution

  • Calculation approaches
  • Feature Map
  • Colorful Image

image-20211002101802253

  • Channel: color channel
  • What does a CNN learn? Use gradient ascent to find the input $x^{*} = \arg \max_x{a^k}$ (the input that maximally activates the $k$-th filter)
  • Deep Dream: let the CNN exaggerate what it sees
  • Using AlphaGo as an example to show that the architecture can be tailored to the task

RNN (Recurrent Neural Network)

  • Example Application
    • Slot filling: the neural network needs memory!
    • image-20211002120218869
    • Bidirectional RNN
      • image-20211002120401629
      • Why? To get a broader view of the context: the forward direction only sees what comes before, and the backward direction only sees what comes after.
    • Long Short-term Memory (LSTM)
      • The signals that control the gates (when to let input in and when to output) are learned by the network itself.
      • image-20211002120742395
      • image-20211002121407528
    • How to train RNN?
      • Loss function? Sum over cross entropy.
      • Learning? Gradient descent.
        • How to compute the partial derivatives? BPTT (Backpropagation through time).
        • The error surface may be very flat or very steep. Use gradient clipping (see the sketch after this list)…
        • LSTM can deal with gradient vanishing, but not with gradient exploding.
    • Applications
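
A minimal PyTorch sketch of training an LSTM with the gradient clipping mentioned above; the toy sequence data, hidden size, number of labels, and clipping threshold are all illustrative assumptions.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)                       # e.g. 4 slot labels per time step
criterion = nn.CrossEntropyLoss()
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(32, 10, 8)                    # toy batch: 32 sequences of length 10
y = torch.randint(0, 4, (32, 10))             # toy per-step labels

for _ in range(50):
    optimizer.zero_grad()
    out, _ = rnn(x)                           # (32, 10, 16)
    logits = head(out)                        # (32, 10, 4)
    loss = criterion(logits.reshape(-1, 4), y.reshape(-1))   # cross entropy over all steps
    loss.backward()                           # gradients computed via BPTT
    # Clip the gradient norm so a steep cliff does not blow up the update
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
```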

Semi-supervised Learning

Introduction

  • Semi-supervised learning $\{(x^r, \hat y^r)\}_{r=1}^R, \{x^u\}_{u=R}^{R+U}$
    • A set of unlabeled data, usually $U\gg R$
  • Two categories of semi-supervised learning
    • Transductive learning: unlabelled data is the testing data
    • Inductive learning: unlabelled data is not in the testing data
  • Why semi-supervised learning?
    • Collecting data is easy, but collecting “labelled” data is expensive
    • We do semi-supervised learning in our lives

Semi-supervised learning for Generative Model

image-20211003005516326

Low-density separation assumption

  • Assumption
    • Between the two classes it is black or white: the region between them has low density

image-20211003010157597

image-20211003011038595

Smoothness assumption

  • "One takes on the color of one's company": a data point is influenced by its neighbours
    • “similar” x has the same $\hat y$
    • More precisely:
      • x is not uniform.
      • if $x^1$ and $x^2$​ are close in a high density region (connected by a high density path)
      • then $y^1$ and $y^2$ are the same

Self-supervised Learning

BERT and GPT are used as examples to analyze how self-supervised learning works.

BERT

Introduction

  • BERT is a kind of transformer encoder.

Basic Steps

  • Mask
    • Randomly masking some tokens.
    • Randomly replacing some tokens.

image-20211003222046934
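
A minimal Python sketch of the masking step: some tokens are chosen at random, then either masked or replaced by a random token (the 15%/80%/10% proportions follow the BERT paper, and the tiny vocabulary is a made-up illustration).

```python
import random

MASK = "[MASK]"
vocab = ["apple", "book", "cat", "dog", "eat", "run"]     # toy vocabulary

def mask_tokens(tokens, p=0.15):
    """Return (corrupted tokens, target positions). BERT is trained to
    predict the original token at every selected position."""
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() < p:
            targets.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                        # randomly mask
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)        # randomly replace
            # else: keep the original token unchanged
    return corrupted, targets

print(mask_tokens(["the", "cat", "eat", "the", "apple"]))
```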

  • Training objective:

image-20211003222105101

  • Next sentence prediction

image-20211003222412031

Fine-tuning for downstream tasks

image-20211003223325577

GLUE: General Language Understanding Evaluation

image-20211003234413065

How to use BERT

  • Case 1
    • Input a sequence, output a class.
      • Sentiment Analysis

image-20211003234839607
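
A minimal sketch of Case 1 using the Hugging Face `transformers` library (a library choice I am assuming, not one prescribed by the course): a pre-trained BERT plus a randomly initialized classification head on the [CLS] token, to be fine-tuned on labelled sentiment data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # 2 classes: positive / negative

inputs = tokenizer("this movie is great", return_tensors="pt")
logits = model(**inputs).logits               # class scores from the [CLS] representation
print(logits.argmax(dim=-1))                  # fine-tune on labelled data before trusting this
```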

  • Case 2
    • Input a sequence and output a sequence of the same length.
      • POS tagging

image-20211003235251180

  • Case 3

    • Input two sequences and output a class
      • Natural Language Inference (NLI)
  • Case 4

    • Input a sequence and output a sequence
      • Extraction-based Question Answering

image-20211004215647151

image-20211004220019462

Why does BERT work?

  • The tokens with similar meanings have similar embeddings.
  • You shall know a word by the company it keeps.

In fact, we can extract the representations contained in each layer of BERT, take a linear combination of them, and feed it to specific tasks, in order to infer what each layer of BERT is actually learning.

GPT-2

Architecture

GPT-2 is similar to the decoder architecture of the Transformer.

image-20220207145205606

Predict Next Token

image-20211004225110571

The architecture is like the Transformer decoder, with the cross-attention removed.
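
A minimal sketch of "predict the next token" as a greedy generation loop, assuming the Hugging Face `transformers` implementation of GPT-2; the prompt and the number of generated tokens are arbitrary.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok.encode("Machine learning is", return_tensors="pt")
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits          # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()        # greedy: take the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append and repeat
print(tok.decode(ids[0]))
```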

How to use GPT?

image-20211004230723769

GPT-3

Aimed at few-shot learning, with 175 billion parameters.

image-20220207150320965

Reference

Transformer

Introduction

  • Transformer is a sequence-to-sequence (Seq2seq) model.
  • Application…
    • Speech Recognition
    • Machine Translation
    • Speech Translation
    • Syntactic Parsing
    • Multi-label Classification
  • Most NLP applications could be considered as Question Answering
  • Compared with a plain seq2seq model, a task-specific model is more suitable for some tasks
  • Architecture
    • image-20211003151722794

Encoder

  • Takes a set of vectors as input and outputs another set of vectors (a self-attention sketch follows at the end of this subsection).

    • Self-attention, CNN, RNN… All of them could realize this!
    • image-20211003151915990
  • image-20211003152146306

  • Residual Connection
    • image-20211003152616828
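
A minimal NumPy sketch of single-head scaled dot-product self-attention, the layer that lets the encoder turn one set of vectors into another; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Returns a new set of vectors of the same length."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise attention scores
    return softmax(scores) @ V                  # each output attends to every input

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 input vectors of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)             # (5, 8)
```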

Decoder

(1) Autoregressive (AT)

image-20211003153626358

image-20211003153834166

  • Architecture

image-20211003153922255

  • Masked Multi-head Attention?
    • Like RNN…
    • image-20211003154121360
  • Adding “Stop Token”
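
Compared with the self-attention sketch in the Encoder section, the masked attention of the autoregressive decoder only needs one extra step: each position is prevented from attending to later positions, similar to how an RNN only sees the past. A minimal NumPy illustration of the mask itself (the sequence length is arbitrary):

```python
import numpy as np

seq_len = 4
# scores[i, j] is how much position i attends to position j
scores = np.zeros((seq_len, seq_len))
# Causal mask: position i may only look at positions j <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf        # these entries become 0 after the softmax
print(scores)
```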

(2) Non-autoregressive (NAT)

image-20211003155032111

Encoder-Decoder

image-20211003155352197

image-20211003155455569

Training

  • Minimize the cross entropy
    • During training, the input of the decoder is the ground truth (teacher forcing).

image-20211003160151665

Tips for training

  • Copy Mechanism
    • Chat-bot
  • Guided Attention

    • Monotonic attention
    • Location-aware attention
  • Beam Search

image-20211003161025399

image-20211003161307181

  • Scheduled Sampling

Reference