Notes on "Machine Learning"
As the title suggests, these are notes on a set of machine learning notes. They mainly record important concepts and the reasoning behind them.
Link to the original notes:
Link to the corresponding videos:
In addition, this document covers some self-supervised learning models, such as how BERT and GPT work, and records the process of studying the Transformer model.
Machine Learning Notes
Introduction to Machine Learning
History and Basic Concepts
- Before deep learning existed, filtering rules were set with hand-crafted rules
- The machine learning process
- Training
- Define a set of functions as Model
- Evaluate the goodness of these functions
- Pick the best function $f^*$ from the Model
- Testing
- Using $f^*$
相关技术
- Supervised Learning
- Tasks
- Regression
- The output of the target function $f$ is scalar.
- Classification
- Binary classification (Output: yes/no)
- Multi-class classification
- Structured Learning
- The output is well-structured.
- How to select function set?
- Non-linear model, the most famous of which is Deep Learning
- Other non-linear models, like SVM…
- Semi-supervised Learning
- Makes use of unlabelled data
- Transfer Learning
- Pictures that are not related to the topic could help…?
- Unsupervised Learning
- Reinforcement Learning
- We do not tell the machine what the correct answer is; all it gets is a score telling it how well or how badly it did
Regression
Find a function $f$ such that, for any given feature $x$, the output is a scalar value.
Steps
- Model hypothesis: choose a model family (linear model)
- Model evaluation: how to judge the goodness of the candidate models (loss function)
- Model optimization: how to pick the best model (gradient descent)
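A minimal sketch of these three steps in Python/NumPy, with made-up data and an arbitrarily chosen learning rate:

```python
import numpy as np

# Made-up data for illustration: y is roughly 3*x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)

# Step 1: model hypothesis -- a linear model y = b + w * x
w, b = 0.0, 0.0

# Step 2: model evaluation -- squared-error loss
def loss(w, b):
    return np.mean((y - (b + w * x)) ** 2)

# Step 3: model optimization -- gradient descent
eta = 0.01                       # learning rate (arbitrary choice)
for step in range(1000):
    pred = b + w * x
    grad_w = np.mean(-2 * (y - pred) * x)
    grad_b = np.mean(-2 * (y - pred))
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b, loss(w, b))          # w and b should move towards 3 and 2
```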
Possible Problems
- Overfitting
- Customize learning rate
Improving the Steps
- Combine multiple linear models using $\delta$ (indicator) functions
- Introduce more parameters / features
- Regularization
- $L=\sum_n \left(\hat y^n-\left(b+\sum_i w_i x_i^n\right)\right)^2 + \lambda\sum_i w_i^2$
- Prefers a smoother function, i.e. one whose output is less sensitive to variations in the input
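A small self-contained sketch of the regularized loss above; the variable names and numbers are my own:

```python
import numpy as np

def regularized_loss(w, b, x, y, lam):
    """Squared error plus the L2 penalty lambda * sum_i w_i^2."""
    residual = y - (b + x @ w)          # x: (n, d), w: (d,)
    return np.sum(residual ** 2) + lam * np.sum(w ** 2)

# A larger lam penalizes large weights, preferring a smoother function
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(regularized_loss(np.array([0.5, -0.2]), 0.1, x, y, lam=0.1))
```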
Error Analysis
By analysing where the error comes from, we know where to start when trying to improve the model.
Concept
- Average Error = error due to “bias” + error due to variance
- Notation
- $\hat f$ := the actual function
- $f^*$:= the best function picked from the model trained from the training data
- $f^*$ is an estimator of $\hat f$
- Conclusion
- Simple Model
- Large Bias
- Small Variance
- Complex Model
- Small Bias
- Large Variance
How to diagnose?
- If your model cannot fit the training data, then you have large bias. (Underfitting)
- Redesign your model.
- Add more features…
- A more complex model…
- If your model can fit the training data, but has large error on the testing data, then you have large variance. (Overfitting)
- More data.
- Effective but not always practical.
- Regularization. (May do harm to bias)
Cross Validation
- In each round, divide your training set into a (smaller) training set and a validation set.
- Use training set to train your model, and use validation set to pick the best one.
- This way, the average error on the testing set can better represent the real error when the model is applied.
- What if the validation set has its biases? N-fold Cross Validation
- First pick the best model using the validation approach, then train it on the whole training set.
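A sketch of N-fold cross-validation; `train` and `evaluate` are hypothetical stand-ins (here a trivial mean predictor) so the snippet runs on its own:

```python
import numpy as np

def n_fold_cv(x, y, n_folds, train, evaluate):
    """Average validation error of one candidate model over n folds."""
    folds = np.array_split(np.arange(len(x)), n_folds)
    errors = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        model = train(x[train_idx], y[train_idx])
        errors.append(evaluate(model, x[val_idx], y[val_idx]))
    return np.mean(errors)

# Stand-in model: always predict the mean of the training targets
train = lambda x, y: y.mean()
evaluate = lambda m, x, y: np.mean((y - m) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)
print(n_fold_cv(x, y, n_folds=3, train=train, evaluate=evaluate))
# Use this average error to compare candidate models, then retrain the
# chosen one on the whole training set.
```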
Gradient Descent
Tuning learning rates
Visualize the loss as a function of the number of parameter updates.
Adaptive learning rates
- Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
- (E.g.) $1/t$ decay: $\eta^{(t)} = \frac {\eta^{(0)}} {\sqrt {t+1}}$
- But one learning rate cannot fit all parameters. That is to say, we need to give different parameters different learning rates.
Adagrad
- Divide the learning rate of each parameter by the root mean square of its previous derivatives.
- Vanilla gradient descent: $w^{(t+1)} \leftarrow w^{(t)} - \eta^{(t)}g^{(t)}$, $t\ge0$.
- $g^{(t)}$ is the partial derivative of the loss with respect to $w$ at step $t$
- Adagrad: $w^{(t+1)} \leftarrow w^{(t)} - \frac {\eta^{(t)}}{\sigma^{(t)}}g^{(t)}$, $t\ge0$.
- $\sigma^{(t)} = \sqrt{\frac 1 {t+1} \sum_{i=0}^t[(g^{(i)})^2]}$
- If we use $1/t$ decay and Adagrad together, we easily obtain:
- $w^{(t+1)} = w^{(t)}-\frac{\eta^{(0)}}{\sqrt {\sum_{i=0}^t[(g^{(i)})^2]}}g^{(t)}$
The best step size is $\frac{|\text{first derivative}|}{\text{second derivative}}$
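A sketch of the Adagrad update from the formulas above, for a single scalar parameter and a made-up loss:

```python
import numpy as np

def adagrad(grad, w0, eta0, steps):
    """Divide the base learning rate by the root of the accumulated
    squared gradients (the combined 1/t-decay + Adagrad form above)."""
    w, acc = w0, 0.0
    eps = 1e-8                       # avoid division by zero (my addition)
    for t in range(steps):
        g = grad(w)
        acc += g ** 2
        w -= eta0 / (np.sqrt(acc) + eps) * g
    return w

# Example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
print(adagrad(lambda w: 2 * (w - 3.0), w0=0.0, eta0=1.0, steps=2000))
# The result approaches 3; the effective step shrinks as gradients accumulate.
```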
Stochastic Gradient Descent
Stochastic gradient descent makes your training faster.
- Update the parameters after processing each single example.
Feature Scaling
- If the ranges of two input features are very different, it is recommended to rescale them so that the different inputs share the same range.
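One common way to do this is standardization to zero mean and unit variance, sketched with made-up numbers:

```python
import numpy as np

# Two features with very different scales
x = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)
print(x_scaled)        # each column now has mean 0 and standard deviation 1
```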
Possible Problems
Classification: Probabilistic Classification Models
Regression Models vs. Probabilistic Models
- Regression models have their flaws when used for classification.
- Ideal Alternatives
Generative Model
- How do we reformulate the problem?
- Drawing one ball from two boxes: what is the probability that the drawn ball is a blue ball from box 1?
- Analogously, drawing an x from two classes: what is the probability that the drawn x comes from class 1?
- This can be turned into: given a random x, which class does it belong to (the class with the relatively higher probability)?
- If $P(C_1|x) \ge 0.5$, then output $C_1$.
- Else output $C_2$.
- Prior
- Compute $P(C_1)$ and $P(C_2)$: $P(C_1) = N(C_1)/N(\text{All})$
- Probability from class, i.e. $P(x|C_1)$?
- We assume that all the points in the training data are sampled from a single Gaussian distribution.
- https://zh.wikipedia.org/zh-hans/%E5%A4%9A%E5%85%83%E6%AD%A3%E6%80%81%E5%88%86%E5%B8%83
- $f_{\mu, \Sigma}(x) = \frac 1 {(2\pi)^{D/2}}\frac 1 {|\Sigma|^{1/2}} exp\{-\frac 1 2 (x-\mu)^T\Sigma^{-1}(x-\mu)\}$
- Then how do we find $\mu$ and $\Sigma$? Maximum likelihood estimation.
- Likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$:
- $L(\mu, \Sigma) = \prod_{i=1}^n f_{\mu, \Sigma}(x^{(i)})$
- Assume that $\mu^*, \Sigma^*$ are the parameters of the Gaussian distribution with the maximum likelihood.
- And the solution…
- $\mu^* = \frac 1 n \sum_{i=1}^nx^{(i)}$
- $\Sigma^* = \frac 1 n \sum_{i=1}^n (x^{(i)}-\mu^*)(x^{(i)}-\mu^*)^T$
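A sketch of the maximum-likelihood estimates above and the resulting posterior $P(C_1|x)$, on made-up two-dimensional data:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """The multivariate Gaussian density f_{mu, Sigma}(x) defined above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def mle_gaussian(points):
    """Maximum-likelihood mean and covariance for one class's data."""
    mu = points.mean(axis=0)
    diff = points - mu
    return mu, diff.T @ diff / len(points)

# Made-up training data for two classes
rng = np.random.default_rng(0)
c1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
c2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))

mu1, sig1 = mle_gaussian(c1)
mu2, sig2 = mle_gaussian(c2)
p1, p2 = len(c1) / 100, len(c2) / 100        # priors P(C1), P(C2)

x = np.array([1.0, 1.0])                     # a new point to classify
post1 = p1 * gaussian_pdf(x, mu1, sig1)
post1 /= post1 + p2 * gaussian_pdf(x, mu2, sig2)
print("P(C1 | x) =", post1)                  # output C1 if this is >= 0.5
```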
Modifying Model
- Using the same covariance matrix
- $L(\mu^1, \mu^2, \Sigma)$
- Where $\Sigma = \frac {N(C_1)} {N(All)}\Sigma^1 + \frac {N(C_2)} {N(All)}\Sigma^2$
- After some derivation, our model can be written as $P_{w,b}(C_1|x) = \sigma(z), z=w \cdot x + b$
If we assume that all the features are generated independently, the classifier is called a Naive Bayes Classifier.
Logistic Regression
- Review
- Step 1: Function Set
- We want to find $P_{w,b}(C_1|x)$.
- If $P_{w,b}(C_1|x) \ge 0.5$ then output $C_1$, else output $C_2$.
- $f_{w,b}(x):=P_{w,b}(C_1|x) = \sigma(z)$, where $z=w\cdot x + b$
- Step 2: Goodness of the function
- Cross Entropy
- Step 3: Find the best
- The partial derivatives are the same as those in linear regression.
- Logistic regression is called a discriminative method.
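A sketch of the three steps for logistic regression on made-up data: the sigmoid function set, the cross-entropy objective, and gradient descent (whose update has the same form as in linear regression):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up binary data: class C1 centred at (2, 2), the other at (-2, -2)
x = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])      # 1 means class C1

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Step 1: function set  f_{w,b}(x) = sigma(w . x + b)
w, b = np.zeros(2), 0.0

# Steps 2 & 3: minimize the cross entropy by gradient descent;
# the gradient is (f(x) - y) * x, the same form as in linear regression
eta = 0.1
for step in range(500):
    f = sigmoid(x @ w + b)
    w -= eta * (x.T @ (f - y)) / len(y)
    b -= eta * np.mean(f - y)

f = np.clip(sigmoid(x @ w + b), 1e-12, 1 - 1e-12)     # avoid log(0)
print("cross entropy:", -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f)))
```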
Discriminative vs. Generative
- Same model. $P(C_1|x) = \sigma(w\cdot x + b)$
- Logistic Regression: Directly find $w$ and $b$
- Generative Model: Find $\mu^1$, $\mu^2$, $\Sigma^{-1}$
- But we won’t obtain the same set of $w$ and $b$.
Multi-class Classification
Softmax
- Definition of the target
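A minimal softmax, which turns class scores into probabilities that sum to 1:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # roughly [0.88, 0.12, 0.00]
```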
Limitation of logistic regression
Feature Transformation
- Middle Layer!
- We can connect many logistic regression units together to perform feature transformation.
Deep Learning
Step 1: Neural Network
- Fully Connect Feedforward Network
- Why call it deep? Deep = Many Hidden Layers
- Essence: the hidden layers perform feature transformation
Step 2: Loss Function
- Cross Entropy
Step 3: Find the best function
- Gradient Descent
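A sketch of the three steps for a tiny fully connected feedforward network on made-up data: a ReLU hidden layer as the feature transformation, cross-entropy loss, and a manual gradient-descent/backpropagation loop:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-class data: blobs centred at (2, 2) and (-2, -2)
x = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.concatenate([np.zeros(50, int), np.ones(50, int)])
y_onehot = np.eye(2)[y]

# Step 1: the network -- one hidden layer (feature transformation) + output layer
H = 8
W1, b1 = rng.normal(0, 0.1, (2, H)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (H, 2)), np.zeros(2)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)                  # ReLU hidden layer
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)      # softmax output

# Steps 2 & 3: cross-entropy loss, gradient descent (manual backprop)
eta = 0.1
for step in range(500):
    h, p = forward(x)
    dz = (p - y_onehot) / len(x)
    dW2, db2 = h.T @ dz, dz.sum(axis=0)
    dh = dz @ W2.T
    dh[h <= 0] = 0                                  # ReLU gradient
    dW1, db1 = x.T @ dh, dh.sum(axis=0)
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2

_, p = forward(x)
print("cross entropy:", -np.mean(np.sum(y_onehot * np.log(p), axis=1)))
```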
Why deep?
- More parameters, better performance
Universality Theorem
- Any continuous function $f:R^N \rightarrow R^M$ can be realized by a network with one hidden layer with enough neurons.
- So why Deep Learning, not Fat Learning?
CNN (Convolutional Neural Network)
Why CNN for image
- Some patterns are much smaller than the whole image. That is to say, a neuron only needs to be connected to a small region of the image, not the whole image.
- But the same pattern may appear in different regions in different images.
- We could let these neurons share their parameters…
- Subsampling the pixels will not change the object.
The whole CNN architecture
Image -> (Convolution -> Max Pooling)$^{+}$ -> Flatten -(as input)-> Fully Connected Feedforward Network
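A sketch of one convolution layer followed by 2×2 max pooling and flattening, on a made-up single-channel 6×6 image with one 3×3 filter:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide one filter over the image (no padding): each output value
    looks only at a small region, and the same weights are reused everywhere."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(feature_map, size=2):
    """Subsampling: keep the maximum of each size x size block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h // size * size, :w // size * size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)    # made-up 6x6 "image"
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])               # a vertical-edge filter

fmap = conv2d(image, kernel)       # 4x4 feature map
pooled = max_pool(fmap)            # 2x2 after max pooling
flat = pooled.flatten()            # flatten -> input to the fully connected network
print(fmap.shape, pooled.shape, flat.shape)
```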
Convolution
- Calculation approaches
- Feature Map
- Colorful Image
- Channel: the colour channels of the image
- What does CNN learn? Use gradient ascent to find the input $x^{*} = \arg \max_x{a^k}$
- Deep Dream: let CNN exaggerate what it sees
- Using AlphaGo as an example of how the architecture can be tailored to the task
RNN (Recurrent Neural Network)
- Example Application
- Slot filling: the neural network needs memory!
- Bidirectional RNN
- Why? To have a broader view of the context: the forward direction only sees what comes before, the backward direction only sees what comes after.
- Long Short-term Memory (LSTM)
- Input signal and output signal are learned by the network itself.
- How to train RNN?
- Loss function? Sum over cross entropy.
- Learning? Gradient descent.
- How to compute the partial derivatives? BPTT (backpropagation through time).
- The error surface may be very flat or very steep. Use gradient clipping (see the sketch after this list)…
- LSTM can deal with vanishing gradients, but not with exploding gradients.
- Applications
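A sketch of gradient clipping by global norm, one way to cope with the very steep parts of the RNN error surface mentioned above:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient when its norm exceeds max_norm, so that a
    single step taken on a steep 'cliff' of the error surface cannot blow up."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

# Example: a huge made-up gradient gets rescaled to norm 5
grads = [np.array([30.0, 40.0]), np.array([0.0])]
print(clip_by_norm(grads, max_norm=5.0))    # [[3., 4.], [0.]]
```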
Semi-supervised Learning
Introduction
- Semi-supervised learning $\{(x^r, \hat y^r)\}_{r=1}^R, \{x^u\}_{u=R}^{R+U}$
- A set of unlabeled data, usually $U\gg R$
- Categories of semi-supervised learning
- Transductive learning: unlabelled data is the testing data
- Inductive learning: unlabelled data is not in the testing data
- Why semi-supervised learning?
- Collecting data is easy, but collecting “labelled” data is expensive
- We do semi-supervised learning in our lives
Semi-supervised learning for Generative Model
Low-density separation assumption
- Assumption
- Between the two classes it is black or white (no ambiguous middle ground)
Smoothness assumption
- You take after the company you keep
- “similar” x has the same $\hat y$
- More precisely:
- x is not uniform.
- if $x^1$ and $x^2$ are close in a high density region (connected by a high density path)
- then $y^1$ and $y^2$ are the same
Self-supervised Learning
Analysing how self-supervised learning works, using BERT and GPT as examples.
BERT
Introduction
- BERT is a kind of transformer encoder.
Basic Steps
- Mask
- Randomly masking some tokens.
- Randomly replacing some tokens.
- Training goals:
- Predict the original tokens at the masked / replaced positions
- Next sentence prediction
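A sketch of the masking step: some positions are replaced with a [MASK] token or with random tokens, and the training target is to recover the original tokens there. The vocabulary and token IDs are made up; the 15% rate and the 80/10/10 split follow the recipe in the BERT paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, MASK_ID = 1000, 0                     # made-up vocabulary, [MASK] id
tokens = rng.integers(1, vocab_size, size=20)     # a made-up input sequence

corrupted = tokens.copy()
is_target = rng.random(len(tokens)) < 0.15        # pick ~15% of the positions
for i in np.where(is_target)[0]:
    r = rng.random()
    if r < 0.8:
        corrupted[i] = MASK_ID                    # usually: replace with [MASK]
    elif r < 0.9:
        corrupted[i] = rng.integers(1, vocab_size)   # sometimes: a random token
    # otherwise: keep the original token

# Training goal: the model sees `corrupted` and must predict the original
# tokens at the target positions (plus next sentence prediction).
print(tokens, corrupted, np.where(is_target)[0], sep="\n")
```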
Fine-tuning for downstream tasks
GLUE: General Language Understanding Evaluation
How to use BERT
- Case 1
- Input a sequence, output a class.
- Sentiment Analysis
- Case 2
- Input a sequence and output a sequence of the same length.
- POS tagging
Case 3
- Input two sequences and output a class
- Natural Language Inference (NLI)
Case 4
- Input a sequence and output a sequence
- Extraction-based Question Answering
Why does BERT work?
- The tokens with similar meanings have similar embeddings.
- You shall know a word by the company it keeps.
In fact, we can extract the representations from each layer of BERT, take a linear combination of them, and feed the result to specific downstream tasks, in order to probe what each layer of BERT is actually learning.
GPT-2
Architecture
GPT-2 resembles the decoder architecture of the Transformer.
Predict Next Token
The architecture is like the Transformer decoder, but with the cross-attention removed.
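A sketch of the predict-next-token loop: the model repeatedly appends its own prediction to the input and predicts again. The `next_token_distribution` below is a made-up stand-in for the real decoder-only Transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["machine", "learning", "is", "fun", "<eos>"]   # made-up vocabulary

def next_token_distribution(prefix):
    """Stand-in for GPT: a real model would run the prefix through a
    decoder-only Transformer and output a softmax over the vocabulary."""
    logits = rng.normal(size=len(vocab))
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Autoregressive generation: predict one token, append it, repeat
sequence = ["machine"]
while sequence[-1] != "<eos>" and len(sequence) < 10:
    probs = next_token_distribution(sequence)
    sequence.append(vocab[rng.choice(len(vocab), p=probs)])
print(sequence)
```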
How to use GPT?
GPT-3
Aimed at few-shot learning; about 175 billion parameters.
Reference
Transformer
Introduction
- Transformer is a sequence-to-sequence (seq2seq) model.
- Application…
- Speech Recognition
- Machine Translation
- Speech Translation
- Syntactic Parsing
- Multi-label Classification
- Most NLP applications could be considered as Question Answering
- Compared with using a plain seq2seq model, a task-specific model is more suitable for some of these tasks
- Architecture
Encoder
Takes a set of vectors as input and outputs another set of vectors.
- Self-attention, CNN, RNN… All of them could realize this!
- Residual Connection
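A sketch of single-head self-attention with a residual connection, the core of the encoder block; the sizes and inputs are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One attention head: every output vector is a weighted sum of all the
    value vectors, so each position can look at the whole input sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise attention scores
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d = 4, 8                             # made-up sizes
x = rng.normal(size=(seq_len, d))             # a set of input vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(x, Wq, Wk, Wv) + x       # residual connection
print(out.shape)                              # a set of vectors of the same size
```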
Decoder
(1) Autoregressive (AT)
- Architecture
- Masked Multi-head Attention?
- Like RNN…
- Adding “Stop Token”
(2) Non-autoregressive (NAT)
Encoder-Decoder
Training
- Minimize the cross entropy
- The input of the decoder is the ground truth (teacher forcing).
Tips for training
- Copy Mechanism
- Chat-bot
Guided Attention
- Monotonic attention
- Location-aware attention
Beam Search
- Scheduled Sampling