# Machine Learning Notes

## Introduction to Machine Learning

### History and Basic Concepts

• Before deep learning existed, filtering rules were written by hand (hand-crafted rules)
• The machine learning process
  • Training
    • Define a set of functions as the Model
    • Evaluate the goodness of these functions
    • Pick the best function $f^*$ from the Model
  • Testing
    • Apply $f^*$

### Related Techniques

• Supervised Learning
  • Regression
    • The output of the target function $f$ is a scalar.
  • Classification
    • Binary classification (output: yes/no)
    • Multi-class classification
  • Structured Learning
    • The output is well-structured.
  • How to select the function set?
    • Non-linear models, the most famous of which is Deep Learning
    • Other non-linear models, like SVM…
• Semi-supervised Learning
  • Makes use of unlabelled data
• Transfer Learning
  • Pictures that are not related to the topic could help…?
• Unsupervised Learning
• Reinforcement Learning
  • We don't tell the machine what the correct answer is; all it has is a score telling it whether it did well or badly.

## Regression

### Steps

• Model assumption: choose a model framework (a linear model)
• Model evaluation: judge how good each candidate model is (loss function)
• Model optimization: pick the best model (gradient descent); see the sketch after this list
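A minimal end-to-end sketch of the three steps for a one-feature linear model; the toy data, learning rate, and iteration count are made-up choices:

```python
import numpy as np

# Toy data generated from y = 3x + 2 plus noise (assumed example values).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(0, 0.1, size=100)

# Step 1 (model): y = w*x + b
w, b = 0.0, 0.0
eta = 0.1  # learning rate (assumed)

for _ in range(1000):
    y_hat = w * x + b
    # Step 2 (loss): L = sum_n (y^n - y_hat^n)^2
    # Step 3 (gradient descent): w <- w - eta * dL/dw, b <- b - eta * dL/db
    grad_w = -2 * np.mean((y - y_hat) * x)
    grad_b = -2 * np.mean(y - y_hat)
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # approaches (3, 2)
```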

### Potential Problems

• Overfitting
• The learning rate needs to be customized

### Improving the Steps

• Combine multiple linear models using $\delta$ functions
• Add more parameters
• Regularization
  • $L=\sum_n \left(\hat y^n-\left(b+\sum_i w_i x_i^n\right)\right)^2 + \lambda\sum_i w_i^2$
  • Encourages a smoother function: the output becomes less sensitive to variations in the input (see the sketch below)
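Continuing the sketch above, L2 regularization only changes the gradient of $w$; the strength `lam` is an assumed value, and the bias $b$ is conventionally left unregularized:

```python
lam = 0.01  # assumed regularization strength (lambda)

for _ in range(1000):
    y_hat = w * x + b
    # Regularized loss: L = sum_n (y^n - y_hat^n)^2 + lam * w^2
    grad_w = -2 * np.mean((y - y_hat) * x) + 2 * lam * w
    grad_b = -2 * np.mean(y - y_hat)  # b is not regularized
    w -= eta * grad_w
    b -= eta * grad_b
```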

## Error Analysis

### Concept

• Average Error = error due to “bias” + error due to variance
• Notation
  • $\hat f$ := the actual target function
  • $f^*$ := the best function picked from the model trained on the training data
  • $f^*$ is an *estimator* of $\hat f$
• Conclusion
  • Simple Model
    • Large Bias
    • Small Variance
  • Complex Model
    • Small Bias
    • Large Variance

### How to diagnose?

• If your model cannot fit the training data, then you have large bias. (Underfitting)
  • Try a more complex model…
• If your model can fit the training data but has large error on the testing data, then you have large variance. (Overfitting)
  • More data.
    • Effective, but not always practical.
  • Regularization. (May do harm to bias)

### Cross Validation

• In each round, divide your training set into a training set and a validation set.
• Use the training set to train your model, and use the validation set to pick the best one.
• This way, the average error on the testing set can represent the real error when the model is deployed.
• What if the validation set has its own biases? → N-fold Cross Validation (see the sketch after this list)
• First pick the best model using the validation approach, then retrain it on the whole training set.
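A minimal sketch of N-fold cross validation; `train_fn` and `error_fn` are assumed stand-ins for training a candidate model and measuring its error:

```python
import numpy as np

def n_fold_cv(x, y, train_fn, error_fn, n=3, seed=0):
    """Average validation error of one candidate model over n folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), n)
    errors = []
    for i in range(n):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n) if j != i])
        model = train_fn(x[train], y[train])             # train on n-1 folds
        errors.append(error_fn(model, x[val], y[val]))   # validate on the held-out fold
    return np.mean(errors)  # use this average to rank candidate models
```

The model with the lowest average validation error is then retrained on the whole training set, as noted above.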

### Tuning learning rates

Visualize the loss against the number of parameter updates.

• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • E.g. $1/t$ decay: $\eta^{(t)} = \frac{\eta^{(0)}}{\sqrt{t+1}}$
• But one learning rate cannot fit all parameters; we need to give different parameters different learning rates.

• Adagrad: divide the learning rate of each parameter by the root mean square of its previous derivatives (see the sketch after this list).
  • Vanilla gradient descent: $w^{(t+1)} \leftarrow w^{(t)} - \eta^{(t)}g^{(t)}$, $t\ge0$.
    • $g^{(t)}$ is the partial derivative of the loss with respect to $w$ at step $t$
  • Adagrad: $w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta^{(t)}}{\sigma^{(t)}}g^{(t)}$, $t\ge0$.
    • $\sigma^{(t)} = \sqrt{\frac{1}{t+1}\sum_{i=0}^t \left(g^{(i)}\right)^2}$
  • Using $1/t$ decay and Adagrad together, we easily get:
    • $w^{(t+1)} = w^{(t)}-\frac{\eta^{(0)}}{\sqrt{\sum_{i=0}^t \left(g^{(i)}\right)^2}}g^{(t)}$
  • Intuition: the best step size is $\frac{|\text{first derivative}|}{\text{second derivative}}$; the accumulated squared first derivatives roughly play the role of the second derivative.
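A minimal Adagrad sketch following the combined formula above; `grad_fn` and the hyperparameters are assumptions, and the small `eps` only guards against division by zero:

```python
import numpy as np

def adagrad(grad_fn, w0, eta0=0.1, steps=1000, eps=1e-8):
    """Each parameter gets its own effective learning rate:
    eta0 divided by the root of its accumulated squared gradients."""
    w = np.asarray(w0, dtype=float)
    acc = np.zeros_like(w)                 # sum of (g^(i))^2 per parameter
    for _ in range(steps):
        g = grad_fn(w)
        acc += g ** 2
        w -= eta0 / (np.sqrt(acc) + eps) * g
    return w

# E.g. minimizing (w - 3)^2, whose gradient is 2(w - 3):
print(adagrad(lambda w: 2 * (w - 3), w0=[0.0]))
```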

• Update the parameters after processing each single example (Stochastic Gradient Descent).

### Feature Scaling

• If the distributions of two inputs have very different ranges, it is recommended to scale them so that all inputs share the same range (see the sketch below).
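A minimal standardization sketch, one common way to realize feature scaling (each dimension is scaled to zero mean and unit variance):

```python
import numpy as np

def standardize(X):
    """For each feature i: x_i <- (x_i - mean_i) / std_i."""
    mean = X.mean(axis=0)   # per-feature mean over the training examples
    std = X.std(axis=0)     # per-feature standard deviation
    return (X - mean) / std
```

The same training-set means and standard deviations should be reused when scaling the test data.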

### Possible Problems

## Classification: Probabilistic Generative Model

### Regression Model vs. Probabilistic Model

• The regression model has its drawbacks: it penalizes examples that are “too correct”.
• Ideal Alternatives

### Generative Model

• How do we reformulate the problem?
  • Drawing a ball from one of two boxes: what is the probability that the drawn ball is a blue ball from box 1?
  • Analogously, drawing an $x$ from one of two classes: what is the probability that $x$ was generated from class 1?
  • This can be recast as: given a random $x$, which class does it belong to (the class with the larger posterior probability)?
    • If $P(C_1|x) \ge 0.5$, then output $C_1$.
    • Else output $C_2$.
• Prior
  • Compute $P(C_1), P(C_2)$: $P(C_1) = N(C_1)/N(\text{All})$
• Probability from Class
  • How do we find $\mu$ and $\Sigma$? Maximum Likelihood Estimation.
  • Likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$:
    • $L(\mu, \Sigma) = \prod_{i=1}^n f_{\mu, \Sigma}(x^{(i)})$
  • Let $\mu^*, \Sigma^*$ be the parameters of the Gaussian distribution with the maximum likelihood.
  • The solution:
    • $\mu^* = \frac{1}{n} \sum_{i=1}^n x^{(i)}$
    • $\Sigma^* = \frac{1}{n} \sum_{i=1}^n (x^{(i)}-\mu^*)(x^{(i)}-\mu^*)^T$
• Modifying the Model
  • Use the same covariance matrix for both classes:
    • $L(\mu^1, \mu^2, \Sigma)$
    • where $\Sigma = \frac{N(C_1)}{N(\text{All})}\Sigma^1 + \frac{N(C_2)}{N(\text{All})}\Sigma^2$
  • After derivation, the model can be written as $P_{w,b}(C_1|x) = \sigma(z)$, where $z=w \cdot x + b$; see the sketch below.
  • If we assume all features are generated independently, the resulting classifier is called a Naive Bayes Classifier.
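A minimal sketch of the generative classifier above: fit one Gaussian per class by maximum likelihood, share the covariance, and classify via the posterior (the helper names are mine):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood estimates: mu* and Sigma* from the formulas above."""
    mu = X.mean(axis=0)
    d = X - mu
    return mu, d.T @ d / len(X)

def posterior_c1(x, X1, X2):
    """P(C1|x) with a shared covariance matrix (weighted average of the two)."""
    (mu1, s1), (mu2, s2) = fit_gaussian(X1), fit_gaussian(X2)
    n1, n2 = len(X1), len(X2)
    sigma = (n1 * s1 + n2 * s2) / (n1 + n2)      # shared Sigma
    def likelihood(mu):
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** len(mu) * np.linalg.det(sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)      # priors P(C1), P(C2)
    a, b = p1 * likelihood(mu1), p2 * likelihood(mu2)
    return a / (a + b)                            # output C1 if this is >= 0.5
```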

### Logistic Regression

• Review
• Step 1: Function Set
  • We want to find $P_{w,b}(C_1|x)$.
  • If $P_{w,b}(C_1|x) \ge 0.5$ then output $C_1$, else output $C_2$.
  • $f_{w,b}(x) := P_{w,b}(C_1|x) = \sigma(z)$, where $z = w\cdot x + b$
• Step 2: Goodness of the function
  • Cross entropy: minimizing it is equivalent to maximizing the likelihood.
• Step 3: Find the best function
  • The partial derivatives take the same form as in linear regression (see the sketch after this list).
• Logistic regression is called a discriminative method.
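A minimal logistic-regression sketch: gradient descent on the cross-entropy loss, whose gradient has the same form as linear regression's (data and hyperparameters are assumed inputs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, steps=1000):
    """X: (n, d) inputs; y: (n,) labels, 1 for class C1 and 0 for C2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        f = sigmoid(X @ w + b)               # f_{w,b}(x) = P(C1|x)
        # Cross-entropy gradient: -(y^n - f(x^n)) * x^n, same form as regression
        w += eta * X.T @ (y - f) / len(X)
        b += eta * np.mean(y - f)
    return w, b   # predict C1 when sigmoid(w.x + b) >= 0.5
```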

### Discriminative v.s. Generative

• Same model: $P(C_1|x) = \sigma(w\cdot x + b)$
  • Logistic Regression: directly find $w$ and $b$
  • Generative Model: find $\mu^1$, $\mu^2$, $\Sigma^{-1}$, then compute $w$ and $b$
• But the two approaches won’t produce the same set of $w$ and $b$.

### Multi-class Classification

#### Softmax

• Definition of the target: $y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, which turns the class scores into a probability distribution.

### Limitation of logistic regression

#### Feature Transformation

• Middle layer!
• Many logistic regression units can be cascaded together to perform feature transformation.

## Deep Learning

Step 1: Neural Network

• Fully Connected Feedforward Network
• Why call it deep? Deep = many hidden layers
• Essence: the hidden layers perform feature transformation.

Step 2: Loss Function

• Cross Entropy

Step 3: Find the best function
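A minimal forward pass for a fully connected feedforward network, making the “hidden layers = feature transformation” point concrete; the layer sizes and random weights are arbitrary choices:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Each hidden layer applies one feature transformation to its input."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)        # hidden layer: linear map + nonlinearity
    W, b = layers[-1]
    z = W @ a + b                  # output layer scores
    e = np.exp(z - z.max())
    return e / e.sum()             # softmax -> class probabilities for cross entropy

# Hypothetical 2-16-16-3 network with random weights:
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 2)), np.zeros(16)),
          (rng.normal(size=(16, 16)), np.zeros(16)),
          (rng.normal(size=(3, 16)), np.zeros(3))]
print(forward(np.array([0.5, -1.0]), layers))
```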

### Why deep?

• More parameters, better performance
• Universality Theorem

• Any continuous function $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$ can be realized by a network with one hidden layer, given enough neurons.
• So why Deep Learning, not Fat Learning?

## CNN (Convolutional Neural Network)

### Why CNN for image

• Some patterns are much smaller than the whole image. That is to say, a neuron only needs to be connected to a small region of the image, not the whole image.
• The same pattern may appear in different regions of different images.
  • We could let these neurons share their parameters…
• Subsampling the pixels will not change the object.

### The whole CNN architecture

Image -> (Convolution -> Max Pooling)$^{+}$ -> Flatten -> Fully Connected Feedforward Network (as its input)

#### Convolution

• Calculation approach: slide each filter over the image and take inner products (see the sketch after this list)
• Feature Map
• Colorful image
  • Channel: one per color
• What does CNN learn? Use gradient ascent to find the input $x^{*} = \arg\max_x a^k$ that most activates the $k$-th filter.
• Deep Dream: let CNN exaggerate what it sees
• AlphaGo as an example of how the network architecture is a design choice.
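A minimal sketch of the convolution calculation (as usual in CNNs this is technically cross-correlation; the toy image and filter are made up):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution: slide the filter over the image and take
    inner products; the result is one feature map."""
    H, W = image.shape
    k = kernel.shape[0]
    return np.array([[np.sum(image[i:i+k, j:j+k] * kernel)
                      for j in range(0, W - k + 1, stride)]
                     for i in range(0, H - k + 1, stride)])

# A 3x3 filter detecting a diagonal pattern (toy example):
img = np.eye(6)
kern = np.eye(3)
print(conv2d(img, kern))   # strongest responses along the diagonal
```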

## RNN (Recurrent Neural Network)

• Example application
  • Slot filling: the neural network needs memory!
• Bidirectional RNN
  • Why? A broader view of the context: the forward pass only sees the preceding words, the backward pass only the following ones.
• Long Short-term Memory (LSTM)
  • The input and output gate signals are learned by the network itself.
• How to train an RNN?
  • Loss function? Sum the cross entropy over all time steps.
  • How to compute the partial derivatives? BPTT (Backpropagation Through Time).
  • The error surface may be very flat or very steep: clip the gradients (see the sketch after this list)…
  • LSTM can deal with gradient vanishing, but not with gradient explosion.
• Applications
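A minimal sketch of gradient (norm) clipping, the fix mentioned above for the steep cliffs of the RNN error surface; the threshold is an assumed value:

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale the gradient when its overall norm exceeds a threshold,
    so one steep region cannot blow up the parameter update."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads
```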

## Semi-supervised Learning

### Introduction

• Semi-supervised learning: labelled data $\{(x^r, \hat y^r)\}_{r=1}^R$ plus unlabelled data $\{x^u\}_{u=R}^{R+U}$
  • A set of unlabelled data, usually $U\gg R$
• Taxonomy of semi-supervised learning
  • Transductive learning: the unlabelled data is the testing data
  • Inductive learning: the unlabelled data is not the testing data
• Why semi-supervised learning?
  • Collecting data is easy, but collecting labelled data is expensive
  • We do semi-supervised learning in our own lives

### Semi-supervised learning for Generative Model

### Low-density separation assumption

• Assumption
  • “Black or white”: the region between the two classes has low density, so samples fall clearly into one class or the other.

### Smoothness assumption

• “One takes the color of one's company”: a sample tends to share the label of its neighborhood.
• “Similar” $x$ have the same $\hat y$
• More precisely:
  • The distribution of $x$ is not uniform.
  • If $x^1$ and $x^2$ are close in a high-density region (connected by a high-density path),
  • then $\hat y^1$ and $\hat y^2$ are the same.

# Self-supervised Learning

## BERT

### Introduction

• BERT is a kind of transformer encoder.

### Basic Steps

• Masking: randomly replace some tokens.
• Training goals:
  • Predict the masked tokens
  • Next sentence prediction

### GLUE: General Language Understanding Evaluation

### How to use BERT

• Case 1
  • Input a sequence, output a class.
  • Sentiment analysis (a minimal usage sketch for this case appears at the end of the case list)
• Case 2
  • Input a sequence and output a sequence of the same length.
  • POS tagging
• Case 3

  • Input two sequences and output a class
  • Natural Language Inference (NLI)
• Case 4

  • Input a sequence and output a sequence
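A minimal Case-1 sketch (sequence in, class out). It assumes the Hugging Face `transformers` library and PyTorch; the model name and label count are example choices, and the randomly initialized classification head still needs fine-tuning:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # e.g. positive/negative sentiment

inputs = tokenizer("this movie was great", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # class scores from the [CLS] representation
print(logits.softmax(dim=-1))            # meaningless until the head is fine-tuned
```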

### Why does BERT work?

• The tokens with similar meanings have similar embeddings.
• You shall know a word by the company it keeps.

## GPT-2

### Architecture

GPT-2 resembles the Transformer decoder architecture.

### Predict Next Token

### How to use GPT?

## GPT-3

Aimed at few-shot learning; 175 B parameters.

# Transformer

## Introduction

• Transformer is a sequence-to-sequence (Seq2seq) model.
• Applications…
  • Speech Recognition
  • Machine Translation
  • Speech Translation
  • Syntactic Parsing
  • Multi-label Classification
  • Most NLP applications can be framed as Question Answering
• Architecture

## Encoder

• Given a set of vectors, output another set of vectors.

• Self-attention, CNN, RNN… all of them can realize this! (A self-attention sketch follows this list.)
• Residual Connection
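A minimal scaled dot-product self-attention sketch, one way the encoder maps a set of vectors to another set; the projection matrices are assumed inputs:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d) input vectors; returns (seq_len, d_v) outputs,
    each a weighted sum over *all* positions' value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # scaled pairwise scores
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)              # row-wise softmax
    return A @ V
```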
## Decoder

### (1) Autoregressive (AT)

• Architecture
• Like an RNN: tokens are generated one at a time, each output fed back as the next input…
### (2) Non-autoregressive (NAT)

## Encoder-Decoder

## Training

• Minimize the cross entropy
• The input of the decoder is the ground truth (teacher forcing).

## Tips for training

• Copy Mechanism
  • Chat-bot
• Guided Attention
  • Monotonic attention
  • Location-aware attention
• Beam Search (a minimal sketch follows this list)
• Scheduled Sampling
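A minimal beam-search sketch; `step_fn` is an assumed callable returning the decoder's next-token distribution given a partial sequence:

```python
import numpy as np

def beam_search(step_fn, start, eos, beam_width=3, max_len=20):
    """Keep the beam_width highest-scoring partial sequences instead of
    only the single greedy one."""
    beams = [([start], 0.0)]                  # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == eos:                # finished hypotheses are kept as-is
                candidates.append((seq, logp))
                continue
            for tok, p in enumerate(step_fn(seq)):
                candidates.append((seq + [tok], logp + np.log(p + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                        # best-scoring sequence
```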