Notes on Implementing an MNIST Handwritten Digit Classifier
After studying the relevant theory, the natural next step is to put it into practice and complete this Hello World task of machine learning.
This post records c7w's entire process, along with the pain of trying to follow a Python 2 tutorial while writing Python 3.
MNIST Classifier Implementation
Fetch dataset
Download
To download the data, go to the official MNIST website and grab both the training set and the test set files. Then go slack off while the download runs.
Data format
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
| offset | type | value | description |
| --- | --- | --- | --- |
| 0000 | 32 bit integer | 0x00000801 (2049) | magic number (MSB first) |
| 0004 | 32 bit integer | 60000 | number of items |
| 0008 | unsigned byte | ?? | label |
| 0009 | unsigned byte | ?? | label |
| ........ | | | |
| xxxx | unsigned byte | ?? | label |

The label values are 0 to 9.
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
| offset | type | value | description |
| --- | --- | --- | --- |
| 0000 | 32 bit integer | 0x00000803 (2051) | magic number |
| 0004 | 32 bit integer | 60000 | number of images |
| 0008 | 32 bit integer | 28 | number of rows |
| 0012 | 32 bit integer | 28 | number of columns |
| 0016 | unsigned byte | ?? | pixel |
| 0017 | unsigned byte | ?? | pixel |
| ........ | | | |
| xxxx | unsigned byte | ?? | pixel |
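If you prefer to read these raw IDX files directly (the next section instead loads a repacked pickle), a minimal sketch of my own using numpy might look like this; it assumes the downloaded files have already been gunzipped into the working directory, and the function names are mine:

import numpy as np

def load_idx_images(path):
    # Parse an IDX3 image file into an (N, 784) float array scaled to [0, 1]
    with open(path, "rb") as f:
        data = f.read()
    # Header: magic number, number of images, rows, columns (big-endian int32)
    magic, num, rows, cols = np.frombuffer(data[:16], dtype=">i4")
    assert magic == 2051
    pixels = np.frombuffer(data[16:], dtype=np.uint8)
    return pixels.reshape(num, rows * cols).astype(np.float64) / 255.0

def load_idx_labels(path):
    # Parse an IDX1 label file into an (N,) array of digits 0-9
    with open(path, "rb") as f:
        data = f.read()
    magic, num = np.frombuffer(data[:8], dtype=">i4")
    assert magic == 2049
    return np.frombuffer(data[8:], dtype=np.uint8)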
Use Pickle to load dataset
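My notes for this step are thin, so here is a minimal sketch, assuming the raw files have been repacked into a gzipped pickle mnist.pkl.gz with the three-way training / validation / test split used by the Python 2 tutorial; the names load_data, vectorize_label and load_wrapped_data are my own:

import gzip
import pickle
import numpy as np

def load_data(path="mnist.pkl.gz"):
    # Assumed layout: three (images, labels) pairs for training/validation/test.
    # encoding="latin1" is needed because the pickle was written by Python 2.
    with gzip.open(path, "rb") as f:
        training, validation, test = pickle.load(f, encoding="latin1")
    return training, validation, test

def vectorize_label(j):
    # Turn a digit 0-9 into a one-hot 10x1 column vector for training
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def load_wrapped_data(path="mnist.pkl.gz"):
    # Reshape everything into the list-of-(x, y)-tuples form the Network class expects:
    # one-hot 10x1 labels for training, plain digit labels for testing
    tr, va, te = load_data(path)
    training_data = [(x.reshape(784, 1), vectorize_label(y)) for x, y in zip(tr[0], tr[1])]
    test_data = [(x.reshape(784, 1), y) for x, y in zip(te[0], te[1])]
    return training_data, test_data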
Define network
Init function
np.random.randn(d0, d1, ...)
Generates random initial values.
Samples are drawn from a normal distribution with mean 0 and standard deviation 1.
>>> import numpy as np
>>> np.random.randn()
0.5868205495246934
>>> np.random.randn()
1.0092581568418932
>>> np.random.randn()
0.056108818546049016
>>> np.random.randn(2, 4)
array([[-1.19376191, 0.28157974, 0.77366971, 0.06783304],
[-1.8541143 , 0.75650134, 0.59333362, -0.65325825]])
zip(a, b)
Generates the matrix-size tuples (by pairing adjacent layer sizes).
>>> a = [1,2,3,4]
>>> b = [5,6,7,8]
>>> zip(a, b)
<zip object at 0x7f48aac07140>
>>> list(zip(a,b))
[(1, 5), (2, 6), (3, 7), (4, 8)]
>>> for x, y in zip(a, b):
... print(x, y)
...
1 5
2 6
3 7
4 8
# What if their sizes do not match?
>>> a = [1,2,3,4]
>>> b = [1,2,3]
>>> zip(a,b)
<zip object at 0x7f48a93bfb00>
>>> list(zip(a,b))
[(1, 1), (2, 2), (3, 3)]
Constructing the Network class
import random  # needed later by SGD for shuffling the training data
import numpy as np
class Network(object):
def __init__(self, sizes):
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]
Let's test the constructor…
# Test
network = Network([8, 4, 1])
print(network.biases)
print(network.weights)
[array([[-0.69137659],
[ 0.38159818],
[ 1.99258114],
[-0.21144943]]), array([[1.71569818]])]
# Biases: [<Array 4x1>, <Array 1x1>]
[array([[-1.23719403, 1.53304051, 1.33083345, 0.57793759, 1.64201695,
-1.09019958, -1.72304327, -1.37257551],
[ 1.55140653, -0.66066249, 0.24355521, -0.06090572, 0.7280547 ,
-0.09691786, -1.06502271, -0.39097928],
[-0.37874956, 0.71162084, 0.23456423, -0.34639554, -1.13344851,
0.89943267, 0.44945541, -0.59622082],
[ 1.90930694, 0.4098545 , 1.7352552 , 1.04517198, -0.2241232 ,
0.14171245, 0.07186716, 2.50271226]]), array([[ 0.58568595, -1.9973499 , 0.36131549, -0.07790474]])]
# Weights: [<Array 4x8>, <Array 1x4>]
Why construct it this way? Let's recall…
- Our notations are: $a^{(2)} = \sigma(W^{(2)}a^{(1)}+b^{(2)})$
- Namely $\begin {bmatrix} a_1^{(2)}\\ a_2^{(2)}\\ \vdots\\ a_k^{(2)}\\ \end {bmatrix}= \sigma( \begin {bmatrix} w^{(2)}_{1,1} & w^{(2)}_{1,2} & \cdots & w^{(2)}_{1,n} \\ w^{(2)}_{2,1} & w^{(2)}_{2,2} & \cdots & w^{(2)}_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(2)}_{k,1} & w^{(2)}_{k,2} & \cdots & w^{(2)}_{k,n} \end {bmatrix}\begin {bmatrix} a_1^{(1)}\\ a_2^{(1)}\\ \vdots\\ a_n^{(1)}\\ \end {bmatrix} + \begin {bmatrix} b_1^{(2)}\\ b_2^{(2)}\\ \vdots\\ b_k^{(2)}\\ \end {bmatrix})$
- Suppose Layer $m-1$ has $n$ neurons and layer $m$ has $k$ neurons.
- Layer $m$ has $k$ biases because each neuron in layer $m$ has its own bias.
- Each neuron in layer $m$ has $n$ weights connecting it to layer $m-1$; namely, neuron $i$ in layer $m$ has weights $w_{i, 1}, w_{i, 2}, \cdots, w_{i, n}$
graph LR
    subgraph "Layer m-1"
        A0("1")
        A1("2")
        A2("3")
        A3("4")
    end
    subgraph "Layer m"
        B2("m(i-1)")
        B0("m(i)")
        B1("m(i+1)")
    end
    A0 --> B0
    A1 --> B0
    A2 --> B0
    A3 --> B0
In other words, the weight matrix between layer $m-1$ and layer $m$ should have size len(m) * len(m-1), which is exactly the y * x in the constructor.
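As a quick sanity check of the shapes (a small snippet of my own, reusing the Network([8, 4, 1]) test example from above):

network = Network([8, 4, 1])
print([w.shape for w in network.weights])   # [(4, 8), (1, 4)], i.e. (len(m), len(m-1))
print([b.shape for b in network.biases])    # [(4, 1), (1, 1)]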
Sigmoid function and its derivative
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))
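A quick numerical check of these two helpers (a tiny snippet of my own, not from the tutorial): $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$.

print(sigmoid(0))                              # 0.5
print(sigmoid_derivative(0))                   # 0.5 * (1 - 0.5) = 0.25
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # the tails saturate towards 0 and 1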
Stochastic gradient descent
Concept
Let's recall the SGD material we covered in the theory part.
Batch_size? Epoch? Iteration?
For example, say you have 1000 samples. Running through all of them before every parameter update would be slow, so you can split them into chunks of 100 samples each, giving 10 chunks. Here batch_size = 100.
Each chunk of 100 samples is called a batch; the parameters are updated after every batch, and each such update is one iteration.
An epoch is one full pass over all 10 batches (i.e. 10 iterations).
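In code, the bookkeeping for the example above is just this (illustrative numbers only):

n = 1000                                                    # total training samples
batch_size = 100                                            # samples per mini-batch
iterations_per_epoch = (n + batch_size - 1) // batch_size   # 10 parameter updates per epoch
epochs = 30                                                 # e.g. 30 full passes over the data
total_iterations = epochs * iterations_per_epoch            # 300 updates in total
print(iterations_per_epoch, total_iterations)               # 10 300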
Implementation
# Stochastic gradient descent
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
    # Note: uses `import random` from the top of the module
    if test_data:
        n_test = len(test_data)
    n = len(training_data)
    for j in range(0, epochs):
        random.shuffle(training_data)
        # Divide training data into mini-batches
        mini_batches = [training_data[mini_batch_size*k: mini_batch_size*(k+1)]
                        for k in range(0, (n+mini_batch_size-1)//mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            # Evaluate once per epoch instead of twice
            correct = self.evaluate(test_data)
            print(f"Epoch {j} finished, with an accuracy of {correct}/{n_test} ({correct/n_test})")
        else:
            print(f"Epoch {j} finished.")
Parameters
- training_data: list of tuples, where each tuple is (DATA, LABEL)
- epochs: how many times the whole training set will be passed over
- mini_batch_size: determines the size of every small batch
- eta: the learning rate $\eta$ in the update step $v \rightarrow v' = v - \eta\nabla C$
- test_data: for verifying accuracy after each epoch
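Putting it together, a typical call might look like the sketch below; the [784, 30, 10] architecture and the hyperparameter values are the ones suggested by the Python 2 tutorial I was following, not anything fixed by the code above, and load_wrapped_data is the loader sketch from the pickle section:

training_data, test_data = load_wrapped_data("mnist.pkl.gz")
network = Network([784, 30, 10])   # 28*28 input pixels, one hidden layer, 10 output digits
network.SGD(training_data, epochs=30, mini_batch_size=10, eta=3.0, test_data=test_data)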
Update mini-batches
def update_mini_batch(self, mini_batch, eta):
    # Init nabla matrices
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # Accumulate to calc nablas
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # Update weights and biases with the averaged gradient
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
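In equation form, what update_mini_batch performs is the averaged gradient step over the mini-batch (with $m$ = len(mini_batch)):

$$w \rightarrow w - \frac{\eta}{m}\sum_{(x,y)\in\text{mini-batch}}\nabla_w C_{x,y}, \qquad b \rightarrow b - \frac{\eta}{m}\sum_{(x,y)\in\text{mini-batch}}\nabla_b C_{x,y}$$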
Back propagation
graph LR
    subgraph "Layer m-1"
        A0("1")
        A1("2")
        A2("3")
        A3("4")
    end
    subgraph "Layer m"
        B2("output(i-1)")
        B0("output(i)")
        B1("output(i+1)")
    end
    subgraph "Anticipated Label"
        C2("y(i-1)")
        C0("y(i)")
        C1("y(i+1)")
    end
    A0 --> B0
    A1 --> B0
    A2 --> B0
    A3 --> B0
    B0 --> C0
    B1 --> C1
    B2 --> C2
Consider the computation process…
graph RL
    CX("$C$")
    aE("$a^L$")
    zE("$z^L$")
    wE("$w^L$")
    aEL("$a^{L-1}$")
    bE("$b^L$")
    zEL("$z^{L-1}$")
    wEL("$w^{L-1}$")
    aELL("$a^{L-2}$")
    bEL("$b^{L-1}$")
    aE -->|"$(y_x-a^L_x)^2$"| CX
    zE -->|"$\sigma$"| aE
    wE --> zE
    aEL --> zE
    bE --> zE
    zEL -->|"$\sigma$"| aEL
    wEL --> zEL
    aELL --> zEL
    bEL --> zEL
    dot("$\cdots$") --> aELL
- Then the core equations of back propagation…
- $\frac {\partial C} {\partial z_i^L} = \frac {\partial C} {\partial a_i^L}\frac {\partial a_i^L} {\partial z_i^L} = (a_i^L-y_i)\sigma'(z_i^L)$ (initialization)
- According to $z^{(M)}_i = \sum_j w^{(M)}_{i, j}a_j^{(M-1)}+ b_i^{(M)}$, suppose we have calculated $\frac {\partial C} {\partial z^M_i}$ for every neuron $i$ in layer $M$.
- How can we get $\frac {\partial C} {\partial w^M_{i, j}}$ for every neuron $i$ in layer $M$ and $j$ in layer $M-1$?
- $\frac {\partial C} {\partial w^M_{i, j}} = \frac {\partial C} {\partial z^M_i} \frac {\partial z^M_i}{\partial w^M_{i, j}} = \frac {\partial C} {\partial z^M_i} a_j^{M-1}$
- How can we get $\frac {\partial C} {\partial b^M_i}$ for every neuron $i$ in layer $M$?
- $\frac {\partial C} {\partial b^M_{i}} = \frac {\partial C} {\partial z^M_i} \frac {\partial z^M_i}{\partial b^M_{i}} = \frac {\partial C} {\partial z^M_i}$
- How can we get $\frac {\partial C} {\partial a^{M-1}_j}$ and $\frac {\partial C} {\partial z^{M-1}_j}$ for every neuron $j$ in layer $M-1$?
- $\frac {\partial C} {\partial a^{M-1}_j} = \sum_i \frac {\partial C}{\partial z^M_i} \frac {\partial z^M_i} {\partial a_j^{M-1}} = \sum_i \frac {\partial C}{\partial z^M_i} w_{i,j}^M$
- Then $\frac {\partial C} {\partial z^{M-1}_j} = \frac {\partial C} {\partial a^{M-1}_j} \sigma'(z^{M-1}_j)$
- With $\frac {\partial C} {\partial z^{M-1}_j}$ in hand, we are in the same situation as before, just one layer earlier.
- Then by recursion, we can calculate all the partial derivatives of the weights and biases.
def backprop(self, x, y):
    '''Return (nabla_b, nabla_w) representing the
    gradient of the cost function C.
    Each return value is a layer-by-layer list of numpy arrays.'''
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # Feed forward
    activation = x
    activations = [x]
    zs = []  # Store all z vectors. z = w*a + b
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # After feeding forward in an M-layer network,
    # zs has M-1 elements while activations has M elements.
    # Preparation for the recursion: start from the output layer
    delta = (activations[-1] - y) * sigmoid_derivative(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for i in range(2, self.num_layers):
        delta = np.dot(self.weights[-i+1].transpose(), delta) * sigmoid_derivative(zs[-i])
        nabla_b[-i] = delta
        nabla_w[-i] = np.dot(delta, activations[-i-1].transpose())
    return (nabla_b, nabla_w)
Evaluate
Use np.argmax() to find the predicted label.
>>> import numpy as np
>>> a = [1,2,3,4,5,6,7,8]
>>> np.argmax(a)
7
def evaluate(self, test_data):
# for x, y in test_data:
# print(self.feedforward(x))
test_results = [(np.argmax(self.feedforward(x)), y) for x, y in test_data]
return sum(int(x==y) for (x, y) in test_results)
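One caveat: evaluate (and the SGD usage sketch earlier) relies on self.feedforward, which never appears in these notes. A minimal sketch of it, consistent with how the constructor and backprop above treat the weights and biases, would be:

def feedforward(self, a):
    # Propagate the input column vector a through every layer: a <- sigma(w a + b)
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a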