After studying the relevant theory, it is only natural to put it into practice and complete this Hello World task of the machine learning world.

This post records the whole process of c7w working through it, together with the pain of trying to "fit" a Python 2 tutorial using Python 3.

MNIST Classifier Implementation

Fetch dataset

Download

Download the data: go to the official MNIST website and grab both the training set and the test set. Then go slack off while it downloads.
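
If you would rather script the download, here is a minimal sketch. I am assuming the files are still served from the classic http://yann.lecun.com/exdb/mnist/ location; if that host refuses the request, swap BASE_URL for any MNIST mirror (the four file names are the official ones).

import urllib.request

# Assumption: the classic MNIST host; replace BASE_URL with a mirror if needed.
BASE_URL = "http://yann.lecun.com/exdb/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

for name in FILES:
    print(f"Downloading {name} ...")
    urllib.request.urlretrieve(BASE_URL + name, name)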


Data format

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000801(2049)  magic number (MSB first)
0004     32 bit integer  60000             number of items
0008     unsigned byte   ??                label
0009     unsigned byte   ??                label
........
xxxx     unsigned byte   ??                label

The label values are 0 to 9.

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000803(2051)  magic number
0004     32 bit integer  60000             number of images
0008     32 bit integer  28                number of rows
0012     32 bit integer  28                number of columns
0016     unsigned byte   ??                pixel
0017     unsigned byte   ??                pixel
........
xxxx     unsigned byte   ??                pixel
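
Since the format above is just a fixed-size big-endian header followed by raw bytes, it can be parsed directly with struct and numpy. A minimal sketch (the helper names are mine, and I assume the .gz files have already been decompressed into the working directory):

import struct

import numpy as np


def load_idx_labels(path):
    # Header: magic number 2049, number of items; then one unsigned byte per label.
    with open(path, "rb") as f:
        magic, num = struct.unpack(">II", f.read(8))
        assert magic == 2049
        return np.frombuffer(f.read(), dtype=np.uint8)


def load_idx_images(path):
    # Header: magic number 2051, number of images, rows, columns; then raw pixels.
    with open(path, "rb") as f:
        magic, num, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
        return pixels.reshape(num, rows * cols)


labels = load_idx_labels("train-labels-idx1-ubyte")
images = load_idx_images("train-images-idx3-ubyte")
print(images.shape, labels.shape)  # (60000, 784) (60000,)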

Use Pickle to load dataset
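
A minimal loading sketch in the spirit of the original tutorial's mnist_loader. I am assuming the data has been repackaged as a gzipped pickle mnist.pkl.gz containing (training_data, validation_data, test_data), each split being a pair of (images, labels) arrays. The Python 2 vs Python 3 pain mentioned at the top shows up right here: a pickle written by Python 2 only loads under Python 3 with encoding='latin1'.

import gzip
import pickle

import numpy as np


def load_data(path="mnist.pkl.gz"):
    # Assumption: `path` is a gzipped pickle with (training, validation, test) splits.
    with gzip.open(path, "rb") as f:
        # encoding='latin1' is needed to read a Python 2 pickle under Python 3.
        return pickle.load(f, encoding="latin1")


def vectorize(j):
    # One-hot encode a digit label as a (10, 1) column vector.
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e


def load_data_wrapper(path="mnist.pkl.gz"):
    # Reshape into the (DATA, LABEL) tuples used by the training code below:
    # training labels become one-hot vectors, test labels stay as plain digits.
    tr, va, te = load_data(path)
    training_data = [(x.reshape(784, 1), vectorize(y)) for x, y in zip(tr[0], tr[1])]
    test_data = [(x.reshape(784, 1), y) for x, y in zip(te[0], te[1])]
    return training_data, test_data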

Define network

Init function

np.random.randn(d0, d1, ...) generates random initial values

It draws samples from a normal distribution with mean 0 and standard deviation 1.

>>> import numpy as np
>>> np.random.randn()
0.5868205495246934
>>> np.random.randn()
1.0092581568418932
>>> np.random.randn()
0.056108818546049016
>>> np.random.randn(2, 4)
array([[-1.19376191,  0.28157974,  0.77366971,  0.06783304],
       [-1.8541143 ,  0.75650134,  0.59333362, -0.65325825]])

zip(a, b) generates the size tuples for the weight matrices

>>> a = [1,2,3,4]
>>> b = [5,6,7,8]
>>> zip(a, b)
<zip object at 0x7f48aac07140>
>>> list(zip(a,b))
[(1, 5), (2, 6), (3, 7), (4, 8)]
>>> for x, y in zip(a, b):
...     print(x, y)
...
1 5
2 6
3 7
4 8

# What if their sizes do not match?
>>> a = [1,2,3,4]
>>> b = [1,2,3]
>>> zip(a,b)
<zip object at 0x7f48a93bfb00>
>>> list(zip(a,b))
[(1, 1), (2, 2), (3, 3)]

Constructing the Network class

import numpy as np


class Network(object):
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # One bias column vector per non-input layer: shape (y, 1)
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # Weight matrix between consecutive layers: shape (next layer, previous layer)
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

Let's test the constructor…

# Test
network = Network([8, 4, 1])
print(network.biases)
print(network.weights)
[array([[-0.69137659],
       [ 0.38159818],
       [ 1.99258114],
       [-0.21144943]]), array([[1.71569818]])]
# Biases: [<Array 4x1>, <Array 1x1>]

[array([[-1.23719403,  1.53304051,  1.33083345,  0.57793759,  1.64201695,
        -1.09019958, -1.72304327, -1.37257551],
       [ 1.55140653, -0.66066249,  0.24355521, -0.06090572,  0.7280547 ,    
        -0.09691786, -1.06502271, -0.39097928],
       [-0.37874956,  0.71162084,  0.23456423, -0.34639554, -1.13344851,    
         0.89943267,  0.44945541, -0.59622082],
       [ 1.90930694,  0.4098545 ,  1.7352552 ,  1.04517198, -0.2241232 ,    
         0.14171245,  0.07186716,  2.50271226]]), array([[ 0.58568595, -1.9973499 ,  0.36131549, -0.07790474]])]
# Weights: [<Array 4x8>, <Array 1x4>]

Why construct it this way? Let's recall…

  • Our notations are: $a^{(2)} = \sigma(W^{(2)}a^{(1)}+b^{(2)})$
  • Namely $\begin {bmatrix} a_1^{(2)}\\ a_2^{(2)}\\ \vdots\\ a_k^{(2)}\\ \end {bmatrix}= \sigma( \begin {bmatrix} w^{(2)}_{1,1} & w^{(2)}_{1,2} & \cdots & w^{(2)}_{1,n} \\ w^{(2)}_{2,1} & w^{(2)}_{2,2} & \cdots & w^{(2)}_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(2)}_{k,1} & w^{(2)}_{k,2} & \cdots & w^{(2)}_{k,n} \end {bmatrix}\begin {bmatrix} a_1^{(1)}\\ a_2^{(1)}\\ \vdots\\ a_n^{(1)}\\ \end {bmatrix} + \begin {bmatrix} b_1^{(2)}\\ b_2^{(2)}\\ \vdots\\ b_k^{(2)}\\ \end {bmatrix})$
  • Suppose layer $m-1$ has $n$ neurons and layer $m$ has $k$ neurons.
    • Layer $m$ has $k$ biases because each neuron in layer $m$ has its own bias.
    • Each neuron in layer $m$ has $n$ weights connecting it to layer $m-1$, namely neuron $i$ in layer $m$ has weights $w_{i, 1}, w_{i, 2}, \cdots, w_{i, n}$
graph LR
subgraph "Layer m-1"
A0("1")
A1("2")
A2("3")
A3("4")
end
subgraph "Layer m"
B2("m(i-1)")
B0("m(i)")
B1("m(i+1)")
end
A0 --> B0
A1 --> B0
A2 --> B0
A3 --> B0

In other words, the weight matrix between layer $m-1$ and layer $m$ should have size len(m) * len(m-1), namely y * x in the constructor.
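
A quick shape check with the Network([8, 4, 1]) instance from the test above: the first weight matrix is (4, 8), so multiplying it with an (8, 1) activation column gives a (4, 1) column, which lines up with the (4, 1) bias.

a = np.random.randn(8, 1)              # activations of layer m-1 (8 neurons)
W, b = network.weights[0], network.biases[0]
print(W.shape, b.shape)                # (4, 8) (4, 1)
print((np.dot(W, a) + b).shape)        # (4, 1): pre-activations of layer m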

Sigmoid function and its derivative

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriative(x):
    return sigmoid(x) * (1-sigmoid(x))
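
A quick sanity check of both helpers: $\sigma(0) = 0.5$, $\sigma'(0) = 0.25$, and the derivative shrinks toward zero for large $|x|$.

print(sigmoid(0.0))              # 0.5
print(sigmoid_deriative(0.0))    # 0.25
print(sigmoid_deriative(10.0))   # ~4.5e-05, the sigmoid saturates for large inputs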

Stochastic gradient descent

Concept

Let's recall what we learned about SGD in the theory part.

Batch_size? Epoch? Iteration?

For example, suppose you have 1000 samples. Running through the whole dataset before every parameter update would be slow, so you can split it into groups of 100, giving 10 groups. Here batch_size = 100.

Each group of 100 samples is called a batch; the parameters are updated after every batch, and each such update is called an iteration.

An epoch is one full pass over all 10 batches, i.e. 10 iterations.
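
The same bookkeeping in code, on a toy list of 1000 "samples": with batch_size = 100 there are 10 iterations (parameter updates) per epoch.

data = list(range(1000))       # pretend these are 1000 training samples
batch_size = 100

batches = [data[k:k + batch_size] for k in range(0, len(data), batch_size)]
print(len(batches))            # 10 -> 10 iterations per epoch

epochs = 3
print(epochs * len(batches))   # 30 parameter updates in total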

Implementation

# Stochastic gradient descent
# Note: needs `import random` at the top of the file.
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
    if test_data:
        n_test = len(test_data)
    n = len(training_data)

    for j in range(0, epochs):
        random.shuffle(training_data)
        # Divide the training data into mini-batches
        mini_batches = [training_data[k*mini_batch_size:(k+1)*mini_batch_size]
                        for k in range(0, (n+mini_batch_size-1)//mini_batch_size)]

        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)

        if test_data:
            correct = self.evaluate(test_data)
            print(f"Epoch {j} finished, with an accuracy of "
                  f"{correct}/{n_test} ({correct/n_test})")
        else:
            print(f"Epoch {j} finished.")

Parameters

  • training_data: list of tuples, where each tuple is (DATA, LABEL)
  • epochs: how many times the full training data is passed through
  • mini_batch_size: the size of every mini-batch
  • eta: the learning rate $\eta$ in $v \rightarrow v' = v - \eta\nabla C$
  • test_data: for verifying accuracy after each epoch

Update mini-batches

def update_mini_batch(self, mini_batch, eta):
    # Init nabla matrices
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]

    # Accumulate to calc nablas
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    
    # Update weights and biases
    self.biases = [b-nb/len(mini_batch)*eta for b,
                   nb in zip(self.biases, nabla_b)]
    self.weights = [w-nw/len(mini_batch)*eta for w,
                    nw in zip(self.weights, nabla_w)]
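
In equation form, this method applies the averaged mini-batch update with $m = $ len(mini_batch): $w \rightarrow w - \frac{\eta}{m}\sum_{x \in \text{mini-batch}} \frac{\partial C_x}{\partial w}$ and $b \rightarrow b - \frac{\eta}{m}\sum_{x \in \text{mini-batch}} \frac{\partial C_x}{\partial b}$.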

Back propagation

graph LR
subgraph "Layer m-1"
A0("1")
A1("2")
A2("3")
A3("4")
end
subgraph "Layer m"
B2("output(i-1)")
B0("output(i)")
B1("output(i+1)")
end
subgraph "Anticipated
Label" C2("y(i-1)") C0("y(i)") C1("y(i+1)") end A0 --> B0 A1 --> B0 A2 --> B0 A3 --> B0 B0 --> C0 B1 --> C1 B2 --> C2

考虑计算过程…

graph RL
CX("$C$")
aE("$a^L$")
zE("$z^L$")
wE("$w^L$")
aEL("$a^{L-1}$")
bE("$b^L$")
zEL("$z^{L-1}$")
wEL("$w^{L-1}$")
aELL("$a^{L-2}$")
bEL("$b^{L-1}$")

aE -->|"$(y_x-a^L_x)^2$"| CX
zE -->|"$\sigma$"| aE
wE --> zE
aEL --> zE
bE --> zE
zEL -->|"$\sigma$"| aEL
wEL --> zEL
aELL --> zEL
bEL --> zEL
dot("$\cdots$") --> aELL

  • Then the core equations of back propagation…
    • $\frac {\partial C} {\partial z_i^L} = \frac {\partial C} {\partial a_i^L}\frac {\partial a_i^L} {\partial z_i^L} = (a_i^L-y_i)\sigma'(z_i^L)$ (Initialize)
    • According to $z^{(L)}_i = \sum_j w^{(L)}_{i, j}a_j^{(L-1)}+ b_i^{(L)}$, suppose we have calculated $\frac {\partial C} {\partial z^M_i}$ for every neuron $i$ in layer $M$.
      • How can we get $\frac {\partial C} {\partial w^M_{i, j}}$ for every neuron $i$ in layer $M$ and $j$ in layer $M-1$?
        • $\frac {\partial C} {\partial w^M_{i, j}} = \frac {\partial C} {\partial z^M_i} \frac {\partial z^M_i}{\partial w^M_{i, j}} = \frac {\partial C} {\partial z^M_i} a_j^{M-1}$
      • How can we get $\frac {\partial C} {\partial b^M_i}$ for every neuron $i$ in layer $M$?
        • $\frac {\partial C} {\partial b^M_{i}} = \frac {\partial C} {\partial z^M_i} \frac {\partial z^M_i}{\partial b^M_{i}} = \frac {\partial C} {\partial z^M_i}$
      • How can we get $\frac {\partial C} {\partial a^{M-1}_j}$ and $\frac {\partial C} {\partial z^{M-1}_j}$ for every neuron $j$ in layer $M-1$?
        • $\frac {\partial C} {\partial a^{M-1}_j} = \sum_i \frac {\partial C}{\partial z^M_i} \frac {\partial z^M_i} {\partial a_j^{M-1}} = \sum_i \frac {\partial C}{\partial z^M_i} w_{i,j}^M$
        • Then $\frac {\partial C} {\partial z^{M-1}_j} = \frac {\partial C} {\partial a^{M-1}_j} \sigma'(z^{M-1}_j)$
    • Then, by recursion, we can calculate all the partial derivatives with respect to the weights and biases.

def backprop(self, x, y):
    '''Return a tuple (nabla_b, nabla_w) representing the
    gradient of the cost function C.
    Each return value is a layer-by-layer list of numpy arrays.'''
    
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    
    # Feed forward
    activation = x
    activations = [x]
    zs = [] # Store all Z vectors. z = w*a + b
    
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    
    '''After feeding forward in an M-layer network,
    zs has M-1 elements while activations has M elements.'''
    # Preparation for the recursion: delta for the output layer
    delta = (activations[-1] - y) * sigmoid_deriative(zs[-1])

    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    
    for i in range(2, self.num_layers):
        delta = np.dot(self.weights[-i+1].transpose(), delta) * sigmoid_deriative(zs[-i])
        nabla_b[-i] = delta
        nabla_w[-i] = np.dot(delta, activations[-i-1].transpose())
    
    return (nabla_b, nabla_w)
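
A quick sanity check that backprop returns gradients shaped exactly like the parameters, assuming the methods above have been added to the Network class (the input and target below are made up just for the check):

network = Network([8, 4, 1])
x = np.random.randn(8, 1)   # a fake input sample
y = np.array([[1.0]])       # a fake target for the single output neuron

nabla_b, nabla_w = network.backprop(x, y)
print([nb.shape for nb in nabla_b])  # [(4, 1), (1, 1)], same shapes as the biases
print([nw.shape for nw in nabla_w])  # [(4, 8), (1, 4)], same shapes as the weights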

Evaluate

Use np.argmax() to find the predicted label

>>> import numpy as np
>>> a = [1,2,3,4,5,6,7,8]
>>> np.argmax(a)
7

def evaluate(self, test_data):
    '''Return the number of test samples for which the network
    predicts the correct label (the output neuron with the highest activation).'''
    test_results = [(np.argmax(self.feedforward(x)), y)
                    for x, y in test_data]
    return sum(int(x == y) for (x, y) in test_results)

Test our network
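
A hedged end-to-end sketch: I am assuming the load_data_wrapper helper from the pickle section, and that SGD, update_mini_batch, backprop, and evaluate have all been added as methods of Network. evaluate also calls self.feedforward, which was not shown above, so a minimal version is included; the hyperparameters are the usual ones for this architecture, not tuned values.

# feedforward is used by evaluate() but was not defined earlier; add it to Network:
def feedforward(self, a):
    '''Return the network output for an input column vector a.'''
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Driver (assuming load_data_wrapper from the pickle section)
training_data, test_data = load_data_wrapper("mnist.pkl.gz")
network = Network([784, 30, 10])   # 28*28 pixels in, 10 digit classes out
network.SGD(training_data, epochs=30, mini_batch_size=10, eta=3.0,
            test_data=test_data)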

Reference