再探反向传播算法(手写体识别Python实例)

在上一篇博文再探反向传播算法(推导)中,我们详细介绍了BP算法的由来及其详细推导过程。本篇博文将以手写体识别为例,分别用两个不同的目标函数(交叉熵和平方误差)来写出反向传播的源码(基于Python 3.x)。

1.网络结构

本例的数据集和网络设计结构,均来自coursera第五周的课后作业,关于手写体识别。该数据的维度为 5000 × 400 ,即5000个样本(图片),维度为400。因此,网络的输入层为400个神经元,另外两个网络层的尺寸分别是,隐藏层25,输出层10(1~10中类型,其中10表示0)。网络结构图如下所示:

这里写图片描述
其中:

L = 神经网络总共包含的层数 S l = l 层的神经元数目 K = 输出层的神经元数,亦即分类的数目 w i j l = l j l + 1 i

即对如上网络结构来说, L = 3 , s 1 = 400 , s 2 = 25 , s 3 = K = 10 a i l 表示第 l 层第 i 个神经元的激活值, b l 表示第 l 层的偏置。

根据再探反向传播算法(推导)中的推导过程,可以得出如下只含有一个样本时的梯度求解公式(不含正则化项):

(1) J w l = δ l + 1 ( a l ) T (2) δ l = ( w l ) T δ l + 1 f ( z l ) (3) δ i L = J z i L = J a i L a i L z i L = J a i L f ( z i L ) z i L = J a i L f ( z i L ) (4) J b l = δ l + 1

2.平方误差目标函数

我们先用平方误差函数作为目标函数来构建网络,然后再使用交叉熵。平方误差的目标函数如下:

J ( W , b ) = [ 1 m k = 1 m J ( W , b ; x ( k ) , y ( k ) ) ] + λ 2 l = 1 L 1 i = 1 S l + 1 j = 1 S l ( w i j l ) 2 = [ 1 m k = 1 m ( 1 2 ( h W , b ( x ( k ) ) y ( k ) ) 2 ) ] + λ 2 l = 1 L 1 i = 1 S l + 1 j = 1 S l ( w i j l ) 2

对于平方误差函数来说, δ L = [ a L y ] f ( z L ) ,且:

w i j l J ( W , b ) = [ 1 m k = 1 m w i j l J ( W , b ; x ( k ) , y ( k ) ) ] + λ w i j l b i l J ( W , b ) = 1 m k = 1 m b i l J ( W , b ; x ( k ) , y ( k ) )

正向传播:

m, n = np.shape(X)  # m:samples, n: dimensions
    a1 = X.T  # 400 by 5000
    z2 = np.dot(W1, a1) + b1  # 25 by 400 dot 400 by 5000 + 25 by 1= 25 by 5000
    a2 = sigmoid(z2)  # 25 by 5000
    z3 = np.dot(W2, a2) + b2  # 10 by 25 dot 25 by 5000 + 10 by 1= 10 by 5000
    a3 = sigmoid(z3)  # 10 by 5000
    cost = (1/m)*np.sum((a3-y_label)**2)+(lambd/2)*(np.sum(W1**2)+np.sum(W2**2))

温馨提示:写代码的时候备注上维度,可以很方便检查错误

反向传播:

delta3=-(y_label-a3)*sigmoidGradient(z3)# 10 by 5000 第L层残差
    df_w2=np.dot(delta3,a2.T)# 10 by 5000 dot 5000 by 25 = 10 by 25# J对w2的导数
    df_w2=(1/m)*df_w2+lambd*W2# J对w2的导数 + 正则化项对于的J对w2的导数

    delta2=np.dot(W2.T,delta3)*sigmoidGradient(z2)# 25 by 10 dot 10 by 5000 dot 25 by 5000= 25 by 5000
    df_w1=np.dot(delta2,a1.T)# 25 by 5000 dot 5000 by 400 = 25 by 400
    df_w1=(1/m)*df_w1+lambd*W1

    df_b1=(1/m)*np.sum(delta2,axis=1).reshape(b1.shape)#
    df_b2=(1/m)*np.sum(delta3,axis=1).reshape(b2.shape)

完整源码戳此处

3.交叉熵目标函数

J ( W , b ) = 1 m k = 1 m [ y ( k ) log ( h W , b ( x ( k ) ) ) + ( 1 y ( k ) ) log ( 1 h W , b ( x ( k ) ) ) ] + λ 2 m l = 1 L 1 i = 1 S l + 1 j = 1 S l ( w i j l ) 2

对于平方误差函数来说, δ L = [ a L y ] ,且:

w i j l J ( W , b ) = [ 1 m k = 1 m w i j l J ( W , b ; x ( k ) , y ( k ) ) ] + λ m w i j l b i l J ( W , b ) = 1 m k = 1 m b i l J ( W , b ; x ( k ) , y ( k ) )

正向传播:

a1 = X.T  # 400 by 5000
    z2 = np.dot(W1, a1) + b1  # 25 by 400 * 400 by 5000 + 25 by 1= 25 by 5000
    a2 = sigmoid(z2)  # 25 by 5000
    z3 = np.dot(W2, a2) + b2  # 10 by 25 * 25 by 5000 + 10 by 1= 10 by 5000
    a3 = sigmoid(z3)  # 10 by 5000
    cost = (1/m)*np.sum(np.sum(-y_label*np.log(a3)-(1-y_label)*np.log(1-a3)))
    cost = cost+(lambd / (2*m)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    # 除了目标函数变了,其他均没变

反向传播:

delta3=(a3-y_label)
    df_w2=np.dot(delta3,a2.T)# 10 by 5000 dot 5000 by 25 = 10 by 25
    df_w2=(1/m)*df_w2+(lambd/m)*W2
    delta2=np.dot(W2.T,delta3)*sigmoidGradient(z2)# 25 by 10 dot 10 by 5000 * 25 by 5000= 25 by 5000
    df_w1=np.dot(delta2,a1.T)# 25 by 5000 dot 5000 by 400 = 25 by 400
    df_w1=(1/m)*df_w1+(lambd/m)*W1
    df_b1=(1/m)*np.sum(delta2,axis=1).reshape(b1.shape)#
    df_b2=(1/m)*np.sum(delta3,axis=1).reshape(b2.shape)

完整源码戳此处

4.其余部分代码说明

4.1 载入数据集

def loadData():
    data=load.loadmat('ex4data1.mat')
    X=data['X']# 5000 by 400 samples by dimensions
    y=data['y'].reshape(5000)
    eye=np.eye(10)
    y_label=eye[:,y-1] # 10 by 5000# 矩阵化标签
    ss=StandardScaler() # 标准化
    X=ss.fit_transform(X)
    return X,y,y_label

关于矩阵化标签,戳此处怎么把数据集的输出值转换成只含有0,1的标签向量?
由于Python中下标是从0开始索引,所以第6行代码中,y减去了1

4.2训练及保存参数

def train():
    X, y, y_label=loadData()
    m,n=np.shape(X)# m:samples, n: dimensions
    input_layer_size=400
    hidden_layer_size=25
    output_layer_size=10

    epsilong_init=0.12
    W1=np.random.rand(hidden_layer_size,input_layer_size)*2*epsilong_init-epsilong_init
    W2=np.random.rand(output_layer_size,hidden_layer_size)*2*epsilong_init-epsilong_init
    b1=np.random.rand(hidden_layer_size,1)*2*epsilong_init-epsilong_init
    b2=np.random.rand(output_layer_size,1)*2*epsilong_init-epsilong_init
#第8行到第12行的作用是: 使得初始化后每个参数取值的绝对值都小于0.12

    for i in range(iteration):
        arr = np.arange(5000)# 生成1-5000
        np.random.shuffle(arr)# 随机打乱
        index = arr[:500]
        batch_X=X[index,:]# 取前500个构成一个batch
        batch_y=y_label[:,index]
#第16行到第20行的作用是:每个随机取500个样本构成一个batch

        c,df_w1, df_w2, df_b1, df_b2=costFandGradient(batch_X, batch_y, W1, b1, W2, b2, lambd)
        cost.append(round(c,4))
        W1, b1, W2, b2=gradientDescent(learn_rate,W1,b1,W2,b2,df_w1,df_w2,df_b1,df_b2)

    p={'W1':W1,'b1':b1,'W2':W2,'b2':b2}
    temp=open('data','wb')
    pickle.dump(p,temp)
#第27行到第29行的作用是:以字典的形式保存参数,文件名为data
#也就是说,训练好一次之后,下次直接载入保存好的参数模型预测即可,不用再训练

4.2载入模型参数及预测

def prediction():
    X, y, y_label = loadData()

    p = open('data', 'rb')
    data = pickle.load(p)
    W1 = data['W1']
    W2 = data['W2']
    b1 = data['b1']
    b2 = data['b2']
#第4行到第9行的作用是:载入模型参数,并赋值给相关变量

    a1 = X.T  # 400 by 5000
    z2 = np.dot(W1, a1) + b1  # 25 by 400 * 400 by 5000 + 25 by 1= 25 by 5000
    a2 = sigmoid(z2)  # 25 by 5000
    z3 = np.dot(W2, a2) + b2  # 10 by 25 * 25 by 5000 + 10 by 1= 10 by 5000
    a3 = sigmoid(z3)  # 10 by 5000

    y_pre = np.zeros(a3.shape[1], dtype=int)
    for i in range(a3.shape[1]):
        col = a3[:, i]# 遍历5000个样本输出的预测
        index = np.where(col == np.max(col))[0][0] + 1# 选择概率值最大的索引,由于Python从0开始索引,所以加了一个1
        y_pre[i] = index
    print(accuracy_score(y,y_pre))
相关文章
相关标签/搜索