Machine Learning: LSTM RNN Backpropagation

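Could someone give a clear explanation of backpropagation for LSTM RNNs?
This is the type of architecture I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error between the hypothesis and the output, which is then used to adjust the weights of a neural network. My question is how LSTM backpropagation differs from backpropagation in regular neural networks.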


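I am unsure how to find the initial error for each gate. Do you use the first error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am also unsure how the cell state plays a role in LSTM backpropagation. I have looked thoroughly for a good source on LSTMs but have yet to find one.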

That's a good question. You should certainly take a look at the suggested posts for details, but a complete example here should help too.

RNN Backpropagation

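I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.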

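When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in an RNN into a feed-forward sequence (as shown in the picture above). Note that an abstract RNN is eternal (it can be arbitrarily large), but each particular implementation is finite because memory is limited. As a result, the unrolled network really is a long feed-forward network with a few complications, e.g. the weights in different layers are shared.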

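Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs, h[t] (the state that is fed into the next cell) and y[t] (the output at this step), by the following formulas, where Wxh, Whh and Why are the shared parameters: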

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

In code, it's simply three matrices and two bias vectors:

import numpy as np

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

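The forward pass is pretty straightforward; this example uses softmax and cross-entropy loss. Note that each iteration uses the same W* and b* arrays, but the output and the hidden state are different at every step: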

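# forward pass (hprev is the hidden state carried over from the previous chunk)
xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)
loss = 0
for t in range(len(inputs)):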
  xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
  xs[t][inputs[t]] = 1
  hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
  ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
  ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
  loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)

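Now, the backward pass is performed exactly as in a feed-forward network, but the gradients of the W* and b* arrays accumulate the gradients from all cells: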

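# backward pass: compute gradients going backwards through time
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
for t in reversed(range(len(inputs))):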
  dy = np.copy(ps[t])
  dy[targets[t]] -= 1
  dWhy += np.dot(dy, hs[t].T)
  dby += dy
  dh = np.dot(Why.T, dy) + dhnext # backprop into h
  dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
  dbh += dhraw
  dWxh += np.dot(dhraw, xs[t].T)
  dWhh += np.dot(dhraw, hs[t-1].T)
  dhnext = np.dot(Whh.T, dhraw)
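
A common way to convince yourself that the accumulated gradients are right is a finite-difference check. The following is a minimal sketch, not part of char-rnn as quoted here: the helper names `loss_and_grads` and `max_abs_error`, the tiny sizes and the toy sequence are all assumptions made for illustration. It wraps the two passes above in one function and compares every analytic gradient entry with a central-difference estimate.

```python
import numpy as np

def loss_and_grads(inputs, targets, hprev, params):
    """Forward and backward pass of the vanilla RNN; returns (loss, gradients)."""
    Wxh, Whh, Why, bh, by = params
    xs, hs, ps = {}, {-1: np.copy(hprev)}, {}
    loss = 0.0
    for t in range(len(inputs)):
        xs[t] = np.zeros((Wxh.shape[1], 1))  # one-hot input
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        ps[t] = np.exp(y) / np.sum(np.exp(y))
        loss += -np.log(ps[t][targets[t], 0])

    grads = [np.zeros_like(p) for p in params]
    dWxh, dWhh, dWhy, dbh, dby = grads  # same arrays, updated in place below
    dhnext = np.zeros_like(hprev)
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dhnext
        dhraw = (1 - hs[t] * hs[t]) * dh
        dbh += dhraw
        dWxh += dhraw @ xs[t].T
        dWhh += dhraw @ hs[t - 1].T
        dhnext = Whh.T @ dhraw
    return loss, grads

def max_abs_error(inputs, targets, hprev, params, eps=1e-5):
    """Largest gap between analytic and central-difference gradients."""
    _, grads = loss_and_grads(inputs, targets, hprev, params)
    worst = 0.0
    for p, g in zip(params, grads):
        for idx in np.ndindex(*p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = loss_and_grads(inputs, targets, hprev, params)
            p[idx] = old - eps
            loss_minus, _ = loss_and_grads(inputs, targets, hprev, params)
            p[idx] = old  # restore the parameter
            numeric = (loss_plus - loss_minus) / (2 * eps)
            worst = max(worst, abs(numeric - g[idx]))
    return worst

rng = np.random.default_rng(0)
hidden_size, vocab_size = 4, 5  # tiny hypothetical sizes, just for the check
params = [
    rng.standard_normal((hidden_size, vocab_size)) * 0.1,   # Wxh
    rng.standard_normal((hidden_size, hidden_size)) * 0.1,  # Whh
    rng.standard_normal((vocab_size, hidden_size)) * 0.1,   # Why
    rng.standard_normal((hidden_size, 1)) * 0.1,            # bh
    rng.standard_normal((vocab_size, 1)) * 0.1,             # by
]
inputs, targets = [0, 1, 2, 3], [1, 2, 3, 4]  # toy sequence
err = max_abs_error(inputs, targets, np.zeros((hidden_size, 1)), params)
print(err)
```

If the shared-weight accumulation were wrong (for example `=` instead of `+=` on `dWhh`), the reported error would be far larger than floating-point noise.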

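Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients for each cell.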
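
The chunking can be sketched as follows. This is an illustrative forward-only example under assumed names (`run_chunk`, `seq_length`, random toy data), not code from the answer. It shows that carrying the hidden state from one chunk to the next reproduces the forward pass over the whole sequence; what chunking truncates is the backward pass, because `dhnext` starts from zero at the end of every chunk.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size, seq_length = 8, 6, 5  # hypothetical sizes
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.1
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
bh = np.zeros((hidden_size, 1))

def run_chunk(inputs, hprev):
    """Forward pass over one chunk; returns the last hidden state."""
    h = hprev
    for ix in inputs:
        x = np.zeros((vocab_size, 1))  # one-hot encoding of the input index
        x[ix] = 1
        h = np.tanh(Wxh @ x + Whh @ h + bh)
    return h

data = rng.integers(0, vocab_size, size=20)  # toy input sequence

# One pass over the whole sequence: backprop would need all 20 steps in memory.
h_full = run_chunk(data, np.zeros((hidden_size, 1)))

# Chunked: only seq_length steps are unrolled at a time,
# and the hidden state is handed from one chunk to the next.
h = np.zeros((hidden_size, 1))
for start in range(0, len(data), seq_length):
    h = run_chunk(data[start:start + seq_length], h)

print(np.allclose(h_full, h))
```

So a longer chunk does not change what the network computes going forward; it changes how far back the error signal can travel and how much has to be stored.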

What's different in LSTMs

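LSTM pictures and formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, here is the backward pass: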

# Loop over all cells, like before
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))
for t in reversed(range(T)):
  d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
  d_c_next_t = d_c_prev_t
  d_h_next_t = d_h_prev_t

  d_x[:,t,:] = d_x_t
  d_h0 = d_h_prev_t
  d_Wx += d_Wx_t
  d_Wh += d_Wh_t
  d_b += d_b_t

# The step in each cell
# Captures all LSTM complexity in few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
  """
  Backward pass for a single timestep of an LSTM.

  Inputs:
  - d_next_h: Gradients of next hidden state, of shape (N, H)
  - d_next_c: Gradients of next cell state, of shape (N, H)
  - cache: Values from the forward pass

  Returns a tuple of:
  - d_x: Gradient of input data, of shape (N, D)
  - d_prev_h: Gradient of previous hidden state, of shape (N, H)
  - d_prev_c: Gradient of previous cell state, of shape (N, H)
  - d_Wx: Gradient of input-to-hidden weights, of shape (D, 4H)
  - d_Wh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
  - d_b: Gradient of biases, of shape (4H,)
  """
  x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache

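  d_z = o * d_next_h
  d_o = z * d_next_h
  # Add the gradient arriving through next_h = o * tanh(next_c) to the cell-state
  # gradient; a new array is created so the caller's d_next_c is not modified in place.
  d_next_c = d_next_c + (1 - z * z) * d_z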

  d_f = d_next_c * prev_c
  d_prev_c = d_next_c * f
  d_i = d_next_c * g
  d_g = d_next_c * i

  d_a_g = (1 - g * g) * d_g
  d_a_o = o * (1 - o) * d_o
  d_a_f = f * (1 - f) * d_f
  d_a_i = i * (1 - i) * d_i
  d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)

  d_prev_h = d_a.dot(Wh.T)
  d_Wh = prev_h.T.dot(d_a)

  d_x = d_a.dot(Wx.T)
  d_Wx = x.T.dot(d_a)

  d_b = np.sum(d_a, axis=0)

  return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
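
The backward step only makes sense next to the forward step that fills `cache`, which is not shown above. Below is a sketch of what that forward step could look like, inferred from the cache layout and the gate order (i, f, o, g) used in `lstm_step_backward`; the `lstm_step_forward` signature, the `sigmoid` helper and the test sizes are assumptions. The backward function is repeated in condensed form so the block runs on its own, and a finite-difference check compares `d_x` and `d_prev_c` with numeric gradients.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """One LSTM timestep (assumed counterpart of lstm_step_backward)."""
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b       # pre-activations, shape (N, 4H)
    i = sigmoid(a[:, 0:H])                   # input gate
    f = sigmoid(a[:, H:2 * H])               # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])           # output gate
    g = np.tanh(a[:, 3 * H:4 * H])           # candidate cell update
    next_c = f * prev_c + i * g              # new cell state
    z = np.tanh(next_c)
    next_h = o * z                           # new hidden state
    cache = (x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h)
    return next_h, next_c, cache

def lstm_step_backward(d_next_h, d_next_c, cache):
    """Condensed version of the backward step from the answer."""
    x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache
    d_c = d_next_c + (1 - z * z) * o * d_next_h
    d_a = np.concatenate((i * (1 - i) * d_c * g,
                          f * (1 - f) * d_c * prev_c,
                          o * (1 - o) * z * d_next_h,
                          (1 - g * g) * d_c * i), axis=1)
    return (d_a.dot(Wx.T), d_a.dot(Wh.T), d_c * f,
            x.T.dot(d_a), prev_h.T.dot(d_a), d_a.sum(axis=0))

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # hypothetical batch, input and hidden sizes
x = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))
prev_c = rng.standard_normal((N, H))
Wx = rng.standard_normal((D, 4 * H))
Wh = rng.standard_normal((H, 4 * H))
b = rng.standard_normal(4 * H)
dh_up = rng.standard_normal((N, H))  # stand-in upstream gradient for next_h
dc_up = rng.standard_normal((N, H))  # stand-in upstream gradient for next_c

def scalar_loss():
    """Scalar whose gradients w.r.t. next_h and next_c are dh_up and dc_up."""
    nh, nc, _ = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
    return np.sum(nh * dh_up) + np.sum(nc * dc_up)

def numeric_grad(arr, eps=1e-6):
    """Central-difference gradient of scalar_loss with respect to arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = scalar_loss()
        arr[idx] = old - eps
        minus = scalar_loss()
        arr[idx] = old  # restore
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

_, _, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b = lstm_step_backward(dh_up, dc_up, cache)
err_x = np.max(np.abs(d_x - numeric_grad(x)))
err_c = np.max(np.abs(d_prev_c - numeric_grad(prev_c)))
print(err_x, err_c)
```

The cell state gets its own gradient stream (`d_prev_c`), which travels backwards alongside the hidden-state gradient and is scaled only by the forget gate at each step.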

Summary

Now, back to your questions.

My question is how is LSTM backpropagation different than regular Neural Networks

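There are shared weights in different layers, and a few additional variables (states) that you need to pay attention to. Other than that, there is no difference at all.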

Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?

First up, the loss function doesn't have to be L2. In the example above it's a cross-entropy loss, so the initial error signal is its gradient:

# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])  
dy[targets[t]] -= 1

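Note that it's the same error signal as in an ordinary feed-forward neural network. If you use L2 loss, the signal is indeed the difference between the ground truth and the actual output (the sign depends on the convention you pick).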
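
To make the two error signals concrete, here is a small sketch with made-up scores (the values of `y`, `target_ix` and the `softmax` helper are assumptions for illustration). It computes the initial signal for softmax with cross-entropy and for an L2 loss, and confirms the first one by finite differences.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # shift by the max for numerical stability
    return e / np.sum(e)

y = np.array([[1.0], [2.0], [0.5]])  # hypothetical unnormalized scores
target_ix = 1
one_hot = np.zeros_like(y)
one_hot[target_ix] = 1

# Softmax + cross-entropy: the gradient w.r.t. the scores is p - one_hot,
# which is what dy[targets[t]] -= 1 computes.
dy_xent = softmax(y) - one_hot

# L2 loss 0.5 * ||y - one_hot||^2: the gradient is output minus ground truth.
dy_l2 = y - one_hot

# Finite-difference check of the cross-entropy signal.
eps = 1e-6
numeric = np.zeros_like(y)
for k in range(y.shape[0]):
    step = np.zeros_like(y)
    step[k] = eps
    plus = -np.log(softmax(y + step)[target_ix, 0])
    minus = -np.log(softmax(y - step)[target_ix, 0])
    numeric[k] = (plus - minus) / (2 * eps)

print(np.max(np.abs(numeric - dy_xent)))
```

In both cases there is a single error signal per output, computed once from the loss; the gates never get "their own" initial error, they only receive what the chain rule passes down to them.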

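In the case of the LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient of the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you'll see a direct correspondence with the network wiring.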
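
Written as equations, the flow of the error signal through one LSTM cell looks roughly like this. This is a sketch that assumes the forward step is $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ and $h_t = o_t \odot \tanh(c_t)$, which is what `lstm_step_backward` implies; $\delta h_t$ and $\delta c_t$ are shorthand introduced here for the gradients of the loss with respect to $h_t$ and $c_t$, and $\delta h^{\text{rec}}_t$, $\delta c^{\text{rec}}_t$ for the parts arriving from cell $t+1$ (`d_h_next_t` and `d_c_next_t` in the loop).

```latex
\begin{aligned}
\delta h_t &= \frac{\partial L_t}{\partial h_t} + \delta h^{\text{rec}}_t
  && \text{(\texttt{d\_h[:,t,:] + d\_h\_next\_t})} \\
\delta c_t &= \delta c^{\text{rec}}_t + \delta h_t \odot o_t \odot \left(1 - \tanh^2(c_t)\right) \\
\delta o_t &= \delta h_t \odot \tanh(c_t), \qquad
\delta f_t = \delta c_t \odot c_{t-1}, \qquad
\delta i_t = \delta c_t \odot g_t, \qquad
\delta g_t = \delta c_t \odot i_t \\
\delta c^{\text{rec}}_{t-1} &= \delta c_t \odot f_t
\end{aligned}
```

So each gate's error is derived from the same two incoming signals, and the cell state carries its own gradient back to the previous step, multiplied only by the forget gate.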
