Part-III: A gentle introduction to Neural Network Backpropagation


In part-II, we derived the back-propagation formula using a simple neural net architecture using the Sigmoid activation function. In this article, which is a follow on to part-II we expand upon the NN architecture, to add one more hidden layer and derive a generic backpropagation equation that can be applied to deep (multi-layered) neural networks.

This article attempts to explain back-propagation in an easier fashion with a simple NN, where each steps has been expanded in great detail and has explanations. After reading this text readers are encouraged to understand the more complex derivations given in Mitchell's book and elsewhere, to fully grasp the concept of back-propagation. You do need a basic understanding of partial derivatives and one of the best explanations are available in videos by Sal Khan at Khan academy.

The neural network which has 1 input, 2 hidden and 1 output units(neuron) along with the sigmoid activation function at each hidden and output layer, is given as:





Here, rh0=σ(wh0.i), rh=σ(wh.rh0) and ro=σ(wo.rh).

If t is given(expected) output in the training data, the Error E is defined as:

E=(tro)22Ewo=(tro)22wo=12.(tro)2wo ... eq.(1.1)


Let's revisit the chain rule:
dxdy=dxdz.dzdy
For a partial derivative form:
xy=xz.zy
Applying the chain rule to the right side of eq.(1.1),
Here, z=tro,x=z2, and y=wo ,therefore, 12.(tro)2wo=12.xy=12.xz.zy=12.z2z.zy=12.(tro)2(tro).(tro)wo=12.2(tro).(tro)wo ... eq.(1.2)
Further,
(tro)wo=(tσ(wo.rh))wo=twoσ(wo.rh)wo=0σ(wo.rh)wo (as,Cx=0,where C is a constant), ... eq.(1.3)
Let's solve for σ(wo.rh)wo,
Applying the chain rule again,
Here, z=wo.rh,x=σ(z), and y=wo ,therefore, σ(wo.rh)wo=xy=xz.zy=σ(z)z.zy=σ(wo.rh)(wo.rh).(wo.rh)wo ... eq.(1.4)
The sigmoid function is σ(x)=11+ex, with its derivative being:
dσ(x)dx=σ(x).(1σ(x))
Applying the sigmoid differentiation rule to eq.(1.4),
σ(wo.rh)wo=σ(wo.rh)(wo.rh).(wo.rh)wo=σ(wo.rh).(1σ(wo.rh)).(wo.rh)wo=σ(wo.rh).(1σ(wo.rh)).rh.(wo)wo=σ(wo.rh).(1σ(wo.rh)).rh
Therefore in eq.(1.3),
(tro)wo=0σ(wo.rh)wo=0σ(wo.rh).(1σ(wo.rh)).rh
And in eq.(1.2),
12.(tro)2wo=12.2(tro).(tro)wo=(tro).(0σ(wo.rh).(1σ(wo.rh)).rh)

Using the above derived value of 12.(tro)2wo,
Ewo=(tro)22wo=12.(tro)2wo=(tro).(0σ(wo.rh).(1σ(wo.rh)).rh)=(tro).σ(wo.rh).(1σ(wo.rh))).rh=(tro).ro.(1ro).rh (as,ro=σ(wo.rh)) ...eq.(1.5)
Let us now find how E changes in relation to wh or the rate of change of E w.r.t. wh the weight of the second hidden layer. This is also called the derivative of E w.r.t wh or Ewh.
Ewh=(tro)22wh=12.(tro)2wh ...eq.(2.1)

Applying the chain rule to the right side of eq.(2.1),
Here, z=tro,x=z2, and y=wh ,therefore, 12.(tro)2wh=12.xy=12.xz.zy=12.z2z.zy=12.(tro)2(tro).(tro)wh=12.2(tro).(tro)wh ... eq.(2.2)

Solve for the right side of eq.(2.2),(tro)wh,
(tro)wh=(tσ(wo.rh))wh=twhσ(wo.rh)wh=0σ(wo.rh)wh(as t is a constant) ...eq.(2.3) 
Applying the chain rule to right side of eq.(2.3), σ(wo.rh)wh
Here, z=wo.rh,x=σ(z), and y=wh ,therefore, σ(wo.rh)wh=xy=xz.zy=σ(z)z.zy=σ(wo.rh)(wo.rh).(wo.rh)wh ... eq.(2.4)
We know,
dσ(x)dx=σ(x).(1σ(x))
Applying the sigmoid differentiation rule to eq.(2.4),
σ(wo.rh)wh=σ(wo.rh)(wo.rh).(wo.rh)wh=σ(wo.rh).(1σ(wo.rh)).(wo.rh)wh=ro.(1ro).(wo.rh)wh(as,ro=σ(wo.rh)) ... eq.(2.5)
Solve for right side of eq.(2.5), (wo.rh)wh,
(wo.rh)wh=(wo.σ(wh.rh0))wh(as,rh=σ(wh.rh0))=wo.σ(wh.rh0)wh ... eq.(2.6)
Applying the chain rule to right side of eq.(2.6), σ(wh.rh0)wh,
Here, z=wh.rh0,x=σ(z), and y=wh ,therefore, σ(wh.rh0)wh=xy=xz.zy=σ(z)z.zy=σ(wh.rh0)(wh.rh0).(wh.rh0)wh=σ(wh.rh0)(1σ(wh.rh0).(rh0.whwh) (as,dσ(x)dx=σ(x).(1σ(x)))=σ(wh.rh0).(1σ(wh.rh0)).rh0 ... eq.(2.7)
Hence in eq.(2.6),
(wo.rh)wh=wo.σ(wh.rh0)wh=wo.σ(wh.rh0).(1σ(wh.rh0)).rh0 (from, eq.(2.7)) ...eq.(2.8)
Applying this value in eq.(2.5),
σ(wo.rh)wh=ro.(1ro).(wo.rh)wh=ro.(1ro).wo.σ(wh.rh0).(1σ(wh.rh0)).rh0 (from, eq.(2.8))
Applying this in eq.(2.3),
(tro)wh=0σ(wo.rh)wh=0ro.(1ro).wo.σ(wh.rh0).(1σ(wh.rh0)).rh0
And in eq.(2.2),
12.(tro)2wh=12.2(tro).(tro)wh=12.2(tro).0ro.(1ro).wo.σ(wh.rh0).(1σ(wh.rh0)).rh0 (from, eq.(2.3))

Using the above derived value of 12.(tro)2wh,
Ewh=(tro)22wh=12.(tro)2wh=12.2(tro).(0ro.(1ro).wo.σ(wh.i).(1σ(wh.rh0)).rh0)=(tro).(0ro.(1ro).wo.σ(wh.rh0).(1σ(wh.rh0)).rh0)=(tro).ro.(1ro).wo.rh.(1rh).rh0 ...eq.(2.9) (as,rh=σ(wh.rh0))
Let us now find how E changes in relation to wh0 or the rate of change of E w.r.t. wh0 the weight of the first hidden layer.
Ewh0=(tro)22wh0=12.(tro)2wh0 ...eq.(3.1)

Applying the chain rule to the right side of eq.(3.1),
Here, z=tro,x=z2, and y=wh0 ,therefore, 12.(tro)2wh0=12.xy=12.xz.zy=12.z2z.zy=12.(tro)2(tro).(tro)wh0=12.2(tro).(tro)wh0 ... eq.(3.2)
Solve for the right side of eq.(3.2),(tro)wh0,
(tro)wh0=(tσ(wo.rh))wh0=twh0σ(wo.rh)wh0=0σ(wo.rh)wh0 ...eq.(3.3) 
Applying the chain rule to right side of eq.(3.3), σ(wo.rh)wh0
Here, z=wo.rh,x=σ(z), and y=wh0 ,therefore, σ(wo.rh)wh0=xy=xz.zy=σ(z)z.zy=σ(wo.rh)(wo.rh).(wo.rh)wh0 ... eq.(3.4)
We know,
dσ(x)dx=σ(x).(1σ(x))
Applying the sigmoid differentiation rule to eq.(3.4),
σ(wo.rh)wh0=σ(wo.rh)(wo.rh).(wo.rh)wh0=σ(wo.rh).(1σ(wo.rh)).(wo.rh)wh0=ro.(1ro).(wo.rh)wh0(as,ro=σ(wo.rh)) ... eq.(3.5)
Solve for right side of eq.(3.5), (wo.rh)wh0,
(wo.rh)wh0=(wo.σ(wh.rh0))wh0(as,rh=σ(wh.rh0))=wo.σ(wh.rh0)wh0 ... eq.(3.6)
Applying the chain rule to right side of eq.(3.6), σ(wh.rh0)wh0,
Here, z=wh.rh0,x=σ(z), and y=wh0 ,therefore, σ(wh.rh0)wh0=xy=xz.zy=σ(z)z.zy=σ(wh.rh0)(wh.rh0).(wh.rh0)wh0=σ(wh.rh0).(1σ(wh.rh0)).(wh.rh0)wh0 ... eq.(3.7) (as,d(σ(x))dx=σ(x).(1σ(x)))
Let us solve for the right term of eq.(3.7),(wh.rh0)wh0
(wh.rh0)wh0=(wh.(σ(wh0.i)))wh0(as,rh0=who.i)=wh.(σ(wh0.i))wh0 ... eq.(3.8)
Applying the chain rule to right side of eq.(3.8), σ(wh0.i)wh0,
Here, z=wh0.i,x=σ(z), and y=wh0 ,therefore, σ(wh0.i)wh0=xy=σ(z)z.zy=σ(wh0.i)(wh0.i).(wh0.i)wh0=σ(wh0.i).(1σ(wh0.i)).i.(wh0)wh0 (applying value ofσ(x)x)=σ(wh0.i).(1σ(wh0.i)).i (as,xx=1) ... eq. (3.9)
Applying value of eq. (3.9) in eq.(3.8),
(wh.rh0)wh0=wh.σ(wh0.i)wh0=wh.σ(wh0.i).(1σ(wh0.i)).i (from, eq.(3.9)) ... eq.(3.10)
Applying value in eq.(3.10) in eq.(3.7),
σ(wh.rh0)wh0=σ(wh.rh0).(1σ(wh.rh0).(wh.rh0)wh0=σ(wh.rh0)(1σ(wh.rh0)).wh.σ(wh0.i).(1σ(wh0.i)).i(putting value of(wh.rh0)wh0 from eq.(3.10))=rh.(1rh.wh.rh0.(1rh0).i) ... eq.(3.11)(substituting values of σ(wh.rh0) and σ(wh0.i) by rh,rh0 respectively, for the sake of brevity)
Applying value in eq.(3.11) in eq.(3.6),
(wo.rh)wh0=wo.σ(wh.rh0)wh0=wo.rh.(1rh.wh.rh0.(1rh0).i)... eq.(3.12)(substituting value of σ(wh.rh0)wh0 from eq.(3.11)
Applying value of eq.(3.12) in eq.(3.5),
σ(wo.rh)wh0=ro.(1ro).(wo.rh)wh0=ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i ... eq.(3.13)(from eq. (3.12))
Applying value of eq.(3.13) in eq.(3.3),
(tro)wh0=0σ(wo.rh)wh0=0(ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i)... eq.(3.14)(from eq. (3.13))
Applying value of eq.(3.14) in eq.(3.2),
12.(tro)2wh0=12.2(tro).(tro)wh0=12.2(tro).(0(ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i))(from eq. (3.14))=12.2(tro).(ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i)

Using the above derived value in of 12.(tro)2wh0,
Ewh0=(tro)22wh0=12.2(tro).(ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i)=(tro).ro.(1ro).wo.rh.(1rh).wh.rh0.(1rh0).i ...eq.(3.15)
General rule:
From eqs. (1.5),(2.9) and (3.15) we see that,
Ewo=(tro).ro.(1ro).rhEwh=(tro).ro.(1ro).rh.wo.(1rh).rh0Ewh0=(tro).ro.(1ro).rh.wo.(1rh).rh0.wh.(1rh0).i
A clear pattern can be seen where the rate of change of E w.r.t. each unit weight is dependent on the rate of change of E of the downstream unit. Hence, we can modify the above equations to:
Ewo=(tro).ro.(1ro).rhEwh=Ewo.wo.(1rh).rh0Ewh0=Ewh.wh.(1rh0).i
The general form of Ewu, is:
Output unit,Ewu=(tro).ro.(1ro).iuHidden unit,Ewu=Ewu+1.wu+1.(1ru).iuor,Ewu=(tro).ro.(1ro).wu+1.(1ru).iuwhere,wu: weight of unit uwu+1:weight of unit downstream of uru: output of unit uiu: input to the unit u
Ewu determines the rate of change of E w.r.t. wu. When the rate of change becomes 0 or close to it, it means we have reached the lowest point in the error space signifying the lowest possible error. The goal is to change the weights in such a way such that we reach this lowest point after multiple iterations.

The rate of change is positive which means if we keep heading in its current direction (by modifying the weights) the rate of change will keep on increasing. To reduce this rate of change, we need to go in the opposite direction. Hence, weight wu has to be modified where the sign of change (increase or decrease in value of the weights) is determined by Ewu. Note the negative sign which allows weights to be correctly modified in such a way that reduces the rate of change.

Based on this idea, the generic weight update rule for a unit wu is given as:

Δwu=η.Ewufor an output unit,=η.(tro).ro.(1ro).iu=η.(tro).ro.(1ro).iufor a hidden unit,=η.(tro).ro.(1ro).wu+1.(1ru).iu=η.(tro).ro.(1ro).wu+1.(1ru).iu=weight update for all,wu=wu+Δwuwhere,wu :weight of unit uwu+1 :weight of unit downstream of uru :output of unit uiu :the input to the unit u

Here η is the learning rate and controls the numeric value by which weights are changed.

The weights can be updated for each training sample (stochastic gradient descent) or after multiple training samples (batch gradient descent) or after exhausting the training data (standard/batch gradient descent).

An important thing to note here is that we do not have to update the weights on actual numeric values of Ewu, since the only information we need is their ±sign to determine if weights have to be increased or decreased. However for simplicity, the original back-propagation uses their values directly, with η controlling their impact on weight update.

In later versions of back-propagation (like RProp), weights are not updated on numeric values of Ewo and Ewh, but their ± sign is used.

This concludes the esculent introduction to NN backpropagation. In another article, we will take derive the backpropagation for a convolution neural network (CNN).

Love or hate this article ? Have other ideas ? Please leave a comment below !

Comments

Popular posts from this blog

Part I: Backpropagation mechanics for a Convolutional Neural Network

Introducing Convolution Neural Networks with a simple architecture

Python: Sockets Programming - Non-blocking Client