Part-I: Neural Net Backpropagation using representative equation transformation
Backpropagation in Neural Networks (NNs) is a critical step in adjusting the weights of each layer, making the NN model more accurate on the provided training data.
In previous articles, we derived the backpropagation for a Neural Network using the architecture of the network. In this series of articles, we convert the neural network architecture into a representative linear equation and use it to derive the backpropagation.
As in other articles, we start with a very simple NN and proceed to more complex NNs in subsequent articles. Consider the NN below with 1 hidden layer and only a single neuron in each layer.
We have derived the backpropagation of this NN without using the equation form here.
The output of this NN is $r_o$, and the prediction error for each training sample, $E$, is defined as:

$$E = \frac{(t - r_o)^2}{2} \qquad \text{...eq.(1.1)}$$
where $t$ is the given (expected) output for that specific training example. Our goal is to minimize $E$ by making $r_o$ as close to $t$ as possible.
In the output layer, $r_o$ is given as $r_o = w_o \cdot r_h$. Substituting the value of $r_h$, we can write $r_o = w_o \cdot (w_h \cdot i)$. Hence the linear equation:

$$r_o = w_o \cdot (w_h \cdot i)$$
is the linear representative equation of the NN shown above. For any input $i$, we will get its corresponding output $r_o$. Since $i$ is fixed, the only values we can modify when training this equation are $w_o$ and $w_h$. If we adopt a brute-force approach and arbitrarily modify $w_o$ and $w_h$, we are not guaranteed to arrive at optimized values of $w_o$ and $w_h$. There is a method to this madness, and that method is backpropagation.
We know that during training, $E = \frac{(t - r_o)^2}{2}$. Substituting the full expression for $r_o$, we get:

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \qquad \text{...eq.(1.2)}$$
This is the representative cost function of the NN. Our goal is to reduce $E$ such that in an ideal case it becomes 0 (this ideal case is extremely rare when training NNs). The way we will reduce $E$ is by changing (increasing or decreasing) the values of $w_o$ and $w_h$. Hence, the main thing we need to find out is how $E$ changes (increases/decreases) as $w_o$ and $w_h$ are increased (or decreased). This 'rate of change' of $E$ is determined by the partial derivatives of $E$ w.r.t. $w_o$ and $w_h$. In other words, we need to find $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$.
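Before diving into the derivatives, here is a minimal Python sketch of the representative equation and cost function above. The variable names (`i`, `t`, `w_h`, `w_o`) and the sample values are illustrative choices, not part of the derivation:

```python
def forward(i, w_h, w_o):
    """Forward pass of the 1-hidden-layer, single-neuron network: r_o = w_o * (w_h * i)."""
    r_h = w_h * i      # hidden layer output
    r_o = w_o * r_h    # output layer output
    return r_o

def error(t, r_o):
    """Prediction error for one sample, eq.(1.1): E = (t - r_o)^2 / 2."""
    return (t - r_o) ** 2 / 2

# Example: input i = 2, expected output t = 1, arbitrary starting weights
i, t = 2.0, 1.0
w_h, w_o = 0.5, 0.3
r_o = forward(i, w_h, w_o)
print(r_o, error(t, r_o))   # r_o = 0.3, E = 0.245
```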
Let us start by finding $\frac{\partial E}{\partial w_o}$.

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \quad \text{(from eq.(1.2))}$$

$$\frac{\partial E}{\partial w_o} = \frac{\partial}{\partial w_o}\left[\frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_o} \qquad \text{...eq.(2.1)}$$
Let us revisit the chain rule:
$$\frac{dx}{dy} = \frac{dx}{dz} \cdot \frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y}$$
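As a quick worked example of this rule (with illustrative constants $a$ and $b$, not part of the NN), differentiating a squared term with respect to a variable that appears inside it gives:

$$\frac{\partial (a - b \cdot y)^2}{\partial y} = \frac{\partial (a - b \cdot y)^2}{\partial (a - b \cdot y)} \cdot \frac{\partial (a - b \cdot y)}{\partial y} = 2(a - b \cdot y) \cdot (-b)$$

This is exactly the pattern we follow below, with $z = t - (w_o \cdot w_h \cdot i)$ playing the role of the inner term.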
Applying the chain rule to the right term of eq.(2.1):
Here, $z = t - (w_o \cdot w_h \cdot i)$, $x = z^2$, and $y = w_o$, therefore,

$$\frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_o} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y}$$

$$= \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial (t - (w_o \cdot w_h \cdot i))} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o}$$

$$= \frac{1}{2} \cdot 2\,(t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o} \qquad \text{...eq.(2.2)}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(\frac{\partial t}{\partial w_o} - \frac{\partial (w_o \cdot w_h \cdot i)}{\partial w_o}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(0 - w_h \cdot i \cdot \frac{\partial w_o}{\partial w_o}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot (-w_h \cdot i)$$

$$= -(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i) \qquad \text{...eq.(2.3)}$$
Therefore from eq. (2.3),
$$\frac{\partial E}{\partial w_o} = -(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i) \qquad \text{...eq.(2.4)}$$
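As a sanity check on eq.(2.4), the sketch below (an illustrative check, not part of the original derivation) compares the analytical gradient with a finite-difference approximation of $\frac{\partial E}{\partial w_o}$ computed directly from eq.(1.2):

```python
def grad_wo(i, t, w_h, w_o):
    """Analytical gradient, eq.(2.4): dE/dw_o = -(t - w_o*w_h*i) * (w_h*i)."""
    return -(t - w_o * w_h * i) * (w_h * i)

def numerical_grad_wo(i, t, w_h, w_o, eps=1e-6):
    """Central finite-difference approximation of dE/dw_o using the cost of eq.(1.2)."""
    E = lambda wo: (t - wo * w_h * i) ** 2 / 2
    return (E(w_o + eps) - E(w_o - eps)) / (2 * eps)

i, t, w_h, w_o = 2.0, 1.0, 0.5, 0.3
print(grad_wo(i, t, w_h, w_o))            # -0.7
print(numerical_grad_wo(i, t, w_h, w_o))  # approximately -0.7
```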
Let us now find $\frac{\partial E}{\partial w_h}$.

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \quad \text{(from eq.(1.2))}$$

$$\frac{\partial E}{\partial w_h} = \frac{\partial}{\partial w_h}\left[\frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_h} \qquad \text{...eq.(3.1)}$$
From the chain rule we know,
$$\frac{dx}{dy} = \frac{dx}{dz} \cdot \frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y}$$
Applying the chain rule to the right term of eq.(3.1):
Here, $z = t - (w_o \cdot w_h \cdot i)$, $x = z^2$, and $y = w_h$, therefore,

$$\frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_h} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y}$$

$$= \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial (t - (w_o \cdot w_h \cdot i))} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h}$$

$$= \frac{1}{2} \cdot 2\,(t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h} \qquad \text{...eq.(3.2)}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(\frac{\partial t}{\partial w_h} - \frac{\partial (w_o \cdot w_h \cdot i)}{\partial w_h}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(0 - w_o \cdot i \cdot \frac{\partial w_h}{\partial w_h}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot (-w_o \cdot i)$$

$$= -(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i) \qquad \text{...eq.(3.3)}$$
Therefore from eq. (3.3),
$$\frac{\partial E}{\partial w_h} = -(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i) \qquad \text{...eq.(3.4)}$$
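The same kind of check can be applied to eq.(3.4); compared with eq.(2.4), only the scaling factor changes from $w_h \cdot i$ to $w_o \cdot i$ (again, the variable names and values are illustrative):

```python
def grad_wh(i, t, w_h, w_o):
    """Analytical gradient, eq.(3.4): dE/dw_h = -(t - w_o*w_h*i) * (w_o*i)."""
    return -(t - w_o * w_h * i) * (w_o * i)

i, t, w_h, w_o = 2.0, 1.0, 0.5, 0.3
print(grad_wh(i, t, w_h, w_o))  # -(1 - 0.3) * (0.3 * 2) = -0.42
```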
Once $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$ are determined, we know whether increasing (or decreasing) $w_o$ and $w_h$ will reduce $E$.
The weight of the output layer ($w_o$) will be updated as:
$$\Delta w_o = -\eta \cdot \frac{\partial E}{\partial w_o} = -\eta \cdot \left[-(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i)\right] \quad \text{(from eq.(2.4))}$$

$$= \eta \cdot (t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i)$$

$$\Rightarrow w_o = w_o + \Delta w_o$$
The weight of the hidden layer ($w_h$) will be updated as:
$$\Delta w_h = -\eta \cdot \frac{\partial E}{\partial w_h} = -\eta \cdot \left[-(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i)\right] \quad \text{(from eq.(3.4))}$$

$$= \eta \cdot (t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i)$$

$$\Rightarrow w_h = w_h + \Delta w_h$$
Here $\eta$ is the learning rate and controls the magnitude by which the weights are changed. The minus sign before $\eta$ is very important. It ensures that if $E$ is increasing with an increase in $w_o$/$w_h$ (i.e. $\frac{\partial E}{\partial w_o}$ or $\frac{\partial E}{\partial w_h}$ is $> 0$), then the value of $w_o$/$w_h$ needs to be decreased, and therefore $\Delta w_o$ and $\Delta w_h$ will be $< 0$.
If $E$ is decreasing with an increase in $w_o$/$w_h$ (i.e. $\frac{\partial E}{\partial w_o}$ or $\frac{\partial E}{\partial w_h}$ is negative), then we continue increasing the value of $w_o$/$w_h$, and therefore $\Delta w_o$ and $\Delta w_h$ will be $> 0$.
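Putting the update rules together, here is a minimal gradient-descent training loop for this network, a sketch under the assumptions used so far (scalar input, no bias, no activation function); the learning rate of 0.1 and the step count are arbitrary illustrative choices:

```python
def train(i, t, w_h, w_o, eta=0.1, steps=50):
    """Repeatedly apply the weight updates w = w + delta_w derived above."""
    for _ in range(steps):
        r_o = w_o * (w_h * i)                   # forward pass: representative equation
        delta_wo = eta * (t - r_o) * (w_h * i)  # delta_w_o = -eta * dE/dw_o, from eq.(2.4)
        delta_wh = eta * (t - r_o) * (w_o * i)  # delta_w_h = -eta * dE/dw_h, from eq.(3.4)
        w_o += delta_wo                         # w_o = w_o + delta_w_o
        w_h += delta_wh                         # w_h = w_h + delta_w_h
    return w_h, w_o

w_h, w_o = train(i=2.0, t=1.0, w_h=0.5, w_o=0.3)
print(w_o * (w_h * 2.0))   # r_o approaches the target t = 1.0 as training proceeds
```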
In this article, we looked at how a simple NN can be represented by a single linear equation and used that equation to derive the backpropagation. In the next article, we will look at a more complex NN that uses the Sigmoid activation function.