Part-I: A gentle introduction to Neural Network Backpropagation
Back-propagation is one of the fundamental concepts in neural networks (NN). The equations of back-propagation are derived using mathematics, specifically partial derivatives. Many good sources that explain the concept of back-propagation, including one of my favorite books by Tom Mitchell, use a neural network architecture with multiple neurons, which throws a curve ball that is hard to navigate, especially for non-experts.
In this article, we use a simple neural network that has only one neuron/unit in each of the input, hidden and output layers, and then derive the back-propagation formulas using standard partial derivatives. To further simplify the explanation, the article is broken into two parts: the NN in Part-I does not use any activation function, while in Part-II we use the sigmoid activation function.
Note that since each derivative below involves only one weight, we could get away with standard derivatives instead of partial derivatives. However, to keep the nomenclature consistent with other texts on back-propagation, we will use partial derivatives.
This article attempts to explain back-propagation in an easier fashion, with a simple NN and explanations at each step. After reading this text, readers are encouraged to work through the more complex derivations given in Mitchell's book and elsewhere to fully grasp the concept of back-propagation. You do need a basic understanding of partial derivatives, and some of the best explanations are available in videos by Sal Khan at Khan Academy.
Our neural network, which does not have any activation function, has 1 input, 1 hidden and 1 output unit (neuron). The input is $i$, the output of the hidden unit is $r_h = w_h \cdot i$, and the output of the output unit is $r_o = w_o \cdot r_h$, where $w_h$ and $w_o$ are the weights of the hidden and output units, respectively.
The error $E$ is defined as $E = \frac{(t - r_o)^2}{2}$, where $t$ is the given (expected) output in the training data. This error function is a modification of the Mean Square Error (MSE), where the additional $\frac{1}{2}$ makes the mathematics easier, as we will see later. Note that we could choose another error function, but this one is simple, making it well suited for our example.
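To make this concrete, here is a minimal Python sketch of the forward pass and the error computation for this network. The variable names mirror the symbols above ($i$, $w_h$, $w_o$, $t$); the numeric values are made-up examples, not values from the text.

```python
# Forward pass and error for the 1-1-1 network with no activation function.
# i = input, w_h / w_o = hidden / output weights, t = target (expected) output.

def forward(i, w_h, w_o):
    r_h = w_h * i        # output of the hidden unit
    r_o = w_o * r_h      # output of the output unit
    return r_h, r_o

def error(t, r_o):
    return (t - r_o) ** 2 / 2    # E = (t - r_o)^2 / 2

# Example values (arbitrary, for illustration only).
i, w_h, w_o, t = 0.5, 0.8, 1.2, 1.0
r_h, r_o = forward(i, w_h, w_o)
print(r_h, r_o, error(t, r_o))   # approximately 0.4, 0.48, 0.1352
```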
We need to find out how $E$ changes in relation to changes in the values of the weights $w_o$ and $w_h$.
Let us first find out how $E$ changes in relation to $w_o$, or the rate of change of $E$ w.r.t. $w_o$. This is also called the derivative of $E$ w.r.t. $w_o$, or $\frac{\partial E}{\partial w_o}$.
$$E = \frac{(t - r_o)^2}{2}, \qquad \frac{\partial E}{\partial w_o} = \frac{\partial}{\partial w_o}\left[\frac{(t - r_o)^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_o} \quad \text{... eq. (1.1)}$$
Let us now revisit the chain rule:
$$\frac{dx}{dy} = \frac{dx}{dz} \cdot \frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y}$$
Applying the chain rule to the right-hand term of eq. (1.1):
Here, $z = t - r_o$, $x = z^2$, and $y = w_o$. Therefore,
$$\frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_o} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial (t - r_o)} \cdot \frac{\partial (t - r_o)}{\partial w_o} = \frac{1}{2} \cdot 2(t - r_o) \cdot \frac{\partial (t - r_o)}{\partial w_o} \quad \text{... eq. (1.2)}$$
Further,
$$\frac{\partial (t - r_o)}{\partial w_o} = \frac{\partial (t - w_o \cdot r_h)}{\partial w_o} = \frac{\partial t}{\partial w_o} - \frac{\partial (w_o \cdot r_h)}{\partial w_o} = \frac{\partial t}{\partial w_o} - r_h \cdot \frac{\partial w_o}{\partial w_o} = 0 - r_h$$
(as $\frac{\partial C}{\partial x} = 0$ and $\frac{\partial x}{\partial x} = 1$, where $C$ is a constant, like $t$).
Based on the above, in eq. (1.2),
$$\frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_o} = \frac{1}{2} \cdot 2(t - r_o) \cdot \frac{\partial (t - r_o)}{\partial w_o} = \frac{1}{2} \cdot 2(t - r_o) \cdot (0 - r_h) = (t - r_o) \cdot (-r_h)$$
Therefore,
$$\frac{\partial E}{\partial w_o} = \frac{\partial}{\partial w_o}\left[\frac{(t - r_o)^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_o} = (t - r_o) \cdot (-r_h) = (t - r_o) \cdot (-w_h \cdot i)$$
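As a sanity check, this result can be compared against a numerical estimate of the derivative. The sketch below is illustrative only; it reuses the made-up example values from the earlier snippet and approximates $\frac{\partial E}{\partial w_o}$ with a central finite difference.

```python
# Compare the derived gradient (t - r_o) * (-w_h * i) with a numerical estimate.

def error_fn(i, w_h, w_o, t):
    return (t - w_o * w_h * i) ** 2 / 2    # E written as a function of the weights

i, w_h, w_o, t = 0.5, 0.8, 1.2, 1.0        # same illustrative values as before
eps = 1e-6

r_o = w_o * w_h * i
analytic = (t - r_o) * (-w_h * i)          # result of the derivation above

# Central finite-difference approximation of dE/dw_o.
numeric = (error_fn(i, w_h, w_o + eps, t) - error_fn(i, w_h, w_o - eps, t)) / (2 * eps)

print(analytic, numeric)                   # both close to -0.208
```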
Let us now find how $E$ changes in relation to $w_h$, or the rate of change of $E$ w.r.t. $w_h$, the weight of the hidden layer. This is also called the derivative of $E$ w.r.t. $w_h$, or $\frac{\partial E}{\partial w_h}$.
$$\frac{\partial E}{\partial w_h} = \frac{\partial}{\partial w_h}\left[\frac{(t - r_o)^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_h} \quad \text{... eq. (2.1)}$$
Applying the chain rule to the right-hand term of eq. (2.1):
Here, $z = t - r_o$, $x = z^2$, and $y = w_h$. Therefore,
$$\frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_h} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial (t - r_o)} \cdot \frac{\partial (t - r_o)}{\partial w_h} = \frac{1}{2} \cdot 2(t - r_o) \cdot \frac{\partial (t - r_o)}{\partial w_h} \quad \text{... eq. (2.2)}$$
Solving for $\frac{\partial (t - r_o)}{\partial w_h}$:
$$\frac{\partial (t - r_o)}{\partial w_h} = \frac{\partial (t - w_o \cdot r_h)}{\partial w_h}$$
As $r_h = w_h \cdot i$,
$$\frac{\partial (t - r_o)}{\partial w_h} = \frac{\partial (t - w_o \cdot w_h \cdot i)}{\partial w_h} = \frac{\partial t}{\partial w_h} - \frac{\partial (w_o \cdot w_h \cdot i)}{\partial w_h} = \frac{\partial t}{\partial w_h} - w_o \cdot i \cdot \frac{\partial w_h}{\partial w_h} = 0 - w_o \cdot i$$
(as $\frac{\partial C}{\partial x} = 0$ and $\frac{\partial x}{\partial x} = 1$, where $C$ is a constant).
Applying the computed value of $\frac{\partial (t - r_o)}{\partial w_h}$ in eq. (2.2),
$$\frac{1}{2} \cdot \frac{\partial (t - r_o)^2}{\partial w_h} = \frac{1}{2} \cdot 2(t - r_o) \cdot \frac{\partial (t - r_o)}{\partial w_h} = \frac{1}{2} \cdot 2(t - r_o) \cdot (0 - w_o \cdot i) = (t - r_o) \cdot (-w_o \cdot i)$$
Therefore,
$$\frac{\partial E}{\partial w_h} = \frac{1}{2} \cdot 2(t - r_o) \cdot \frac{\partial (t - r_o)}{\partial w_h} = (t - r_o) \cdot (-w_o \cdot i)$$
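The same kind of numerical check works for $\frac{\partial E}{\partial w_h}$; the sketch below again uses the made-up values, and only the weight being perturbed changes.

```python
# Compare the derived gradient (t - r_o) * (-w_o * i) with a numerical estimate.

def error_fn(i, w_h, w_o, t):
    return (t - w_o * w_h * i) ** 2 / 2    # E written as a function of the weights

i, w_h, w_o, t = 0.5, 0.8, 1.2, 1.0        # same illustrative values as before
eps = 1e-6

r_o = w_o * w_h * i
analytic = (t - r_o) * (-w_o * i)          # result of the derivation above

# Central finite-difference approximation of dE/dw_h.
numeric = (error_fn(i, w_h + eps, w_o, t) - error_fn(i, w_h - eps, w_o, t)) / (2 * eps)

print(analytic, numeric)                   # both close to -0.312
```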
Both $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$ determine the rate of change of $E$ w.r.t. $w_o$ and $w_h$, respectively.
When the rate of change becomes 0, or close to it, it means we have reached the lowest point in the error space, signifying the lowest possible error. The goal is to change the weights in such a way that we reach this lowest point after multiple iterations.
If the rate of change is positive, then increasing a weight further will increase the error; to reduce the error, we need to move the weight in the opposite direction. Hence, weights $w_o$ and $w_h$ have to be modified, where the direction of change (increase or decrease in the value of the weights) is determined by $-\frac{\partial E}{\partial w_o}$ and $-\frac{\partial E}{\partial w_h}$ for $w_o$ and $w_h$, respectively. Note the negative sign, which ensures the weights are modified in the direction that reduces the error.
Based on this idea, the weight update rule is given as:
The weight for the output layer ($w_o$) will be updated as:
$$\Delta w_o = -\eta \cdot \frac{\partial E}{\partial w_o} = -\eta \cdot (t - r_o) \cdot (-w_h \cdot i) = \eta \cdot (t - r_o) \cdot (w_h \cdot i) \;\Rightarrow\; w_o = w_o + \Delta w_o$$
The weight for the hidden layer ($w_h$) will be updated as:
$$\Delta w_h = -\eta \cdot \frac{\partial E}{\partial w_h} = -\eta \cdot (t - r_o) \cdot (-w_o \cdot i) = \eta \cdot (t - r_o) \cdot (w_o \cdot i) \;\Rightarrow\; w_h = w_h + \Delta w_h$$
Here $\eta$ is the learning rate and controls the magnitude by which the weights are changed.
The weights can be updated after each training sample (stochastic gradient descent), after a small group of training samples (mini-batch gradient descent), or after a full pass over the training data (standard/batch gradient descent).
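Putting the two update rules together, here is a sketch of stochastic gradient descent for this network. The training pairs, initial weights, learning rate and number of epochs are assumptions made for illustration; the update formulas themselves are the ones derived above.

```python
# Stochastic gradient descent for the 1-1-1 linear network.
# Training data, initial weights, learning rate and epoch count are illustrative.

eta = 0.1                                      # learning rate
w_h, w_o = 0.2, 0.3                            # arbitrary initial weights
data = [(0.5, 1.0), (1.0, 2.0), (-0.5, -1.0)]  # (i, t) pairs sampled from t = 2*i

for epoch in range(50):
    for i, t in data:
        r_h = w_h * i                              # forward pass
        r_o = w_o * r_h
        delta_w_o = eta * (t - r_o) * (w_h * i)    # derived update for w_o
        delta_w_h = eta * (t - r_o) * (w_o * i)    # derived update for w_h
        w_o = w_o + delta_w_o
        w_h = w_h + delta_w_h

print(w_h, w_o, w_h * w_o)   # the product w_h * w_o should approach 2.0
```

Since there is no activation function, the whole network simply computes $r_o = w_o \cdot w_h \cdot i$, so it is really the product of the two weights that gets fitted to the data.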
An important thing to note here is that we do not have to update the weights using the actual numeric values of $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$, since the only information we strictly need is their ± sign, which tells us whether the weights have to be increased or decreased by some amount. However, for simplicity, the original back-propagation uses their values directly, with $\eta$ controlling their impact on the weight update.
In later variants of back-propagation (like RProp), the weights are not updated using the numeric values of $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$; only their ± sign is used.
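As a rough illustration of the sign-only idea, a simplified sketch could look like the following; this is not the full RProp algorithm, which additionally adapts a per-weight step size based on whether the gradient's sign changed between iterations.

```python
def sign(x):
    return (x > 0) - (x < 0)        # +1, -1, or 0

def sign_update(w, grad, step=0.01):
    # Move the weight by a fixed step against the sign of the gradient,
    # ignoring the gradient's magnitude. The step size is an assumed value.
    return w - step * sign(grad)
```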
In Part-II we will derive back-propagation for this same NN with the sigmoid activation function added to both the hidden and output layers.
Love or hate this article? Have other ideas? Please leave a comment below!