Part-I: Neural Net Backpropagation using representative equation transformation


Backpropagation in Neural Networks (NN) is a critical step in adjusting the weights of each layer and making the NN model more accurate based on the provided training data.

In previous articles, we derived the backpropagation for a Neural Network using the architecture of the network. In this series of articles, we convert the neural network architecture into a representative linear equation and use it to derive the backpropagation.

As in other articles, we start with a very simple NN and proceed to more complex NNs in subsequent articles. Consider the NN below with 1 hidden layer and only a single neuron in each layer.


We have already derived the backpropagation of this NN without using the equation form here.

The output of this NN is \(r_o\), and the prediction error for each training sample \(E\) is defined as:
$$
E = {(t - r_o)^2\over 2} \text { ...eq.(1.1)}
$$
where \(t\) is the given (expected) output for that specific training example. Our goal is to minimize \(E\) by making \(r_o\) as close to \(t\) as possible.

In the hidden layer, the output is \(r_h=w_h.i\), and in the output layer, \(r_o\) is given as \(r_o=w_o.r_h\). Substituting the value of \(r_h\), we can write \(r_o=w_o.(w_h.i)\). Hence the linear equation:
  \(r_o=w_o.(w_h.i)\)

is the linear representative equation of the NN shown above. For any input \(i\), we will get its corresponding output \(r_o\). Since \(i\) is fixed, the only values we can modify when training this equation are \(w_o\) and \(w_h\). If we adopt a brute-force approach and arbitrarily modify \(w_o\) and \(w_h\), we are not guaranteed to arrive at optimized values of \(w_o\) and \(w_h\). There is a method to this madness, and that method is backpropagation.
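To make this concrete, here is a minimal Python sketch of the forward pass through the representative equation. The variable names mirror the symbols above; the specific numbers for \(i\), \(w_h\), and \(w_o\) are arbitrary values assumed only for illustration.

```python
# Minimal sketch of the forward pass r_o = w_o * (w_h * i).
# The starting values below are assumed for illustration; only the structure matters.
i = 0.5          # fixed input for one training sample
w_h = 0.8        # hidden-layer weight (to be learned)
w_o = 0.3        # output-layer weight (to be learned)

r_h = w_h * i    # hidden-layer output
r_o = w_o * r_h  # network output, equal to w_o * w_h * i

print(r_o)       # 0.12
```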

We know that during training, \(E = {(t - r_o)^2\over 2}\). Substituting the full value of \(r_o\), we get:
$$
E = {(t - (w_o.(w_h.i)))^2\over 2} \text { ...eq.(1.2)}
$$
This is the representative cost function of the NN. Our goal is to reduce \(E\) so that, in an ideal case, it becomes 0 (this ideal case is extremely rare when training NNs). The way we will reduce \(E\) is by changing (increasing or decreasing) the values of \(w_o\) and \(w_h\). Hence, the main thing we need to find out is how \(E\) changes (increases/decreases) as \(w_o\) and \(w_h\) are increased (or decreased). This 'rate of change' of \(E\) is given by the partial derivatives of \(E\) w.r.t. \(w_o\) and \(w_h\). In other words, we need to find \({\partial E\over \partial w_{o}}\) and \({\partial E\over \partial w_{h}}\).
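Continuing the illustration, the representative cost for one sample follows directly from eq.(1.2). The target \(t\) below is again an assumed value used only for this sketch.

```python
# Cost for one training sample, from eq.(1.2): E = (t - w_o*w_h*i)^2 / 2
i, t = 0.5, 1.0          # assumed input and target output
w_h, w_o = 0.8, 0.3      # assumed weights

E = (t - (w_o * (w_h * i))) ** 2 / 2
print(E)                 # 0.3872, i.e. (1.0 - 0.12)^2 / 2
```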

Let us start by finding \({\partial E\over \partial w_{o}}\).
$$
\begin{align}
E &= {(t - (w_o.(w_h.i)))^2\over 2} \text { (from eq.(1.2))}\\
{\partial E\over \partial w_{o}}&={\partial {(t - (w_o.(w_h.i)))^2\over 2}\over \partial w_o}\\
&={1 \over 2}.{\partial {(t - (w_o.w_h.i))^2}\over \partial w_{o}} \text{ ... eq.(2.1)}\\
\end{align}
$$

Let us revisit the chain rule:
$$
{dx \over dy}= {dx \over dz}.{dz \over dy}
$$
For a partial derivative:
$$
{\partial x \over \partial y}={\partial x\over \partial z}.{\partial z\over \partial y}
$$
Applying the chain rule to the right-hand term of eq.(2.1):
$$
\text{Here, } z = {t - (w_o.w_h.i)},\ x=z^2, \text{ and } y=w_o\text{; therefore,}\\
\begin{align}
{1 \over 2}.{\partial {(t - (w_o.w_h.i))^2}\over \partial w_{o}}&={1 \over 2}.{\partial x\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - (w_o.w_h.i))^2\over \partial (t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{o}}\\
&={1 \over 2}.{2(t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{o}}\\
&={(t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{o}} \text{ ... eq.(2.2)}\\
&={(t - (w_o.w_h.i))}.({\partial t \over \partial w_{o}} - {\partial (w_o.w_h.i) \over \partial w_{o}})\\
&={(t - (w_o.w_h.i))}.({0} - w_h.i.{\partial w_o \over \partial w_{o}})\\
&={(t - (w_o.w_h.i))}.(-w_h.i) \\
&=-{(t - (w_o.w_h.i))}.(w_h.i) \text{ ... eq.(2.3)}\\
\end{align}
$$
Therefore from eq. (2.3),
$$
\begin{align}
{\partial E\over \partial w_{o}}&=-{(t - (w_o.w_h.i))}.(w_h.i)\text { ... eq.(2.4)}\\
\end{align}
$$
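Eq.(2.4) can be sanity-checked numerically by comparing the analytic gradient against a finite-difference estimate of the derivative. The snippet below is a small self-contained sketch; the values of \(i\), \(t\), and the weights are arbitrary assumptions.

```python
# Verify eq.(2.4): dE/dw_o = -(t - w_o*w_h*i) * (w_h*i)
i, t = 0.5, 1.0          # assumed input and target
w_h, w_o = 0.8, 0.3      # assumed weights

def cost(w_o, w_h):
    return (t - w_o * w_h * i) ** 2 / 2   # eq.(1.2)

analytic = -(t - w_o * w_h * i) * (w_h * i)
eps = 1e-6
numeric = (cost(w_o + eps, w_h) - cost(w_o - eps, w_h)) / (2 * eps)
print(analytic, numeric)  # both approximately -0.352
```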
Let us now find \({\partial E\over \partial w_{h}}\).
$$
\begin{align}
E &= {(t - (w_o.(w_h.i)))^2\over 2} \text { (from eq.(1.2))}\\
{\partial E\over \partial w_{h}}&={\partial {(t - (w_o.(w_h.i)))^2\over 2}\over \partial w_h}\\
&={1 \over 2}.{\partial {(t - (w_o.w_h.i))^2}\over \partial w_{h}} \text{ ... eq.(3.1)}\\
\end{align}
$$
From the chain rule we know,
$$
{dx \over dy}= {dx \over dz}.{dz \over dy}
$$
For a partial derivative:
$$
{\partial x \over \partial y}={\partial x\over \partial z}.{\partial z\over \partial y}
$$
Applying the chain rule to the right-hand term of eq.(3.1):
$$
\text{Here, } z = {t - (w_o.w_h.i)},\ x=z^2, \text{ and } y=w_h\text{; therefore,}\\
\begin{align}
{1 \over 2}.{\partial {(t - (w_o.w_h.i))^2}\over \partial w_{h}}&={1 \over 2}.{\partial x\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - (w_o.w_h.i))^2\over \partial (t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{h}}\\
&={1 \over 2}.{2(t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{h}}\\
&={(t - (w_o.w_h.i))}.{\partial (t - (w_o.w_h.i))\over \partial w_{h}} \text{ ... eq.(3.2)}\\
&={(t - (w_o.w_h.i))}.({\partial t \over \partial w_{h}} - {\partial (w_o.w_h.i) \over \partial w_{h}})\\
&={(t - (w_o.w_h.i))}.({0} - w_o.i.{\partial w_h \over \partial w_{h}})\\
&={(t - (w_o.w_h.i))}.(-w_o.i) \\
&=-{(t - (w_o.w_h.i))}.(w_o.i) \text{ ... eq.(3.3)}\\
\end{align}
$$
Therefore from eq. (3.3),
$$
\begin{align}
{\partial E\over \partial w_{h}}&=-{(t - (w_o.w_h.i))}.(w_o.i)\text { ... eq.(3.4)}\\
\end{align}
$$
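Eq.(3.4) can be checked the same way; again, the numbers are illustrative assumptions.

```python
# Verify eq.(3.4): dE/dw_h = -(t - w_o*w_h*i) * (w_o*i)
i, t = 0.5, 1.0          # assumed input and target
w_h, w_o = 0.8, 0.3      # assumed weights

def cost(w_o, w_h):
    return (t - w_o * w_h * i) ** 2 / 2   # eq.(1.2)

analytic = -(t - w_o * w_h * i) * (w_o * i)
eps = 1e-6
numeric = (cost(w_o, w_h + eps) - cost(w_o, w_h - eps)) / (2 * eps)
print(analytic, numeric)  # both approximately -0.132
```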

Once \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\) are determined, we know whether increasing (or decreasing) \(w_o\) and \(w_h\) will reduce \(E\).

The weight for the output layer (\(w_o\)) will be updated as:
$$
\begin{align}
\Delta w_o&=-\eta.{\partial E\over \partial w_o}\\
&=-\eta.(-{(t - (w_o.w_h.i))}.(w_h.i)) \text{ (from eq.(2.4))}\\
&=\eta.(t - (w_o.w_h.i)).(w_h.i)\\
\Rightarrow w_o &= w_o + \Delta w_o
\end{align}
$$
The weight for the hidden layer (\(w_h\)) will be updated as:
$$
\begin{align}
\Delta w_h&=-\eta.{\partial E\over \partial w_h}\\
&=-\eta.(-{(t - (w_o.w_h.i))}.(w_o.i)) \text{ (from eq.(3.4))}\\
&=\eta.(t - (w_o.w_h.i)).(w_o.i)\\
\Rightarrow w_h &= w_h + \Delta w_h
\end{align}
$$
Here \(\eta\) is the learning rate, which controls the magnitude by which the weights are changed. The minus sign before \(\eta\) is very important. It ensures that if \(E\) is increasing (\(\partial E\over \partial w_o\) or \(\partial E\over \partial w_h\) is > 0) with an increase in \(w_o\)/\(w_h\), then the value of \(w_o\)/\(w_h\) is decreased; that is, \(\Delta w_o\) and \(\Delta w_h\) will be < 0.


If \(E\) is decreasing (\(\partial E\over \partial w_o\) or \(\partial E\over \partial w_h\) is < 0) with an increase in \(w_o\)/\(w_h\), then we continue increasing the value of \(w_o\)/\(w_h\); that is, \(\Delta w_o\) and \(\Delta w_h\) will be > 0.
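Putting the two update rules together, a complete training loop for this one-neuron-per-layer network can be sketched as follows. The learning rate, initial weights, and the single \((i, t)\) training pair are assumptions made for illustration.

```python
# Gradient-descent training loop for r_o = w_o * (w_h * i), using the
# update rules above. All starting values are assumed for illustration.
i, t = 0.5, 1.0        # one training sample: input and target
w_h, w_o = 0.8, 0.3    # initial weights
eta = 0.1              # learning rate

for step in range(1000):
    r_o = w_o * (w_h * i)                 # forward pass
    dE_dwo = -(t - r_o) * (w_h * i)       # eq.(2.4)
    dE_dwh = -(t - r_o) * (w_o * i)       # eq.(3.4)
    w_o += -eta * dE_dwo                  # delta w_o = -eta * dE/dw_o
    w_h += -eta * dE_dwh                  # delta w_h = -eta * dE/dw_h

print(w_o * w_h * i)   # close to t = 1.0, so E is close to 0
```

After repeated updates the predicted output approaches the target, which is exactly the behaviour the update rules are designed to produce.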

In this article, we looked at how a simple NN can be represented by a single linear equation and used that representation to derive the backpropagation. In the next article, we will look at a more complex NN which uses the Sigmoid activation function.

