Part-I: Neural Net Backpropagation using representative equation transformation
Backpropagation in Neural Networks (NNs) is a critical step in adjusting the weights of each layer, making the NN model more accurate on the provided training data.
In previous articles, we derived the backpropagation for a Neural Network using the architecture of the network. In this series of articles, we convert the neural network architecture into a representative linear equation and use it to derive the backpropagation.
As in other articles, we start with a very simple NN and proceed to more complex NNs in subsequent articles. Consider the NN below with 1 hidden layer and only a single neuron in each layer.
We have derived the backpropagation of this NN without using the equation form here.
The output of this NN is $r_o$, and the prediction error for each training sample, $E$, is defined as:

$$E = \frac{(t - r_o)^2}{2} \qquad \text{...eq.(1.1)}$$
where $t$ is the given (expected) output for that specific training example. Our goal is to minimize $E$ by making $r_o$ as close to $t$ as possible.
In the output layer, $r_o$ is given as $r_o = w_o \cdot r_h$. Substituting the value of $r_h$, we can write $r_o = w_o \cdot (w_h \cdot i)$. Hence the linear equation:

$$r_o = w_o \cdot (w_h \cdot i)$$
is the linear representative equation of the NN shown above. For any input $i$, we will get its corresponding output $r_o$. Since $i$ is fixed, the only values we can modify when training this equation are $w_o$ and $w_h$. If we adopt a brute-force approach and arbitrarily modify $w_o$ and $w_h$, we are not guaranteed to arrive at optimized values of $w_o$ and $w_h$. There is a method to this madness, and that method is backpropagation.
We know that during training, $E = \frac{(t - r_o)^2}{2}$. Substituting the full expression for $r_o$, we get:

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \qquad \text{...eq.(1.2)}$$
This is the representative cost function of the NN. Our goal is to reduce $E$ such that in an ideal case it becomes 0 (this ideal case is extremely rare when training NNs). The way we will reduce $E$ is by changing (increasing or decreasing) the values of $w_o$ and $w_h$. Hence, the main thing we need to find out is how $E$ changes (increases/decreases) as $w_o$ and $w_h$ are increased (or decreased). This 'rate of change' of $E$ is determined by the partial derivatives of $E$ w.r.t. $w_o$ and $w_h$. In other words, we need to find $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$.
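Before diving into the derivatives, here is a minimal Python sketch of the representative equation and cost function above. The variable names (`i`, `t`, `w_h`, `w_o`) and the sample values are illustrative choices, not part of the derivation:

```python
def forward(i, w_h, w_o):
    """Forward pass of the 1-hidden-layer, single-neuron network: r_o = w_o * (w_h * i)."""
    r_h = w_h * i      # hidden layer output
    r_o = w_o * r_h    # output layer output
    return r_o

def error(t, r_o):
    """Prediction error for one sample, eq.(1.1): E = (t - r_o)^2 / 2."""
    return (t - r_o) ** 2 / 2

# Example: input i = 2, expected output t = 1, arbitrary starting weights
i, t = 2.0, 1.0
w_h, w_o = 0.5, 0.3
r_o = forward(i, w_h, w_o)
print(r_o, error(t, r_o))   # r_o = 0.3, E = 0.245
```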
Let us start by finding $\frac{\partial E}{\partial w_o}$.

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \quad \text{(from eq.(1.2))}$$

$$\frac{\partial E}{\partial w_o} = \frac{\partial}{\partial w_o}\left[\frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_o} \qquad \text{...eq.(2.1)}$$
Let us revisit the chain rule:
$$\frac{dx}{dy} = \frac{dx}{dz} \cdot \frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y}$$
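As a quick worked example of this rule (with illustrative constants $a$ and $b$, not part of the NN), differentiating a squared term with respect to a variable that appears inside it gives:

$$\frac{\partial (a - b \cdot y)^2}{\partial y} = \frac{\partial (a - b \cdot y)^2}{\partial (a - b \cdot y)} \cdot \frac{\partial (a - b \cdot y)}{\partial y} = 2(a - b \cdot y) \cdot (-b)$$

This is exactly the pattern we follow below, with $z = t - (w_o \cdot w_h \cdot i)$ playing the role of the inner term.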
Applying the chain rule to the right term of eq.(2.1):
Here, $z = t - (w_o \cdot w_h \cdot i)$, $x = z^2$, and $y = w_o$, therefore,

$$\frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_o} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y}$$

$$= \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial (t - (w_o \cdot w_h \cdot i))} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o}$$

$$= \frac{1}{2} \cdot 2\,(t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_o} \qquad \text{...eq.(2.2)}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(\frac{\partial t}{\partial w_o} - \frac{\partial (w_o \cdot w_h \cdot i)}{\partial w_o}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(0 - w_h \cdot i \cdot \frac{\partial w_o}{\partial w_o}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot (-w_h \cdot i)$$

$$= -(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i) \qquad \text{...eq.(2.3)}$$
Therefore from eq. (2.3),
$$\frac{\partial E}{\partial w_o} = -(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i) \qquad \text{...eq.(2.4)}$$
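As a sanity check on eq.(2.4), the sketch below (an illustrative check, not part of the original derivation) compares the analytical gradient with a finite-difference approximation of $\frac{\partial E}{\partial w_o}$ computed directly from eq.(1.2):

```python
def grad_wo(i, t, w_h, w_o):
    """Analytical gradient, eq.(2.4): dE/dw_o = -(t - w_o*w_h*i) * (w_h*i)."""
    return -(t - w_o * w_h * i) * (w_h * i)

def numerical_grad_wo(i, t, w_h, w_o, eps=1e-6):
    """Central finite-difference approximation of dE/dw_o using the cost of eq.(1.2)."""
    E = lambda wo: (t - wo * w_h * i) ** 2 / 2
    return (E(w_o + eps) - E(w_o - eps)) / (2 * eps)

i, t, w_h, w_o = 2.0, 1.0, 0.5, 0.3
print(grad_wo(i, t, w_h, w_o))            # -0.7
print(numerical_grad_wo(i, t, w_h, w_o))  # approximately -0.7
```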
Let us now find $\frac{\partial E}{\partial w_h}$.

$$E = \frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2} \quad \text{(from eq.(1.2))}$$

$$\frac{\partial E}{\partial w_h} = \frac{\partial}{\partial w_h}\left[\frac{(t - (w_o \cdot (w_h \cdot i)))^2}{2}\right] = \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_h} \qquad \text{...eq.(3.1)}$$
From the chain rule we know,
$$\frac{dx}{dy} = \frac{dx}{dz} \cdot \frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y}$$
Applying the chain rule to the right term of eq.(3.1):
Here, $z = t - (w_o \cdot w_h \cdot i)$, $x = z^2$, and $y = w_h$, therefore,

$$\frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial w_h} = \frac{1}{2} \cdot \frac{\partial x}{\partial y} = \frac{1}{2} \cdot \frac{\partial x}{\partial z} \cdot \frac{\partial z}{\partial y} = \frac{1}{2} \cdot \frac{\partial z^2}{\partial z} \cdot \frac{\partial z}{\partial y}$$

$$= \frac{1}{2} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))^2}{\partial (t - (w_o \cdot w_h \cdot i))} \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h}$$

$$= \frac{1}{2} \cdot 2\,(t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \frac{\partial (t - (w_o \cdot w_h \cdot i))}{\partial w_h} \qquad \text{...eq.(3.2)}$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(\frac{\partial t}{\partial w_h} - \frac{\partial (w_o \cdot w_h \cdot i)}{\partial w_h}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot \left(0 - w_o \cdot i \cdot \frac{\partial w_h}{\partial w_h}\right)$$

$$= (t - (w_o \cdot w_h \cdot i)) \cdot (-w_o \cdot i)$$

$$= -(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i) \qquad \text{...eq.(3.3)}$$
Therefore from eq. (3.3),
$$\frac{\partial E}{\partial w_h} = -(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i) \qquad \text{...eq.(3.4)}$$
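The same kind of check can be applied to eq.(3.4); compared with eq.(2.4), only the scaling factor changes from $w_h \cdot i$ to $w_o \cdot i$ (again, the variable names and values are illustrative):

```python
def grad_wh(i, t, w_h, w_o):
    """Analytical gradient, eq.(3.4): dE/dw_h = -(t - w_o*w_h*i) * (w_o*i)."""
    return -(t - w_o * w_h * i) * (w_o * i)

i, t, w_h, w_o = 2.0, 1.0, 0.5, 0.3
print(grad_wh(i, t, w_h, w_o))  # -(1 - 0.3) * (0.3 * 2) = -0.42
```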
Once $\frac{\partial E}{\partial w_o}$ and $\frac{\partial E}{\partial w_h}$ are determined, we know whether increasing (or decreasing) $w_o$ and $w_h$ will reduce $E$.
The weight of the output layer ($w_o$) will be updated as:
$$\Delta w_o = -\eta \cdot \frac{\partial E}{\partial w_o} = -\eta \cdot \left[-(t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i)\right] \quad \text{(from eq.(2.4))}$$

$$= \eta \cdot (t - (w_o \cdot w_h \cdot i)) \cdot (w_h \cdot i)$$

$$\Rightarrow w_o = w_o + \Delta w_o$$
The weight of the hidden layer ($w_h$) will be updated as:
$$\Delta w_h = -\eta \cdot \frac{\partial E}{\partial w_h} = -\eta \cdot \left[-(t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i)\right] \quad \text{(from eq.(3.4))}$$

$$= \eta \cdot (t - (w_o \cdot w_h \cdot i)) \cdot (w_o \cdot i)$$

$$\Rightarrow w_h = w_h + \Delta w_h$$
Here $\eta$ is the learning rate and controls the magnitude by which the weights are changed. The minus sign before $\eta$ is very important. It ensures that if $E$ is increasing with an increase in $w_o$/$w_h$ (i.e. $\frac{\partial E}{\partial w_o}$ or $\frac{\partial E}{\partial w_h}$ is $> 0$), then the value of $w_o$/$w_h$ needs to be decreased, and therefore $\Delta w_o$ and $\Delta w_h$ will be $< 0$.
If $E$ is decreasing with an increase in $w_o$/$w_h$ (i.e. $\frac{\partial E}{\partial w_o}$ or $\frac{\partial E}{\partial w_h}$ is negative), then we continue increasing the value of $w_o$/$w_h$, and therefore $\Delta w_o$ and $\Delta w_h$ will be $> 0$.
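Putting the update rules together, here is a minimal gradient-descent training loop for this network, a sketch under the assumptions used so far (scalar input, no bias, no activation function); the learning rate of 0.1 and the step count are arbitrary illustrative choices:

```python
def train(i, t, w_h, w_o, eta=0.1, steps=50):
    """Repeatedly apply the weight updates w = w + delta_w derived above."""
    for _ in range(steps):
        r_o = w_o * (w_h * i)                   # forward pass: representative equation
        delta_wo = eta * (t - r_o) * (w_h * i)  # delta_w_o = -eta * dE/dw_o, from eq.(2.4)
        delta_wh = eta * (t - r_o) * (w_o * i)  # delta_w_h = -eta * dE/dw_h, from eq.(3.4)
        w_o += delta_wo                         # w_o = w_o + delta_w_o
        w_h += delta_wh                         # w_h = w_h + delta_w_h
    return w_h, w_o

w_h, w_o = train(i=2.0, t=1.0, w_h=0.5, w_o=0.3)
print(w_o * (w_h * 2.0))   # r_o approaches the target t = 1.0 as training proceeds
```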
In this article, we looked at how a simple NN can be represented by a single linear equation and used that equation to derive the backpropagation. In the next article, we will look at a more complex NN that uses the Sigmoid activation function.