Part-II: A gentle introduction to Neural Network Backpropagation


In part-I, we derived the back-propagation formula using a simple neural net architecture that does not use an activation function. In this article, which is a follow on to part-I we look at the same NN architecture, with the Sigmoid activation function added to both hidden and output layers.

Note that since we have only 1 weight we can use standard derivatives and not partial derivatives. However, to keep the nomenclature standard as other texts on back-propagation we will use partial derivatives.

This article attempts to explains back-propagation in an easier fashion with a simple NN and explanations at each step. After reading this text readers are encouraged to understand the more complex derivations given in Mitchell's book and elsewhere, to fully grasp the concept of back-propagation. You do need a basic understanding of partial derivatives and one of the best explanations are available in videos by Sal Khan at Khan academy.

The neural network which has 1 input, 1 hidden and 1 output unit(neuron) along with sigmoid activation function at hidden and output layer is given as:



Here, \(r_h=\sigma(w_h.i)\) and \(r_o=\sigma(w_o.r_h)\).

If t is given(expected) output in the training data, the Error E is defined as:

$$
\begin{align}
E &= {(t - r_o)^2\over 2}\\
{\partial E\over \partial w_{o}}&={\partial {(t - r_o)^2\over 2}\over \partial w_o}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}} \text{ ... eq.(1.1)}
\end{align}
$$


Let's revisit the chain rule:
$$
{dx \over dy}= {dx \over dz}.{dz \over dy}
$$
For a partial derivative form:
$$
{\partial x \over \partial y}={\partial x\over \partial z}.{\partial z\over \partial y}
$$
Applying the chain rule to the right side of eq.(1.1),
$$
\text{Here, } z = {t - r_o},x=z^2, \text{ and } y=w_o \text{ ,therefore, }\\
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}&={1 \over 2}.{\partial {x}\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - r_o)^2\over \partial (t - r_o)}.{\partial (t - r_o)\over \partial w_o}\\
&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{o}} \text{ ... eq.(1.2)}\\
\end{align}
$$
Further,
$$
\begin{align}
{\partial (t - r_o)\over \partial w_o}&={\partial (t - \sigma(w_o.r_h))\over \partial w_o}\\
&={\partial t\over \partial w_o} - {\partial \sigma(w_o.r_h)\over \partial w_o}\\
&=0 - {\partial \sigma(w_o.r_h)\over \partial w_o} \text{ (as,} {\partial C \over \partial x}=0, \text {where C is a constant)}, \text{ ... eq.(1.3)}\\
\end{align}
$$
Let's solve for \({\partial \sigma(w_o.r_h)\over \partial w_o}\),
Applying the chain rule again,
$$
\text{Here, } z = {w_o.r_h},x=\sigma (z), \text{ and } y=w_o \text{ ,therefore, }\\
\begin{align}
{\partial \sigma(w_o.r_h)\over \partial w_o}&={\partial x \over \partial y}\\
&={\partial x\over \partial z}.{\partial z\over \partial y}\\
&={\partial \sigma(z)\over \partial z}.{\partial z\over \partial y} \\
&={\partial \sigma (w_o.r_h)\over \partial (w_o.r_h)}.{\partial (w_o.r_h)\over \partial w_o} \text{
... eq.(1.4)}\\
\end{align}
$$
The sigmoid function is \({\sigma(x)} = {1 \over {1+e^{-x}}}\), with its derivative being:
$$
{d\sigma(x) \over dx} = {\sigma(x).(1 - \sigma(x))}
$$
Applying the sigmoid differentiation rule to eq.(1.4),
$$
\begin{align}
{\partial \sigma(w_o.r_h)\over \partial w_o}&={\partial \sigma (w_o.r_h)\over \partial (w_o.r_h)}.{\partial (w_o.r_h)\over \partial w_o}\\
&={\sigma (w_o.r_h).(1- \sigma (w_o.r_h)}).{\partial (w_o.r_h)\over \partial w_o}\\
&={\sigma (w_o.r_h).(1- \sigma (w_o.r_h)}).r_h.{\partial (w_o)\over \partial w_o}\\
&={\sigma (w_o.r_h).(1- \sigma (w_o.r_h)}).{r_h}\\
\end{align}
$$
Therefore in eq.(1.3),
$$
\begin{align}
{\partial (t - r_o)\over \partial w_o}&=0 - {\partial \sigma(w_o.r_h)\over \partial w_o} \\
&=0 - {\sigma (w_o.r_h).(1- \sigma (w_o.r_h)}).{r_h}
\end{align}
$$
And in eq.(1.2),
$$
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{o}}\\
&=(t - r_o).(0 - {\sigma (w_o.r_h).(1 - \sigma (w_o.r_h)}).{r_h})
\end{align}
$$

Using the above derived value of \({1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}\),
$$
\begin{align}
\partial E\over \partial w_{o}&={\partial {(t - r_o)^2\over 2}\over \partial w_o}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}\\
&=(t - r_o).(0 - {\sigma (w_o.r_h).(1 - \sigma (w_o.r_h)}).{r_h})\\
&=-(t - r_o).{\sigma (w_o.r_h).(1 - \sigma (w_o.r_h)})).{r_h}\\
&=-(t - r_o).{r_o.(1 - r_o)}.{r_h} \text { (as,} r_o = \sigma(w_o.r_h) \text{)} \\
\end{align}
$$
Let us now find how \(E\) changes in relation to \(w_h\) or the rate of change of \(E\) w.r.t. \(w_h\) the weight of the hidden layer. This is also called the derivative of \(E\) w.r.t \(w_h\) or \(\partial E\over \partial w_h\).
$$
\begin{align}
{\partial E\over \partial w_{h}}&={\partial {(t - r_o)^2\over 2}\over \partial w_{h}}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}} \text{ ...eq.(2.1)}\\
\end{align}
$$

Applying the chain rule to the right side of eq.(2.1),
$$
\text{Here, } z = {t - r_o},x=z^2, \text{ and } y=w_h \text{ ,therefore, }\\
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}}&={1 \over 2}.{\partial {x}\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - r_o)^2\over \partial (t - r_o)}.{\partial (t - r_o)\over \partial w_h}\\
&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{h}} \text{ ... eq.(2.2)}\\
\end{align}
$$

Solve for the right side of eq.(2.2),\({\partial (t - r_o)\over \partial w_h}\),
$$
\begin{align}
{\partial (t - r_o)\over \partial w_h}&={\partial (t - \sigma(w_o.r_h))\over \partial w_h}\\
&={\partial t\over \partial w_h} - {\sigma(w_o.r_h) \over \partial w_h}\\
&=0 - {\sigma(w_o.r_h) \over \partial w_h} \text{ ...eq.(2.3)}\
\end{align}
$$
Applying the chain rule to right side of eq.(2.3), \({\sigma(w_o.r_h) \over \partial w_h}\)
$$
\text{Here, } z = {w_o.r_h},x=\sigma(z), \text{ and } y=w_h \text{ ,therefore, }\\
\begin{align}
{\partial \sigma(w_o.r_h)\over \partial w_h}&={\partial x \over \partial y}\\
&={\partial x\over \partial z}.{\partial z\over \partial y}\\
&={\partial \sigma(z)\over \partial z}.{\partial z\over \partial y} \\
&={\partial \sigma (w_o.r_h)\over \partial (w_o.r_h)}.{\partial (w_o.r_h)\over \partial w_h} \text{
... eq.(2.4)}\\
\end{align}
$$
We know,
$$
{d\sigma(x) \over dx} = {\sigma(x).(1 - \sigma(x))}
$$
Applying the sigmoid differentiation rule to eq.(2.4),
$$
\begin{align}
{\partial \sigma(w_o.r_h)\over \partial w_h}&={\partial \sigma (w_o.r_h)\over \partial (w_o.r_h)}.{\partial (w_o.r_h)\over \partial w_h}\\
&={\sigma(w_o.r_h).(1- \sigma (w_o.r_h)}).{\partial (w_o.r_h)\over \partial w_h}\\
&={r_o.(1- r_o}).{\partial (w_o.r_h)\over \partial w_h} \text {(as,} r_o =\sigma(w_o.r_h) \text{)} \text{ ... eq.(2.5)} \\
\end{align}
$$
Solve for right side of eq.(2.5), \({\partial (w_o.r_h)\over \partial w_h}\),
$$
\begin{align}
{\partial (w_o.r_h)\over \partial w_h}&={\partial (w_o.{\sigma(w_h.i)})\over \partial w_h} (\text {as,} r_h = \sigma(w_h.i))\\
&=w_o.{\partial {\sigma(w_h.i)}\over \partial w_h} \text{ ... eq.(2.6)}\\
\end{align}
$$
Applying the chain rule to right side of eq.(2.6), \({\partial {\sigma(w_h.i)}\over \partial w_h}\),
$$
\text{Here, } z = {w_h.i},x=\sigma(z), \text{ and } y=w_h \text{ ,therefore, }\\
\begin{align}
{\partial {\sigma(w_h.i)}\over \partial w_h}&={\partial x \over \partial y}\\
&={\partial x\over \partial z}.{\partial z\over \partial y}\\
&={\partial \sigma(z)\over \partial z}.{\partial z\over \partial y} \\
&={\partial \sigma (w_h.i)\over \partial (w_h.i)}.{\partial (w_h.i)\over \partial w_h} \\
&={\sigma(w_h.i) (1- \sigma(w_h.i)}.(i.{\partial w_h\over \partial w_h}) \text { (as,} {d\sigma(x) \over dx} = {\sigma(x).(1 - \sigma(x))} \text{)}\\
&={\sigma(w_h.i).(1- \sigma(w_h.i))}.i \text{ ... eq.(2.7)} \\
\end{align}
$$
Hence in eq.(2.6),
$$
\begin{align}
{\partial (w_o.r_h)\over \partial w_h}&=w_o.{\partial {\sigma(w_h.i)}\over \partial w_h}\\
&=w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i \text { (from, eq.(2.7))} \text { ...eq.(2.8)}
\end{align}
$$
Applying this value in eq.(2.5),
\begin{align}
{\partial \sigma(w_o.r_h)\over \partial w_h}&={r_o.(1 - r_o}).{\partial (w_o.r_h)\over \partial w_h}\\
&={r_o.(1 - r_o}).w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i \text { (from, eq.(2.8))}\\
\end{align}
Applying this in eq.(2.3),
\begin{align}
{\partial (t - r_o)\over \partial w_h}&=0 - {\sigma(w_o.r_h) \over \partial w_h} \\
&=0 - {r_o.(1 - r_o}).w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i
\end{align}
And in eq.(2.2),
$$
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}}&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{h}}\\
&={1 \over 2}.{2(t - r_o)}.{0 - {r_o.(1 - r_o}).w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i} \text{ (from, eq.(2.3)}
\end{align}
$$

Using the above derived value of \({1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}}\),
$$
\begin{align}
{\partial E\over \partial w_{h}}&={\partial {(t - r_o)^2\over 2}\over \partial w_{h}}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}} \\
&={1 \over 2}.{2(t - r_o)}.{(0 - {r_o.(1 - r_o}).w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i)} \\
&=(t - r_o).{(0 - {r_o.(1 - r_o}).w_o.{\sigma(w_h.i).(1-\sigma(w_h.i))}.i)}\\
&=-(t - r_o).{r_o.(1 - r_o).w_o.r_h.(1-r_h).i} \text{ (as,} r_h=\sigma(w_h.i) \text{)} \\
\end{align}
$$

Both \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\) determine the rate of change of \(E\) w.r.t. \(w_o\) and \(w_h\), respectively.

When the rate of change becomes 0 or close to it, it means we have reached the lowest point in the error space signifying the lowest possible error. The goal is to change the weights in such a way such that we reach this lowest point after multiple iterations.

The rate of change is positive which means if we keep heading in its current direction the rate of change will keep on increasing. To reduce the rate of change, we need to go in the opposite direction. Hence, weights \(w_o\) and \(w_h\) have to be modified where the sign of change (increase or decrease in value of the weights) is determined by \({-\partial E\over \partial w_o}\) and \({-\partial E\over \partial w_h}\) for \(w_o\) and \(w_h\), respectively. Note the negative sign which allows weights to be correctly modified to reduce the rate of change.

Based on this idea, the weight update rule is given as:

The weights for output (\(w_o\)) layer will be updated as:
$$
\begin{align}
\Delta w_o&=-\eta.{\partial E\over \partial w_o}\\
&=-\eta.-(t - r_o).{r_o.(1 - r_o)}.{r_h}\\
&=\eta.(t - r_o).{r_o.(1 - r_o)}.{r_h}\\
\Rightarrow w_o &= w_o + \Delta w_o
\end{align}
$$
The weights for hidden (\(w_h\)) layer will be updated as:
$$
\begin{align}
\Delta w_h&=-\eta.{\partial E\over \partial w_h}\\
&=-\eta.-(t - r_o).{r_o.(1 - r_o).w_o.r_h.(1-r_h).i}\\
&=\eta.(t - r_o).{r_o.(1 - r_o).w_o.r_h.(1-r_h).i}\\
\Rightarrow w_h &= w_h + \Delta w_h
\end{align}
$$
Here \(\eta\) is the learning rate and controls the numeric value by which weights are changed.

The weights can be updated for each training sample (stochastic gradient descent) or after multiple training samples (batch gradient descent) or after exhausting the training data (standard/batch gradient descent).

An important thing to note here is that we do not have to update the weights on actual numeric values of \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\), since the only information we need is their \(\pm\)sign to determine if weights have to be increased or decreased. However, for simplicity the original back-propagation uses their values directly, with \(\eta\) controlling their impact on weight update.

In later versions of back-propagation (like RProp), weights are not updated on numeric values of \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\), but their \(\pm\) sign is used.

This concludes part-II of the esculent introduction to NN backpropagation. In part-III, we will derive backpropagation for a NN with 2 hidden layers and come up with a generic backpropagation equation for deep neural networks.

Love or hate this article ? Have other ideas ? Please leave a comment below !

Comments

Popular posts from this blog

Part III: Backpropagation mechanics for a Convolutional Neural Network

Introducing Convolution Neural Networks with a simple architecture

Deriving Pythagoras' theorem using Machine Learning