Part-I: A gentle introduction to Neural Network Backpropagation




Back-propagation is one of the fundamental concepts in neural networks (NN). The equations of back-propagation are derived using mathematics, specifically partial derivatives. Many good sources that explain the concept of back-propagation,including one of my favorites books from Tom Mitchell, use a neutral network architecture with multiple neurons that throws a curve ball that is hard to navigate, specially for non-experts.

In this article, we use a simple neural network that has only one neuron/unit in the input, hidden and output layers and then derive the back-propagation formula using standard partial derivatives. To further simplify the explanation, the article is broken into two parts such that the NN in part-I does not use any activation function while in part-II we use the sigmoid activation function.

Note that since we have only 1 weight we can use standard derivatives and not partial derivatives. However, to keep the nomenclature standard as other texts on back-propagation we will use partial derivatives.

This article attempts to explains back-propagation in an easier fashion with a simple NN and explanations at each step. After reading this text readers are encouraged to understand the more complex derivations given in Mitchell's book and elsewhere, to fully grasp the concept of back-propagation. You do need a basic understanding of partial derivatives and one of the best explanations are available in videos by Sal Khan at Khan academy.

Our neural network which does not have any activation function, has 1 input. 1 hidden and 1 output unit(neuron) is given as:



The Error \(E\) is defined as \(E = {(t - r_o)^2\over 2}\), where t is given(expected) output in the training data. This error function is a modification of the Mean Square Error (MSE) error function, where the additional \(1\over 2\) is make the mathematic easier, as we will see later. Note that we can choose another error function, but this one is simple making it suited for our example.
We need find out change in \(E\) in relation to the changes in values of weights \(w_o\) and \(w_h\).

Let us first find out how \(E\) changes in relation to \(w_o\) or the rate of change of \(E\) w.r.t. \(w_o\). This is also called the derivative of \(E\) w.r.t \(w_o\) or \(\partial E\over \partial w_o\)

$$
\begin{align}
E &= {(t - r_o)^2\over 2}\\
{\partial E\over \partial w_{o}}&={\partial {(t - r_o)^2\over 2}\over \partial w_o}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}} \text{ ... eq.(1.1)}\\
\end{align}
$$


Let us now revisit the chain rule:
$$
{dx \over dy}= {dx \over dz}.{dz \over dy}
$$
For a partial derivative:
$$
{\partial x \over \partial y}={\partial x\over \partial z}.{\partial z\over \partial y}
$$
Applying the chain rule to right term of eq.(1.1):
$$
\text{Here, } z = {t - r_o},x=z^2, \text{ and } y=w_o \text{ ,therefore, }\\
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}&={1 \over 2}.{\partial x\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - r_o)^2\over \partial (t - r_o)}.{\partial (t - r_o)\over \partial w_{o}}\\
&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{o}} \text{ ... eq.(1.2)}\\
\end{align}
$$
Further,
$$
\begin{align}
{\partial (t - r_o)\over \partial w_o}&={\partial (t - w_o.r_h)\over \partial w_o}\\
&={\partial t\over \partial w_o} - {\partial w_o.r_h\over \partial w_o}\\
&={\partial t\over \partial w_o} - r_h.{\partial w_o\over \partial w_o}\\
&=0 - r_h \text{ (as,} {\partial C \over \partial x}=0 \text { and } {\partial x \over \partial x}=1, \text {where C is a constant, like t)},\\
\end{align}
$$
Based on the above, in eq.(1.2),
$$
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}}&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{o}}\\
&={1 \over 2}.{2(t - r_o)}.{(0 - r_h)}\\
&=(t - r_o).(-r_h)
\end{align}
$$


Therefore,
$$
\begin{align}
\partial E\over \partial w_{o}&={\partial {(t - r_o)^2\over 2}\over \partial w_o}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{o}} \\
&=(t - r_o).(-r_h)\\
&=(t - r_o).(-w_h.i)\\
\end{align}
$$
Let us now find how \(E\) changes in relation to \(w_h\) or the rate of change of \(E\) w.r.t. \(w_h\) the weight of the hidden layer. This is also called the derivative of \(E\) w.r.t \(w_h\) or \(\partial E\over \partial w_h\).
$$
\begin{align}
{\partial E\over \partial w_{h}}&={\partial {(t - r_o)^2\over 2}\over \partial w_{h}}\\
&={1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}} \text{ ... eq.(2.1)}\\
\end{align}
$$

Applying the chain rule to right term of eq.(2.1):
$$
\text{Here, } z = {t - r_o},x=z^2, \text{ and } y=w_h \text{ ,therefore, }\\
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}}&={1 \over 2}.{\partial x\over \partial y}\\
&={1 \over 2}.{\partial x\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial z^2\over \partial z}.{\partial z\over \partial y}\\
&={1 \over 2}.{\partial (t - r_o)^2\over \partial (t - r_o)}.{\partial (t - r_o)\over \partial w_{h}}\\
&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{h}} \text{ ... eq.(2.2)}\\
\end{align}
$$
Solve for \({\partial (t - r_o)\over \partial w_h}\):
$$
\begin{align}
{\partial (t - r_o)\over \partial w_h}&={\partial (t - w_o.r_h)\over \partial w_h}\\
\text {As,} r_h = w_h.i,\\
{\partial (t - r_o)\over \partial w_h}&={\partial (t - w_o.w_h.i)\over \partial w_h}\\
&={\partial t\over \partial w_h} - {\partial w_o.w_h.i\over \partial w_h}\\
&={\partial t\over \partial w_h} - {w_o.i}.{\partial w_h\over \partial w_h}\\
&=0 - {w_o.i}\text{ (as,} {\partial C \over \partial x}=0 \text { and } {\partial x \over \partial x}=1, \text {where C is a constant)},\\
\end{align}
$$
Applying the computed value of \({\partial (t - r_o)\over \partial w_h}\) in eq.(2.2),
$$
\begin{align}
{1 \over 2}.{\partial {(t - r_o)^2}\over \partial w_{h}}&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{h}}\\
&={1 \over 2}.{2(t - r_o)}.{(0 - w_o.i)}\\
&=(t - r_o).(-w_o.i)
\end{align}
$$

Therefore,
$$
\begin{align}
{\partial E\over \partial w_h}&={1 \over 2}.{2(t - r_o)}.{\partial (t - r_o)\over \partial w_{h}} \\
&={(t - r_o)}.{(-w_o.i)}
\end{align}
$$

Both \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\) determine the rate of change of \(E\) w.r.t. \(w_o\) and \(w_h\), respectively.

When the rate of change becomes 0 or close to it, it means we have reached the lowest point in the error space signifying the lowest possible error. The goal is to change the weights in such a way such that we reach this lowest point after multiple iterations.

The rate of change is positive which means if we keep heading in its current direction the rate of change will keep on increasing. To reduce the rate of change, we need to go in the opposite direction. Hence, weights \(w_o\) and \(w_h\) have to be modified where the sign of change (increase or decrease in value of the weights) is determined by \({-\partial E\over \partial w_o}\) and \({-\partial E\over \partial w_h}\) for \(w_o\) and \(w_h\), respectively. Note the negative sign which allows weights to be correctly modified to reduce the rate of change.

Based on this idea, the weight update rule is given as:

The weights for output layer (\(w_o\)) will be updated as:
$$
\begin{align}
\Delta w_o&=-\eta.{\partial E\over \partial w_o}\\
&=-\eta.(t - r_o).(-w_h.i)\\
&=\eta.(t - r_o).(w_h.i)\\
\Rightarrow w_o &= w_o + \Delta w_o
\end{align}
$$
The weights for hidden layer (\(w_h\)) will be updated as:
$$
\begin{align}
\Delta w_h&=-\eta.{\partial E\over \partial w_h}\\
&=-\eta.(t - r_o).(-w_o.i)\\
&=\eta.(t - r_o).(w_o.i)\\
\Rightarrow w_h &= w_h + \Delta w_h
\end{align}
$$
Here \(\eta\) is the learning rate and controls the numeric value by which weights are changed.

The weights can be updated for each training sample (stochastic gradient descent) or after multiple training samples (batch gradient descent) or after exhausting the training data (standard/batch gradient descent).

An important thing to note here is that we do not have to update the weights on actual numeric values of \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\), since the only information we need is their \(\pm\)sign to determine if weights have to be increased or decreased by some constant amount. However, for simplicity the original back-propagation uses their values directly, with \(\eta\) controlling their impact on weight update.

In later versions of back-propagation (like RProp), weights are not updated on numeric values of \({\partial E\over \partial w_o}\) and \({\partial E\over \partial w_h}\), but their \(\pm\) sign is used.

In part-II we will derive the backpropagation for this same NN with the sigmoid activation function added for both hidden and output layers.

Love or hate this article ? Have other ideas ? Please leave a comment below !

Comments

Post a Comment

Popular posts from this blog

Deriving Pythagoras' theorem using Machine Learning

Part III: Backpropagation mechanics for a Convolutional Neural Network

Introducing Convolution Neural Networks with a simple architecture