Introducing Convolution Neural Networks with a simple architecture


Convolution Neural Networks (CNNs) are different from standard Neural Networks (NNs) as their input is 2-dimensional vs. 1-dimensional in a standard neural network. While the underlying principles between CNNs and NNs are same, CNNs do introduce some new concepts. In this article, we take a look at how CNNs operate using a very simple CNN architecture, given below.


In this simple CNN, there is one 4x4 input matrix, one 2x2 filter matrix (also known as kernel), a single convolution layer with 1 unit, 1 Rectified Linear Unit (ReLu) layer with one unit, a single pooling layer and a single fully connected (FC) layer. The elements of the filter matrix are equivalent to the unit weights in a standard NN and are updated during the backpropagation phase.

Real-life CNNs are significantly more complex than this with several repeating layers of convolutional, ReLu and pooling layers forming a true deep learning network. All the matrix dimensions are also magnitudes higher to represent high-resolution images. However for purpose of understanding underlying principles, a simple CNN like this works much better.

In image recognition problems for which CNNs are generally used for, the 4x4 input represents pixels of an image and in general there are 3 NxN input matrices each signifying the Red, Green and Blue (RGB) pixel values of a given image. In this case, we can assume that our simple 4x4 image is in grayscale, which allows it to be defined as a single 4x4 matrix.

The 2x2 filter is used to do the actual convolution operation on the input. The filter is 'slided' over the input matrix and a convolution operation is performed over the input.

A convolution operation between two matrices is defined as the sum of all elements of a Hadamard product between the original matrix and the Kernel matrix rotated by 180°, as represented here:


Here we perform a convolution over matrix I using filter matrix F. By rotating F 180°, a Hadamard product (represented by o) of the two matrices is derived. The Hadamard product produces a new matrix where an element i,j of the new matrix is a dot product of element i,j of matrices I and F. Then all elements of the Hadamard product are added to result in a single scalar value as shown below,

$$
\begin{align}
\begin{pmatrix}
i_{00} & i_{10}\\
i_{01} & i_{11}\\
\end{pmatrix}
*
\begin{pmatrix}
w_{00} & w_{10}\\
w_{01} & w_{11}\\
\end{pmatrix}
&=
\begin{pmatrix}
i_{00} & i_{10}\\
i_{01} & i_{11}\\
\end{pmatrix}
o
\begin{pmatrix}
w_{11} & w_{01}\\
w_{10} & w_{00}\\
\end{pmatrix}\\
&=
\begin{pmatrix}
i_{00}.w_{11} & i_{10}.w_{01}\\
i_{01}.w_{10} & i_{11}.w_{00}\\
\end{pmatrix}
\text{ (Hadamard matrix)}\\
&=i_{00}.w_{11} + i_{10}.w_{01}+i_{01}.w_{10} + i_{11}.w_{00}\\
\end{align}
$$

In a CNN, the convolution operation is performed each time the filter is slided over the input matrix, resulting in a new matrix. The size of the new matrix is determined by certain variable like the size of the input matrix and how many pixels(cells) are skipped each time we slide the kernel. The standard formula to determine the size of the new matrix is:
$$
N={{(W - F)} \over S} + 1
$$
Here,
N: is the dimension (rows and columns) of the new square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the kernel square matrix
S: is the stride, or the number of cells skipped each time the kernel is slided

Note that N has to be an integer value, hence the values of W, F and S have to be chosen wisely. Sometimes padding values are added around the input matrix, to allow N to be an integer if the values of W, F and S do not allow otherwise. The yellow cells are padding of 1 element around the original input matrix.

If padding is added, the dimensions of the new matrix is given by:

$$
N={(W - F + 2.P)\over S} + 1
$$
Here,
N: is the dimension (rows and columns) of the new square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the kernel square matrix
P: is the number of pixels(cells) of padding added
S: is the stride, or the number of cells skipped each time the kernel is slided

Consider our simple CNN,

Here, the input is a 4x4 matrix that is convolved with a filter of size 2x2 with no padding using a stride of 2 pixels. Let us find the dimensions of the output matrix based on the formula above:

$$
N={(W - F + 2.P)\over S} + 1
$$
Here, W=4,F=2,P=0,S=2, therefore
$$
\begin{align}
N&={(W - F + 2.P)\over S} + 1\\
&={(4 - 2 + 2.0)\over 2} + 1\\
&={(2 + 0)\over 2} + 1\\
&=2
\end{align}
$$
The output matrix is of dimension 2x2 represented by the convolution layer, also called the filter map of the specific convolution layer. This Filter map serves as an input to the ReLu layer which applies the Rectified Linear unit (ReLu) function to each cell in the convolution layer. The ReLu function relu(x) is defined as:\
$$
\begin{align}
relu(x) = max(0,x)
\end{align}
$$
The output of the ReLu function is also a matrix, same dimension as the convolution filter map and with the ReLu function applied for each element of the matrix. It should be clear from the ReLu function that values of matrix elements in the convolution layer and ReLu layer will be same unless the element is negative, in which case the ReLu layer element will be 0.

Further downstream of the ReLu layer is the pooling layers, whose only task is to reduce the dimensions of the matrix used in the ReLu layer. It does this by applying a pooling function with MaxPool being a common function. The function MaxPool is defined as:
$$
\begin{align}
MaxPool = max\{x_{i,j}, \text{ where } i,j\in M_{N\times N}\}
\end{align}
$$

Hence, MaxPool considers all elements of a matrix (\(M_{N\times N}\)) and uses the numerically highest value element as the value of the pooling layer, converting a matrix to a single scalar value. There are some other pooling functions like average pool which can be used. Once, we have a scalar value we apply the same principles as a standard NN. In this case, a fully connection (FC) layer is added which also multiples the scalar value from the pooling values by a weight. More than 1 FC layers are also used, though unlike Convolutional / ReLu /Pooling layers, FC layers do not represent the input as well since they work on a scalar value.

The output of FC layer is the output of the model and just like in standard NNs the training error is calculated based on which the weight of the FC layer and the each value(weight) of the filter matrix is adjusted using backpropagation after going through some training examples (batch gradient descent) or each training example (stochastic gradient descent) . Determining how backpropagation works in a CNN is outside the scope of this article, and has been addressed in another article in this same blog.

Using the CNN above, let us take a look at the steps involved in computing the output for one training example. The initial CNN is given as:


\(i_00)\) through \(i_11)\)represent the elements of input matrix, which is convolved with a 2x2 filter with weights \(w_00)\) through \(w_11)\). Weights \(w_00)\) through \(w_11)\) are given some initial values using an initialization routine. We start from the top-left of the input matrix and slide the kernel by 2 elements to the right,then 2 elements down (since stride is 2). Remember that we have to rotate the filter by 180°, before computing the Hadamard matrix. In the following figures, the colors super imposed on the input matrix show which elements form the dot product in the Hadamard matrix.

Step 1:

Based on this operation, \(c_{00}\) in the convolution filter map is computed as:
$$
\begin{align}
c_{00}=i_{00}.w_{11} + i_{10}.w_{01}+i_{01}.w_{10} + i_{11}.w_{00}
\end{align}
$$

Step 2: Now let us slide the kernel to the right with a stride of 2,


Value of \(c_{10}\):
$$
\begin{align}
c_{10}=i_{20}.w_{11} + i_{30}.w_{01}+i_{21}.w_{10} + i_{31}.w_{00}
\end{align}
$$
Step 3: Since we cannot go further right in the input matrix, the kernel is moved 2 rows down and back to the first column:


Value of \(c_{01}\):
$$
\begin{align}
c_{01}=i_{02}.w_{11} + i_{12}.w_{01}+i_{03}.w_{10} + i_{13}.w_{00}
\end{align}
$$
Step 4: Sliding the kernel to the right:


Value of \(c_{11}\):

$$
\begin{align}
c_{11}=i_{22}.w_{11} + i_{32}.w_{01}+i_{23}.w_{10} + i_{33}.w_{00}
\end{align}
$$
To standardize all values of the filter map it is a good idea to normalize the filter map values w.r.t the filter. This is done by dividing each filter map by the sum of filter values. Hence the new values of \(c_{00}\) through \(c_{11}\) are:
$$
\begin{align}
c_{00}&={(i_{00}.w_{11} + i_{10}.w_{01}+i_{01}.w_{10} + i_{11}.w_{00}) \over (w_{11} + w_{01} + w_{10} + w_{00})}\\
c_{10}&={i_{20}.w_{11} + i_{30}.w_{01}+i_{21}.w_{10} + i_{31}.w_{00})\over (w_{11} + w_{01} + w_{10} + w_{00})}\\
c_{01}&={(i_{02}.w_{11} + i_{12}.w_{01}+i_{03}.w_{10} + i_{12}.w_{00})\over (w_{11} + w_{01} + w_{10} + w_{00})}\\
c_{11}&={(i_{22}.w_{11} + i_{32}.w_{01}+i_{23}.w_{10} + i_{33}.w_{00})\over (w_{11} + w_{01} + w_{10} + w_{00})}\\
\end{align}
$$

Step 5: Once all elements of the filter map have been calculated let us pass these values to the ReLu layer. Note that in actual implementation, the ReLu function for each filter map element can be called as soon as its value is available, and we do not have to wait for the whole filter map to be computed first.


The elements \(u_{00}\) through \(u_{11}\) of the ReLu layer are defined as:

$$
\begin{align}
u_{00}=max(0,c_{00})\\
u_{01}=max(0,c_{01})\\
u_{10}=max(0,c_{10})\\
u_{11}=max(0,c_{11})\\
\end{align}
$$

Step 6: All element values of the ReLu layer are now passed to the pooling layer which applies the MaxPool function to the values.

The output from the pooling layer \(r_p\) is defined as:
$$
\begin{align}
r_p=max\{u_{00},u_{01},u_{10},u_{11}\}
\end{align}
$$
Step 7: The output of the pooling layer is passed to the FC layer which multiplies the value by a single scalar weight, as in a standard NN:


The output from the FC layer \(r_fc\) is defined as:
$$
\begin{align}
r_{fc}=w_{fc}.r_p
\end{align}
$$
Step 8: In the last step the error E is determined using the values of \(r_fc\) and the actual value of the training example(t). Since CNNs are mainly used for image classification, we can say that the value of t determines the value of the image. As an example, we can say that if t=1 the input matrix is a grayscale image of a cat, and if t=0 the input is an image of a dog. Hence, E is defined as:

$$
\begin{align}
E &= {(t - r_{fc})^2\over 2}\\
\end{align}
$$

This error is backpropagated till the convolution layer, based on which the values of the filter weights \(w_{00}\) through \(w_{11}\), and FC layer weight \(w_fc\) are updated.

In another article, we will see how backpropagation works in the context of a CNN.

Love or hate this article ? Have other ideas ? Please leave a comment below !

Comments

  1. Somehow, there is a confusion of notation/operations - thus, convolution in CNN, is actually a cross correlation operation (from digital processin) and not actually convolution, and therefore you should replace formula (remove kernel flip step) with correct one.

    See: https://datascience.stackexchange.com/questions/40533/convolution-and-cross-correlation-in-cnn

    ReplyDelete

Post a Comment

Popular posts from this blog

Part I: Backpropagation mechanics for a Convolutional Neural Network

Part III: Backpropagation mechanics for a Convolutional Neural Network