Part II: Backpropagation mechanics for a Convolutional Neural Network



In part-I of this article, we derived the weight update equation for a backpropagation operation of a simple Convolutional Neural Network (CNN). The CNN considered in part-I did not use a rectified linear unit (ReLu) layer, and in this article we expand upon the CNN to include a ReLu layer and see how it impacts the backpropagation.

The CNN we use in this article is given below:




In this simple CNN, there is one 4x4 input matrix, one 2x2 filter matrix (also known as kernel), a single convolution layer with 1 unit, a single ReLu layer, a single pooling layer (which applied the MaxPool function) and a single fully connected (FC) layer. The elements of the filter matrix are equivalent to the unit weights in a standard NN and will be updated during the backpropagation phase.

Assuming a stride of 2 with no padding, the size of the convolution layer is determined by the following equation:

N=(WF+2.P)S+1
Here,
N: is the dimension (rows and columns) of the new(convolution) square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the filter square matrix
P: is the number of pixels(cells) of padding added
S: is the stride, or the number of cells skipped each time the kernel is slided

Putting these values in the equation,
N=(42+2.0)2+1=2
Hence the convolution layer will have a matrix of size 2x2.

The Error E is defined as E=(trfc)22, where t is given(expected) output in the training data.

We need find out change in E in relation to the changes in values of weights wfc and w00 through w11 given in the filter.

Let us first find out how E changes in relation to wfc or the rate of change of E w.r.t. wfc. This is also called the derivative of E w.r.t wfc or Ewfc

E=(trfc)22Ewfc=(trfc)22wfc=12.(trfc)2wfc ... eq.(1.1)


As per the chain rule,:
dxdy=dxdz.dzdy
For a partial derivative:
xy=xz.zy
Applying the chain rule to right term of eq.(1.1):
Here, z=trfc,x=z2, and y=wfc ,therefore, 12.(trfc)2wfc=12.xy=12.xz.zy=12.z2z.zy=12.(trfc)2(trfc).(trfc)wfc=12.2(trfc).(trfc)wfc ... eq.(1.2)
Further,
(trfc)wfc=(twfc.rp)wfc=twfcwfc.rpwfc=twfcrp.wfcwfc=0rp (as,Cx=0 and xx=1,where C is a constant),
Based on the above, in eq.(1.2),
12.(trfc)2wfc=12.2(trfc).(trfc)wfc=12.2(trfc).(0rp)=(trfc).(rp)


Therefore,
Ewfc=(trfc)22wfc=12.(trfc)2wfc=(trfc).(rp) ... eq.(1.3)

Further upstream, the only variables that control the value of rfc are in the filter F. Hence, we need to find out how E changes in relation to F or EF.

Since the pooling layer applies the MaxPool function, output rp of pooling layer is given as:
rp=max{u00,u10,u10,u11}
The ReLu function f(x) is defined as: f(x)=max(0,x)
Hence, the values of u00,u10,u10,u11 are given as (see CNN article for more details):
u00=max(0,c00)=max(0,i00.w11+i10.w01+i01.w10+i11.w00)u10=max(0,c10)=max(0,i20.w11+i30.w01+i21.w10+i31.w00)u01=max(0,c01)=max(0,i02.w11+i12.w01+i03.w10+i12.w00)u11=max(0,c11)=max(0,i22.w11+i32.w01+i23.w10+i33.w00)
Note that for the sake of simplicity, we do not normalize values of c00,c10,c10,c11.

When you implement a CNN, the pooling layer has to remember the indices of the convolution layer filter maps that were used as input to the pooling layer. This is the approach that has been used in the Dasmic machine learning library. In this case, let us just assume that u10 was the output of the pooling layer, or:
u10=max{u00,u10,u10,u11}rp=u10 ... eq.(1.4)
To compute EF, we will need to find out how E changes in relation to each element of matrix F, or we need to find,Ew00,Ew10,Ew01,Ew11.

Let us now compute Ew00,

Applying the chain rule,
xy=xz.zyHere, z=u10,x=E, and y=w00 ,therefore, Ew00=Eu10.u10w00 ... eq.(2.1)
Solve for the right term u10w00 in eq.(2.1):
u10w00=(max(0,c10))w00 ... eq.(2.2)
Applying the chain rule to eq.(2.2),
xy=xz.zyHere, z=c10,x=max(0,c10), and y=w00 ,therefore, (max(0,c10))w00=max(0,c10)c10.c10w00 ... eq.(2.3)
Now,
dmax(0,x)dx={x,if x > 00,if x < 0undetermined,if x = 0 ... eq.(2.4) Let us assume that x0. When x=0 there are several ways of dealing with it, including using a slightly modified ReLu function but that is outside the scope of this article. Let's now look at the two cases when x=c10.

Case 1: x<0,

In eq.(2.3),
(max(0,c10))w00=max(0,c10)c10.c10w00=0c10.c10w00 (applying value of max(0,c10) from eq.(2.4))=0.c10w00=0 ... eq.(2.5)
Putting value from eq.(2.5) in eq.(2.2),
u10w00=(max(0,c10))w00=0
Putting value of u10w00 in eq.(2.1),
Ew00=Eu10.u10w00=Eu10.0=0
Case 2: x>0,
In eq.(2.3),
(max(0,c10))w00=max(0,c10)c10.c10w00=c10c10.c10w00 (applying value of max(0,c10)=c10 from eq.(2.3))=1.c10w00=c10w00 ... eq.(2.6)
Solving for eq.(2.6),
c10w00=(i20.w11+i30.w01+i21.w10+i31.w00)w00 (substituting full value of c10)=(i20.w11)w00+(i30.w01)w00+(i21.w10)w00+(i31.w00)w00=(i20.w11)w00+(i30.w01)w00+(i21.w10)w00+i31.w00w00 ... eq.(2.7)
Note that the terms (i20.w11),(i30.w01) and (i21.w10) will not change w.r.t w00 hence, can be treated as a constant C
As Cx=0, in eq.(2.7),
c10w00=(i20.w11)w00+(i30.w01)w00+(i21.w10)w00+i31.w00w00=0+0+0+i31.w00w00=0+0+0+i31(as, xx=0)=i31 ... eq.(2.8)
Putting value from eq.(2.8) in eq.(2.2),
u10w00=(max(0,c10))w00=c10w00 (from eq.(2.6))=i31 (from eq.(2.8)) ... eq.(2.10)
Let us now find the left term of eq. (2.1) Eu10. As,
E=(trfc)22, where t is given(expected) output in the training data,
Eu10=(trfc)22u10=12.(trfc)2u10 ... eq.(2.11)
For a partial derivative, the chain rule is:
xy=xz.zy
Applying the chain rule to right term of eq.(2.11):
Here, z=trfc,x=z2, and y=u10 ,therefore, 12.(trfc)2u10=12.xy=12.xz.zy=12.z2z.zy=12.(trfc)2(trfc).(trfc)u10=12.2(trfc).(trfc)u10 ... eq.(2.12)
Further,
(trfc)u10=(twfc.rp)u10=tu10wfc.rpu10 from eq.(1.4) we know rp=u10,therefore,(trfc)u10=tu10wfc.u10u10=0wfc (as,Cx=0 and xx=1,where C is a constant),
Applying value of (trfc)u10 in eq(2.12),
12.(trfc)2u10=12.2(trfc).(trfc)u10=12.2(trfc).(0wfc)=(trfc).(0wfc)
Applying value of 12.(trfc)2u11 in eq.(2.11),
Eu10=(trfc)22u10=(trfc).(0wfc)=(trfc).(wfc) ... eq.(2.13)

Applying values from eq.(2.8) and eq.(2.13) in eq. (2.1),
Ew00=Eu10.u10w00=Eu10.(i31) (from eq.(2.10))=(trfc).(wfc).(i31) (from eq.(2.5)) ... eq. (2.14)

Eq.(2.14) can also be converted to a general equation that covers both cases 1 and 2. This general equation is also useful if we use a different activation function other than ReLu. This general form of eq. (2.14) is:
Ew00=Eu10.u10w00=(trfc).(wfc).u10w00=(trfc).(wfc).f(c10)w00 ... eq.(2.15)
Here, f(x) is the activation function.

Using similar steps as for Ew00 let us now compute Ew10,

Applying the chain rule as before,
xy=xz.zyHere, z=u10,x=E, and y=w10 ,therefore, Ew10=Eu10.u10w10 ... eq.(3.1)
Solve for the right term u10w10 in eq.(3.1):
u10w10=(max(0,c10))w10 ... eq.(3.2)
Applying the chain rule to eq.(3.2),
xy=xz.zyHere, z=c10,x=max(0,c10), and y=w10 ,therefore, (max(0,c10))w10=max(0,c10)c10.c10w10 ... eq.(3.3)
We already know,
dmax(0,x)dx={1,if x > 00,if x < 0undetermined,if x = 0 ... eq.(3.4)

Case 1: x<0,


In eq.(3.3),
(max(0,c10))w01=max(0,c10)c10.c10w10=0c10.c10w10 (applying value of max(0,c10) from eq.(3.4))=0.c10w10=0 ... eq.(3.5)
Putting value from eq.(3.5) in eq.(3.2),
u10w10=(max(0,c10))w10=0
Putting value of u10w10 in eq.(3.1),
Ew10=Eu10.u10w10=Eu10.0=0
Case 2: x>0,
In eq.(3.3),
(max(0,c10))w10=max(0,c10)c10.c10w10=c10c10.c10w10 (applying value of max(0,c10)=c10 from eq.(3.3))=1.c10w10=c10w10 ... eq.(3.6)
Solving for eq.(3.6),
c10w10=(i20.w11+i30.w01+i21.w10+i31.w00)w10 (substituting full value of c10)=(i20.w11)w10+(i30.w01)w10+(i21.w10)w10+(i31.w00)w10=(i20.w11)w10+(i30.w01)w10+i21.w10w10+(i31.w00)w10 ... eq.(3.7)
Note that the terms (i20.w11),(i30.w01) and (i31.w00) will not change w.r.t w10 hence, can be treated as a constant C
As Cx=0, in eq.(3.7),
c10w10=(i20.w11)w10+(i30.w01)w10+i21.w10w10+(i31.w00)w10=0+0+i21.w10w10+0=0+0+i21+0(as, xx=0)=i21 ... eq.(3.8)
Putting value from eq.(3.8) in eq.(3.2),
u10w10=(max(0,c10))w10=c10w10 (from eq.(3.6))=i21 (from eq.(3.8)) ... eq.(3.10)
Let us now find the left term of eq. (3.1) Eu10. This has already been computed in eq.(2.13) and given as:
Eu10=(trfc).(wfc)
Applying values from eq.(3.8) and eq.(2.13) in eq. (3.1),
Ew10=Eu10.u10w10=Eu10.(i21) (from eq.(3.10))=(trfc).(wfc).(i21) (from eq.(2.13)) ... eq. (3.11)
Eq.(3.11) can also be converted to a general equation that covers both cases 1 and 2. This general equation is also useful if we use a different activation function other than ReLu. This general form of eq. (3.11 ) is:
Ew10=Eu10.u10w10=(trfc).(wfc).u10w10=(trfc).(wfc).f(c10)w10 ... eq.(3.12)
Here, f(x) is the activation function.

It will be clear by now that any Ewxx is the term (trfc).(wfc) multiplied by the element in the input matrix wxx. Therefore, for the remaining weights w01 and w11:
Ew01=(trfc).(wfc).(i30) ... eq.(4.1)Ew11=(trfc).(wfc).(i20) ... eq.(5.1)
and their general form is:
Ew01=(trfc).(wfc).f(c10)w01 ... eq.(4.2)Ew11=(trfc).(wfc).f(c10)w11 ... eq.(5.2)
A further general form to represent eqs.(2.15),(3.12), (4.2) and (5.2) for any weight wxy is:
Ewxy=(trfc).(wfc).f(cij)wxy ... eq.(6.1)
where, f(cij) is the output from the pooling function and wxy is the filter weight.

Weight update rule:

Once the direction of the weight update is determined using Ewxy for a weight wxy, we update the weights based on the same rule as for a standard NN. Therefore,

If η is the learning rate and controls the numeric value by which weights are changed:

The weights for FC layer (wfc) layer will be updated as:
Δwfc=η.Ewfc=η.(trfc).(rp) (from eq.(1.3))=η.(trfc).(rp)wfc=wfc+Δwfc
The weights for filter (w00,w10,w01,w11) will be updated as:
Δw00=η.Ew00=η.(trfc).(wfc).(f(c10)w00) (from eq.(2.15))=η.(trfc).(wfc).(f(c10)w00)w00=w00+Δw00
Δw10=η.Ew10=η.(trfc).(wfc).(f(c10)w10) (from eq.(3.6))=η.(trfc).(wfc).(f(c10)w10)w10=w10+Δw10
Δw01=η.Ew01=η.(trfc).(wfc).(f(c10)w01) (from eq.(4.2))=η.(trfc).(wfc).(f(c10)w01)w01=w01+Δw01
Δw11=η.Ew11=η.(trfc).(wfc).(f(c10)w11) (from eq.(5.2))=η.(trfc).(wfc).(f(c10)w11)w11=w11+Δw11

Note that if the output of the pooling layer is different than c10, you have to change the input values (ixy) used in the weight update rule.

This concludes the derivation of backpropagation for a simple CNN with the ReLu activation function. In part-III, we take a look at backpropagation for our simple CNN extended to have 3 input matrices representing the RGB values of an image.


Comments

Popular posts from this blog

Part I: Backpropagation mechanics for a Convolutional Neural Network

Introducing Convolution Neural Networks with a simple architecture

Python: Sockets Programming - Non-blocking Client