Part I: Backpropagation mechanics for a Convolutional Neural Network



In another article, we explained the basic mechanism of how a Convolutional Neural Network (CNN) works. In this article we explain the mechanics of backpropagation for a CNN and derive its value. We use the same simple CNN as in the previous article, except that to keep things simple we remove the ReLu layer. In Part II of this article we derive the backpropagation in the same CNN with the addition of a ReLu layer.

The CNN we use is given below:




In this simple CNN, there is one 4x4 input matrix, one 2x2 filter matrix (also known as a kernel), a single convolution layer with 1 unit, a single pooling layer (which applies the MaxPool function) and a single fully connected (FC) layer. The elements of the filter matrix are equivalent to the unit weights in a standard NN and will be updated during the backpropagation phase.

Assuming a stride of 2 with no padding, the size of the convolution layer is determined by the following equation:

$$N = \frac{W - F + 2 \cdot P}{S} + 1$$
Here,
N: is the dimension (rows and columns) of the new (convolution) square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the filter square matrix
P: is the number of pixels (cells) of padding added
S: is the stride, or the number of cells the kernel is moved each time it slides

Putting these values in the equation,
$$N = \frac{4 - 2 + 2 \cdot 0}{2} + 1 = 2$$
Hence the convolution layer will have a matrix of size 2x2.
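As a quick illustration, here is a minimal Python sketch of the output-size formula applied to our example; the helper name conv_output_size is my own and not from any library.

def conv_output_size(W, F, P, S):
    # Side length of the square convolution output: N = (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=4, F=2, P=0, S=2))  # -> 2, i.e. a 2x2 convolution layer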

The error $E$ is defined as $E = \frac{(t - r_{fc})^2}{2}$, where $t$ is the given (expected) output in the training data and $r_{fc} = w_{fc} \cdot r_p$ is the output of the FC layer ($r_p$ being the output of the pooling layer).

We need to find out the change in $E$ in relation to changes in the value of the FC weight $w_{fc}$ and of the filter weights $w_{00}$ through $w_{11}$.

Let us first find out how $E$ changes in relation to $w_{fc}$, or the rate of change of $E$ w.r.t. $w_{fc}$. This is also called the derivative of $E$ w.r.t. $w_{fc}$, or $\frac{\partial E}{\partial w_{fc}}$.

$$E = \frac{(t - r_{fc})^2}{2}$$
$$\frac{\partial E}{\partial w_{fc}} = \frac{\partial}{\partial w_{fc}}\frac{(t - r_{fc})^2}{2} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial w_{fc}} \quad \dots \text{eq.(1.1)}$$


As per the chain rule:
$$\frac{dx}{dy} = \frac{dx}{dz}\cdot\frac{dz}{dy}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Applying the chain rule to the right term of eq.(1.1), with $z = t - r_{fc}$, $x = z^2$, and $y = w_{fc}$:
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial w_{fc}} = \frac{1}{2}\cdot\frac{\partial x}{\partial y} = \frac{1}{2}\cdot\frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial z^2}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial (t - r_{fc})}\cdot\frac{\partial (t - r_{fc})}{\partial w_{fc}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial w_{fc}} \quad \dots \text{eq.(1.2)}$$
Further,
Substituting $r_{fc} = w_{fc} \cdot r_p$,
$$\frac{\partial (t - r_{fc})}{\partial w_{fc}} = \frac{\partial (t - w_{fc} \cdot r_p)}{\partial w_{fc}} = \frac{\partial t}{\partial w_{fc}} - \frac{\partial (w_{fc} \cdot r_p)}{\partial w_{fc}} = \frac{\partial t}{\partial w_{fc}} - r_p\cdot\frac{\partial w_{fc}}{\partial w_{fc}} = 0 - r_p$$
(as $\frac{\partial C}{\partial x} = 0$ and $\frac{\partial x}{\partial x} = 1$, where $C$ is a constant)
Based on the above, in eq.(1.2),
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial w_{fc}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial w_{fc}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot(0 - r_p) = -(t - r_{fc})\cdot r_p$$


Therefore,
$$\frac{\partial E}{\partial w_{fc}} = \frac{\partial}{\partial w_{fc}}\frac{(t - r_{fc})^2}{2} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial w_{fc}} = -(t - r_{fc})\cdot r_p$$
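As a sanity check (my own addition, not part of the original derivation), the same result can be obtained symbolically with sympy, treating t, w_fc and r_p as independent symbols:

import sympy as sp

t, w_fc, r_p = sp.symbols('t w_fc r_p')
E = (t - w_fc * r_p)**2 / 2    # E = (t - r_fc)^2 / 2 with r_fc = w_fc * r_p
print(sp.diff(E, w_fc))        # -> -r_p*(t - r_p*w_fc), i.e. -(t - r_fc) * r_p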

Further upstream, the only variables that control the value of $r_{fc}$ are the weights in the filter $F$. Hence, we also need to find out how $E$ changes in relation to $F$, or $\frac{\partial E}{\partial F}$.

Since the pooling layer applies the MaxPool function, the output $r_p$ of the pooling layer is given as:
$$r_p = \max\{c_{00}, c_{10}, c_{01}, c_{11}\}$$
The values of $c_{00}, c_{10}, c_{01}, c_{11}$ are given as (see the CNN article for more details):
$$c_{00} = i_{00}\cdot w_{11} + i_{10}\cdot w_{01} + i_{01}\cdot w_{10} + i_{11}\cdot w_{00}$$
$$c_{10} = i_{20}\cdot w_{11} + i_{30}\cdot w_{01} + i_{21}\cdot w_{10} + i_{31}\cdot w_{00}$$
$$c_{01} = i_{02}\cdot w_{11} + i_{12}\cdot w_{01} + i_{03}\cdot w_{10} + i_{13}\cdot w_{00}$$
$$c_{11} = i_{22}\cdot w_{11} + i_{32}\cdot w_{01} + i_{23}\cdot w_{10} + i_{33}\cdot w_{00}$$
Note that for the sake of simplicity, we do not normalize the values of $c_{00}, c_{10}, c_{01}, c_{11}$.
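For concreteness, here is a minimal Python sketch of this forward convolution step (stride 2, no padding, kernel indexed exactly as in the equations above). The array names I and W, the dummy values, and the decision to let I[x, y] and W[x, y] play the roles of $i_{xy}$ and $w_{xy}$ are my own illustration choices.

import numpy as np

I = np.arange(1.0, 17.0).reshape(4, 4)   # I[x, y] plays the role of i_xy (dummy values 1..16)
W = np.array([[0.4, 0.2],                # W[x, y] plays the role of w_xy
              [0.3, 0.1]])

def conv_cell(x0, y0):
    # One 2x2 window starting at (x0, y0), combined with the filter as in the equations above
    return (I[x0, y0]     * W[1, 1] + I[x0 + 1, y0]     * W[0, 1] +
            I[x0, y0 + 1] * W[1, 0] + I[x0 + 1, y0 + 1] * W[0, 0])

c00, c10 = conv_cell(0, 0), conv_cell(2, 0)   # stride 2: windows start at 0 and 2
c01, c11 = conv_cell(0, 2), conv_cell(2, 2)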

When you implement a CNN, the pooling layer has to remember the index of the convolution-layer cell that was selected as the pooling output, so that the gradient can later be routed back to it. This is the approach used in the Dasmic machine learning library. In this case, let us simply assume that $c_{10}$ was the output of the pooling layer, or:
$$c_{10} = \max\{c_{00}, c_{10}, c_{01}, c_{11}\} \implies r_p = c_{10} \quad \dots \text{eq.(1.3)}$$
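A minimal sketch of such a MaxPool step that remembers which cell won is given below; the function name and the dictionary representation are my own, and the Dasmic library's actual implementation may differ.

def max_pool_with_index(cells):
    # cells: dict mapping a label such as 'c10' to its value.
    # Returns (r_p, winning_label) so backprop knows where to send the gradient.
    label = max(cells, key=cells.get)
    return cells[label], label

r_p, max_label = max_pool_with_index({'c00': 1.0, 'c10': 3.5, 'c01': 2.0, 'c11': 0.5})
print(r_p, max_label)   # -> 3.5 c10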
To compute $\frac{\partial E}{\partial F}$, we need to find out how $E$ changes in relation to each element of the matrix $F$; that is, we need to find $\frac{\partial E}{\partial w_{00}}, \frac{\partial E}{\partial w_{10}}, \frac{\partial E}{\partial w_{01}}, \frac{\partial E}{\partial w_{11}}$.

Let us now compute $\frac{\partial E}{\partial w_{00}}$.

Applying the chain rule,
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Here, $z = c_{10}$, $x = E$, and $y = w_{00}$, therefore,
$$\frac{\partial E}{\partial w_{00}} = \frac{\partial E}{\partial c_{10}}\cdot\frac{\partial c_{10}}{\partial w_{00}} \quad \dots \text{eq.(2.1)}$$
Solving for the right term $\frac{\partial c_{10}}{\partial w_{00}}$ in eq.(2.1), substituting the value of $c_{10}$:
$$\frac{\partial c_{10}}{\partial w_{00}} = \frac{\partial (i_{20}\cdot w_{11} + i_{30}\cdot w_{01} + i_{21}\cdot w_{10} + i_{31}\cdot w_{00})}{\partial w_{00}} = \frac{\partial (i_{20}\cdot w_{11})}{\partial w_{00}} + \frac{\partial (i_{30}\cdot w_{01})}{\partial w_{00}} + \frac{\partial (i_{21}\cdot w_{10})}{\partial w_{00}} + i_{31}\cdot\frac{\partial w_{00}}{\partial w_{00}}$$
Note that the terms $(i_{20}\cdot w_{11})$, $(i_{30}\cdot w_{01})$ and $(i_{21}\cdot w_{10})$ do not change w.r.t. $w_{00}$ and hence can each be treated as a constant $C$. As $\frac{\partial C}{\partial x} = 0$,
$$\frac{\partial c_{10}}{\partial w_{00}} = 0 + 0 + 0 + i_{31}\cdot\frac{\partial w_{00}}{\partial w_{00}} = i_{31} \quad \left(\text{as } \frac{\partial x}{\partial x} = 1\right) \quad \dots \text{eq.(2.2)}$$
Let us now find the left term of eq.(2.1), $\frac{\partial E}{\partial c_{10}}$. As
$$E = \frac{(t - r_{fc})^2}{2},$$
where $t$ is the given (expected) output in the training data,
$$\frac{\partial E}{\partial c_{10}} = \frac{\partial}{\partial c_{10}}\frac{(t - r_{fc})^2}{2} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} \quad \dots \text{eq.(2.3)}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Applying the chain rule to the right term of eq.(2.3), with $z = t - r_{fc}$, $x = z^2$, and $y = c_{10}$:
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} = \frac{1}{2}\cdot\frac{\partial x}{\partial y} = \frac{1}{2}\cdot\frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial z^2}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial (t - r_{fc})}\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} \quad \dots \text{eq.(2.4)}$$
Further,
$$\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{\partial (t - w_{fc}\cdot r_p)}{\partial c_{10}} = \frac{\partial t}{\partial c_{10}} - w_{fc}\cdot\frac{\partial r_p}{\partial c_{10}}$$
From eq.(1.3) we know $r_p = c_{10}$, so
$$\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{\partial t}{\partial c_{10}} - w_{fc}\cdot\frac{\partial c_{10}}{\partial c_{10}} = 0 - w_{fc}$$
(as $\frac{\partial C}{\partial x} = 0$ and $\frac{\partial x}{\partial x} = 1$, where $C$ is a constant)
Applying the value of $\frac{\partial (t - r_{fc})}{\partial c_{10}}$ in eq.(2.4),
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot(0 - w_{fc}) = (t - r_{fc})\cdot(0 - w_{fc})$$
Applying the value of $\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}}$ in eq.(2.3),
$$\frac{\partial E}{\partial c_{10}} = \frac{\partial}{\partial c_{10}}\frac{(t - r_{fc})^2}{2} = (t - r_{fc})\cdot(0 - w_{fc}) = -(t - r_{fc})\cdot w_{fc} \quad \dots \text{eq.(2.5)}$$

Applying the values from eq.(2.2) and eq.(2.5) in eq.(2.1),
$$\frac{\partial E}{\partial w_{00}} = \frac{\partial E}{\partial c_{10}}\cdot\frac{\partial c_{10}}{\partial w_{00}} = \frac{\partial E}{\partial c_{10}}\cdot i_{31} \;\;\text{(from eq.(2.2))} = -(t - r_{fc})\cdot w_{fc}\cdot i_{31} \;\;\text{(from eq.(2.5))} \quad \dots \text{eq.(2.6)}$$
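Eq.(2.6) can be double-checked symbolically with sympy (again my own verification, not part of the original article):

import sympy as sp

t, w_fc = sp.symbols('t w_fc')
w00, w10, w01, w11 = sp.symbols('w00 w10 w01 w11')
i20, i30, i21, i31 = sp.symbols('i20 i30 i21 i31')

c10 = i20*w11 + i30*w01 + i21*w10 + i31*w00   # the cell that won the MaxPool
E = (t - w_fc * c10)**2 / 2                   # r_p = c10, r_fc = w_fc * r_p
print(sp.diff(E, w00))                        # -> -i31*w_fc*(t - w_fc*(...)), i.e. -(t - r_fc)*w_fc*i31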

Using similar steps as for $\frac{\partial E}{\partial w_{00}}$, let us now compute $\frac{\partial E}{\partial w_{10}}$.

Applying the chain rule as before,
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Here, $z = c_{10}$, $x = E$, and $y = w_{10}$, therefore,
$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial c_{10}}\cdot\frac{\partial c_{10}}{\partial w_{10}} \quad \dots \text{eq.(3.1)}$$
Solving for the right term $\frac{\partial c_{10}}{\partial w_{10}}$ in eq.(3.1), substituting the value of $c_{10}$:
$$\frac{\partial c_{10}}{\partial w_{10}} = \frac{\partial (i_{20}\cdot w_{11} + i_{30}\cdot w_{01} + i_{21}\cdot w_{10} + i_{31}\cdot w_{00})}{\partial w_{10}} = \frac{\partial (i_{20}\cdot w_{11})}{\partial w_{10}} + \frac{\partial (i_{30}\cdot w_{01})}{\partial w_{10}} + i_{21}\cdot\frac{\partial w_{10}}{\partial w_{10}} + \frac{\partial (i_{31}\cdot w_{00})}{\partial w_{10}}$$
Note that $(i_{20}\cdot w_{11})$, $(i_{30}\cdot w_{01})$ and $(i_{31}\cdot w_{00})$ do not change w.r.t. $w_{10}$ and hence can each be treated as a constant $C$. As $\frac{\partial C}{\partial x} = 0$,
$$\frac{\partial c_{10}}{\partial w_{10}} = 0 + 0 + i_{21}\cdot\frac{\partial w_{10}}{\partial w_{10}} + 0 = i_{21} \quad \left(\text{as } \frac{\partial x}{\partial x} = 1\right) \quad \dots \text{eq.(3.2)}$$
Let us now find the left term of eq.(3.1), $\frac{\partial E}{\partial c_{10}}$. As
$$E = \frac{(t - r_{fc})^2}{2},$$
where $t$ is the given (expected) output in the training data,
$$\frac{\partial E}{\partial c_{10}} = \frac{\partial}{\partial c_{10}}\frac{(t - r_{fc})^2}{2} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} \quad \dots \text{eq.(3.3)}$$
For a partial derivative:
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Applying the chain rule to the right term of eq.(3.3), with $z = t - r_{fc}$, $x = z^2$, and $y = c_{10}$:
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} = \frac{1}{2}\cdot\frac{\partial x}{\partial y} = \frac{1}{2}\cdot\frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial z^2}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial (t - r_{fc})}\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} \quad \dots \text{eq.(3.4)}$$
Further,
$$\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{\partial (t - w_{fc}\cdot r_p)}{\partial c_{10}} = \frac{\partial t}{\partial c_{10}} - w_{fc}\cdot\frac{\partial r_p}{\partial c_{10}}$$
From eq.(1.3) we know $r_p = c_{10}$, so
$$\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{\partial t}{\partial c_{10}} - w_{fc}\cdot\frac{\partial c_{10}}{\partial c_{10}} = 0 - w_{fc}$$
(as $\frac{\partial C}{\partial x} = 0$ and $\frac{\partial x}{\partial x} = 1$, where $C$ is a constant)
Applying the value of $\frac{\partial (t - r_{fc})}{\partial c_{10}}$ in eq.(3.4),
$$\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot\frac{\partial (t - r_{fc})}{\partial c_{10}} = \frac{1}{2}\cdot 2(t - r_{fc})\cdot(0 - w_{fc}) = (t - r_{fc})\cdot(0 - w_{fc})$$
Applying the value of $\frac{1}{2}\cdot\frac{\partial (t - r_{fc})^2}{\partial c_{10}}$ in eq.(3.3),
$$\frac{\partial E}{\partial c_{10}} = \frac{\partial}{\partial c_{10}}\frac{(t - r_{fc})^2}{2} = (t - r_{fc})\cdot(0 - w_{fc}) = -(t - r_{fc})\cdot w_{fc} \quad \dots \text{eq.(3.5)}$$

Applying the values from eq.(3.2) and eq.(3.5) in eq.(3.1),
$$\frac{\partial E}{\partial w_{10}} = \frac{\partial E}{\partial c_{10}}\cdot\frac{\partial c_{10}}{\partial w_{10}} = \frac{\partial E}{\partial c_{10}}\cdot i_{21} \;\;\text{(from eq.(3.2))} = -(t - r_{fc})\cdot w_{fc}\cdot i_{21} \;\;\text{(from eq.(3.5))} \quad \dots \text{eq.(3.6)}$$
It should be clear by now that any $\frac{\partial E}{\partial w_{xy}}$ is the term $-(t - r_{fc})\cdot w_{fc}$ multiplied by the element of the input matrix that was paired with $w_{xy}$ when computing $c_{10}$. Therefore, for the remaining weights $w_{01}$ and $w_{11}$:
$$\frac{\partial E}{\partial w_{01}} = -(t - r_{fc})\cdot w_{fc}\cdot i_{30} \quad \dots \text{eq.(4.1)}$$
$$\frac{\partial E}{\partial w_{11}} = -(t - r_{fc})\cdot w_{fc}\cdot i_{20} \quad \dots \text{eq.(5.1)}$$
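Putting eqs.(2.6), (3.6), (4.1) and (5.1) together, here is a small numeric Python sketch, assuming $c_{10}$ won the MaxPool; all values are made-up illustrations, and the finite-difference check at the end is my own addition.

t, w_fc = 1.0, 0.5                        # training target and FC weight
i20, i30, i21, i31 = 0.2, 0.4, 0.6, 0.8   # input cells that fed c10
w11, w01, w10, w00 = 0.1, 0.2, 0.3, 0.4   # current filter weights

c10 = i20*w11 + i30*w01 + i21*w10 + i31*w00
r_p = c10                                 # MaxPool picked c10
r_fc = w_fc * r_p                         # FC layer output

dE_dw00 = -(t - r_fc) * w_fc * i31        # eq.(2.6)
dE_dw10 = -(t - r_fc) * w_fc * i21        # eq.(3.6)
dE_dw01 = -(t - r_fc) * w_fc * i30        # eq.(4.1)
dE_dw11 = -(t - r_fc) * w_fc * i20        # eq.(5.1)

# Finite-difference check of dE/dw00
eps = 1e-6
def E(w00_):
    return (t - w_fc * (i20*w11 + i30*w01 + i21*w10 + i31*w00_))**2 / 2
print(dE_dw00, (E(w00 + eps) - E(w00 - eps)) / (2 * eps))  # the two numbers should agree closely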
In the article on introduction to backpropagation, we studied the backpropagation mechanics for this simple NN:


Here,
$$\frac{\partial E}{\partial w_h} = -(t - r_o)\cdot(w_o\cdot i)$$

The derivatives $\frac{\partial E}{\partial w_{00}}, \frac{\partial E}{\partial w_{01}}, \frac{\partial E}{\partial w_{10}}, \frac{\partial E}{\partial w_{11}}$ have the same form: we multiply the error between the training (expected) output and the computed output by the weight of the downstream layer and by the input.

Weight update rule:

Once the direction of the weight update is determined using $\frac{\partial E}{\partial w_{xy}}$ for a weight $w_{xy}$, we update the weights based on the same rule as for a standard NN. Therefore,

If $\eta$ is the learning rate, which controls the numeric value by which the weights are changed:

The weight for the FC layer ($w_{fc}$) will be updated as:
$$\Delta w_{fc} = -\eta\cdot\frac{\partial E}{\partial w_{fc}} = -\eta\cdot(-(t - r_{fc})\cdot r_p) = \eta\cdot(t - r_{fc})\cdot r_p \quad \text{(from the result for } \tfrac{\partial E}{\partial w_{fc}} \text{ derived above)}$$
$$w_{fc} = w_{fc} + \Delta w_{fc}$$
The weights of the filter ($w_{00}, w_{10}, w_{01}, w_{11}$) will be updated as:
$$\Delta w_{00} = -\eta\cdot\frac{\partial E}{\partial w_{00}} = -\eta\cdot(-(t - r_{fc})\cdot w_{fc}\cdot i_{31}) = \eta\cdot(t - r_{fc})\cdot w_{fc}\cdot i_{31} \quad \text{(from eq.(2.6))}$$
$$w_{00} = w_{00} + \Delta w_{00}$$
$$\Delta w_{10} = -\eta\cdot\frac{\partial E}{\partial w_{10}} = \eta\cdot(t - r_{fc})\cdot w_{fc}\cdot i_{21} \quad \text{(from eq.(3.6))}$$
$$w_{10} = w_{10} + \Delta w_{10}$$
$$\Delta w_{01} = -\eta\cdot\frac{\partial E}{\partial w_{01}} = \eta\cdot(t - r_{fc})\cdot w_{fc}\cdot i_{30} \quad \text{(from eq.(4.1))}$$
$$w_{01} = w_{01} + \Delta w_{01}$$
$$\Delta w_{11} = -\eta\cdot\frac{\partial E}{\partial w_{11}} = \eta\cdot(t - r_{fc})\cdot w_{fc}\cdot i_{20} \quad \text{(from eq.(5.1))}$$
$$w_{11} = w_{11} + \Delta w_{11}$$

Note that if the output of the pooling layer is different from $c_{10}$, you have to change the input values ($i_{xy}$) used in the weight update rule accordingly.
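As a final illustration, here is a minimal Python sketch of the update step, again assuming $c_{10}$ is the MaxPool winner; eta and all other values are placeholders, and the deltas are computed before any weight is modified.

eta = 0.01                                  # learning rate (placeholder value)
t, r_p, w_fc = 1.0, 0.6, 0.5                # target, pooling output and FC weight (made up)
i20, i30, i21, i31 = 0.2, 0.4, 0.6, 0.8     # inputs that fed c10, the MaxPool winner
w00, w10, w01, w11 = 0.4, 0.3, 0.2, 0.1     # current filter weights
r_fc = w_fc * r_p                           # FC layer output

# Delta w = -eta * dE/dw, so the minus signs in the gradients cancel
d_wfc = eta * (t - r_fc) * r_p
d_w00 = eta * (t - r_fc) * w_fc * i31
d_w10 = eta * (t - r_fc) * w_fc * i21
d_w01 = eta * (t - r_fc) * w_fc * i30
d_w11 = eta * (t - r_fc) * w_fc * i20

w_fc += d_wfc
w00 += d_w00; w10 += d_w10; w01 += d_w01; w11 += d_w11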

This concludes the derivation of backpropagation for our simple CNN. In Part II, we take a look at how backpropagation is affected by adding a rectified linear unit (ReLu) activation layer.



