Part IV: Backpropagation mechanics for a Convolutional Neural Network



In part-III of this series, we derived the weight update equation for the backpropagation operation of a simple Convolutional Neural Network (CNN) using 3 inputs matrices representing the red, blue and green spectrum of an image. In this article, we derive the backpropagation with 2 consecutive convolution and ReLu layers, and subsequently derive a general backpropagation equation for deep-learning (multi-layered) CNN. To keep our equations simple (and readable), we revert back to only 1 input matrix. This CNN is similar to one used in part-II, except there are 2 convolution and ReLu layers.

The CNN we use is represented by:




In this CNN, there are 1 4x4 input matrices, two 2x2 filter matrices (or kernels), two convolution layers , two ReLu layers, a single pooling layer (which applies the MaxPool function) and a single fully connected (FC) layer. The elements of the filter matrices are equivalent to the unit weights in a standard NN and will be updated during the backpropagation phase.

Assuming a stride of 1 with no padding, the size of the convolution layers is determined by the following equation:

N=(WF+2.P)S+1
Here,
N: is the dimension (rows and columns) of the new(convolution) square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the filter square matrix
P: is the number of pixels(cells) of padding added
S: is the stride, or the number of cells skipped each time the kernel is slided

Putting these values in the equation for the 1st convolution layer,
N=(42+2.0)1+1=3
Hence the Convolution-1 convolution layer will have a matrix of size 3x3, which will serve as an input to the ReLu-1 layer with the same dimension. This 3x3 input layer will be the input matrix to the next downstream convolution, who's size is determined by the same equation. Hence for the 2nd convolution layer with a stride of 1,
N=(32+2.0)1+1=2
Therefore, the second convolution layer will be of size 2x2, which is also the size of the downstream ReLu layer.

The Error E is defined as E=(trfc)22, where t is given(expected) output in the training data.

We need to find out the change in E in relation to the changes in values of weights of the FC layer, wfc, as well as weights of the filters, w100 through w111 and, w00 through w11.

In part-II, we used the following CNN:


Downstream of the ReLu-1 layer, the architecture of this CNN is same as the one we use in this article. Therefore the derivations of Ewfc,Ew00,Ew10,Ew01 and Ew11 derived in part-II can be used here.

In part-II, we assumed that u10 is the result of the pooling layer (rp=u10), based on which the following was derived:

Ewfc=(trfc).(rp) ... eq.(1.1)Ew00=(trfc).(wfc).u10w00 ... eq.(1.2)Ew10=(trfc).(wfc).u10w10 ... eq.(1.3)Ew01=(trfc).(wfc).u10w00 ... eq.(1.4)Ew11=(trfc).(wfc).u10w11 ... eq.(1.5)
where,
u10=max(0,c10)and,c10=u110.w11+u120.w01+u111.w10+u121.w00hence,u10=max(0,u110.w11+u120.w01+u111.w10+u121.w00)
Notice that the values of c10 between part-II and here are different because the input matrix to the 2x2 convolution layer is not the same.

With values of Ewfc,Ew00,Ew10,Ew01 and Ew11 known from previous work in part-II, we now have to find values of Ew100,Ew110,Ew101 and Ew111.

Since u10 is the result of the pooling layer, E will only be impacted by values that directly affect the values of u10. The following diagram highlights (in purple) all values in the CNN that affect the value of u10.


Note that in this CNN, c10=u110.w11+u120.w01+u111.w10+u121.w00

Let us now compute Ew100,

Applying the chain rule,
xy=xz.zyHere, z=u10,x=E, and y=w100 ,therefore, Ew100=Eu10.u10w100 ... eq.(2.1)
Solving for the right term u10w100 in eq.(2.1):
u10w100=(max(0,c10))w100 ... eq.(2.2)
Now,
dmax(0,x)dx={x,if x > 00,if x < 0undetermined,if x = 0 ... eq.(2.4) Let us assume that x0. When x=0 there are several ways of dealing with it, including using a slightly modified ReLu function but that is outside the scope of this article. Let's now look at the two cases with x=c10,


Case 1: x<0,
In eq.(2.2),
(max(0,c10))w100=0w100 (from eq.(2.4))=0 ... eq.(2.5)
Case 2: x>0,
In eq.(2.2),
(max(0,c10))w100=c10w100 (from eq.(2.4)) ... eq.(2.6)
Solving for eq.(2.6),
c10w100=(u110.w11+u120.w01+u111.w10+u121.w00)w100 (substituting full value of c10)=(u110.w11)w100+(u120.w01)w100+(u111.w10)w100+(u121.w00)w100=(max(c110,0).w11)w100+(max(c120,0).w01)w100+(max(c111,0).w10)w100+(max(c121,0).w00)w100=w11.(max(c110,0))w100+w01.(max(c120,0))w100+w10.(max(c111,0))w100+w00.(max(c121,0))w100 ... eq.(2.7)
We will now have to expand the values of c110 through c121. These values are given as:


c110=i10.w111+i20.w101+i11.w110+i21.w100c120=i20.w111+i30.w101+i21.w110+i31.w100c111=i11.w111+i21.w101+i12.w110+i22.w100c121=i21.w111+i31.w101+i22.w110+i32.w100

Let us now find the partial derivative of each individual term in eq. (2.7), starting with w11.(max(c110,0))w100. There will be 2 cases depending on value of c110. We will not consider the case when c110=0.

Case 1: c110<0,

w11.(max(c110,0))w100=w11.0w100 (as,max(c110,0)=0)=0 ...eq.(2.7.1.1)

Case 2: c110>0,

w11.(max(c110,0))w100=w11.(i10.w111+i20.w101+i11.w110+i21.w100)w100 (as,max(c110,0)=c110, putting full value of c110)=w11.(i10.w111)w100+(i20.w101)w100+(i11.w110)w100+(i21.w100)w100=w11.0+0+0+i21=w11.i21 ...eq.(2.7.1.2)
Let us now find w01.(max(c120,0))w100, with the 2 cases.We will not consider the case when c120=0.

Case 1: c120<0,

w01.(max(c120,0))w100=w11.0w100 (as,max(c120,0)=0)=0 ...eq.(2.7.2.1)

Case 2: c120>0,

w01.(max(c120,0))w100=w01.(i20.w111+i30.w101+i21.w110+i31.w100)w100 (as,max(c120,0)=c120, putting full value of c120)=w01.(i20.w111)w100+(i30.w101)w100+(i21.w110)w100+(i31.w100)w100=w01.0+0+0+i31=w01.i31 ...eq.(2.7.2.2)
Let us now find w10.(max(c111,0))w100, with the 2 cases. We will not consider the case when c111=0.

Case 1: c111<0,

w10.(max(c111,0))w100=w10.0w100 (as,max(c111,0)=0)=0 ...eq.(2.7.3.1)

Case 2: c111>0,

w10.(max(c111,0))w100=w10.(i11.w111+i21.w101+i12.w110+i22.w100)w100 (as,max(c111,0)=c111, putting full value of c111)=w10.(i11.w111)w100+(i21.w101)w100+(i12.w110)w100+(i22.w100)w100=w10.0+0+0+i22=w10.i22 ...eq.(2.7.3.2)
Finally, let us find w00.(max(c121,0))w100, with the 2 cases. We will not consider the case when c121=0.

Case 1: c121<0,

w00.(max(c121,0))w100=w00.0w100 (as,max(c111,0)=0)=0 ...eq.(2.7.4.1)

Case 2: c121>0,

w00.(max(c121,0))w100=w00.(i21.w111+i31.w101+i22.w110+i32.w100)w100 (as,max(c121,0)=c121, putting full value of c121)=w00.(i21.w111)w100+(i31.w101)w100+(i22.w110)w100+(i32.w100)w100=w00.0+0+0+i32=w00.i32 ...eq.(2.7.4.2)
Putting the values of eqs.(2.7.x.x) in eq. (2.7),
c10w100=w11.(max(c110,0))w100+w01.(max(c120,0))w100+w10.(max(c111,0))w100+w00.(max(c121,0))w100 (from eq.(2.7))c10w100=w11.i21+w01.i31+w10.i22+w00.i32 (from eqs.(2.7.x.x)) ... eq.(2.8)
Note that eq.(2.8) assumes that c110,c120,c111,c121 are all >0. If any of c110,c120,c111,c121 are <0, then the corresponding multiple term (i21,i31,i22,i32) should be made 0.

Generic form of eq.(2.7),
c10w100=w11.(max(c110,0))w100+w01.(max(c120,0))w100+w10.(max(c111,0))w100+w00.(max(c121,0))w100 (from eq.(2.7))=w11.u110w100+w01.u120w100+w10.u111w100+w00.u121w100=w11.f(c110)w100+w01.f(c120)w100+w10.f(c111)w100+w00.f(u121)w100... eq.(2.9)where,f(c1xy) is the activation function (like ReLu etc.)
Going back to eq.(2.1),
Ew100=Eu10.u10w100 ... eq.(2.1)
Let us now solve for the left term of eq.(2.1), which is Eu10,

Eu10=(trfc)22u10 (putting value of E)=12.(trfc)2u10
Applying the chain rule to the above equation:
Here, z=trfc,x=z2, and y=u10 ,therefore, 12.(trfc)2u10=12.xy=12.xz.zy=12.z2z.zy=12.(trfc)2(trfc).(trfc)u10=12.2(trfc).(trfc)u10 ... eq.(2.1.1)
Further,
(trfc)u10=(twfc.rp)u10=tu10wfc.rpu10=tu10wfc.rpu10=tu10wfc.u10u10 as,(r[=u10)=0wfc (as,Cx=0 and xx=1,where C is a constant), ... eq.(2.1.2)
Applying eq.(2.1.2) in eq.(2.1.1),
12.(trfc)2u10=12.2(trfc).(trfc)u10=12.2(trfc).(0wfc)=(trfc).(wfc)=(trfc).wfc

Therefore,
Eu10=(trfc)22u10=12.(trfc)2u10=(trfc).wfc ... eq.(2.1.3)
Putting values of eqs.(2.1.3) and (2.9) in eq.(2.1),
Ew100=Eu10.u10w100(from eq.(2.1))=(trfc).wfc.u10w100 (putting value from eq.(2.1.3)) ... eq.(2.10)=(trfc).wfc.(w11.f(c110)w100+w01.f(c120)w100+w10.f(c111)w100+w00.f(u121)w100) ...eq.(2.11) (as,u10w100=c10w100 if c10>0, and putting value of c10w100from eq.(2.9))
Let us now compute Ew110, and after that we will come up with a generic form to derive Ew101 and Ew111,

Applying the chain rule,
xy=xz.zyHere, z=u10,x=E, and y=w110 ,therefore, Ew110=Eu10.u10w110 ... eq.(3.1)
Solving for the right term u10w110 in eq.(3.1):
u10w110=(max(0,c10))w110 ... eq.(3.2)
Now,
dmax(0,x)dx={x,if x > 00,if x < 0undetermined,if x = 0 ... eq.(3.4) Let us assume that x0. When x=0 there are several ways of dealing with it, including using a slightly modified ReLu function but that is outside the scope of this article. Let's now look at the two cases with x=c10,


Case 1: x<0,
In eq.(3.2),
(max(0,c10))w110=0w110 (from eq.(3.4))=0 ... eq.(3.5)
Case 2: x>0,
In eq.(3.2),
(max(0,c10))w100=c10w100 (from eq.(3.4)) ... eq.(3.6)
Solving for eq.(3.6),
c10w110=(u110.w11+u120.w01+u111.w10+u121.w00)w110 (substituting full value of c10)=(u110.w11)w110+(u120.w01)w110+(u111.w10)w110+(u121.w00)w110=(max(c110,0).w11)w110+(max(c120,0).w01)w110+(max(c111,0).w10)w110+(max(c121,0).w00)w110=w11.(max(c110,0))w110+w01.(max(c120,0))w110+w10.(max(c111,0))w110+w00.(max(c121,0))w110 ... eq.(3.7)
We already know the expanded values of c110 through c121. Let us now find the partial derivative of each individual term in eq.(3.7), starting with w11.(max(c110,0))w110. There will be 2 cases depending on value of c110. We will not consider the case when c110=0.

Case 1: c110<0,

w11.(max(c110,0))w100=w11.0w100 (as,max(c110,0)=0)=0 ...eq.(3.7.1.1)

Case 2: c110>0,

w11.(max(c110,0))w100=w11.(i10.w111+i20.w101+i11.w110+i21.w100)w110 (as,max(c110,0)=c110, putting full value of c110)=w11.(i10.w111)w110+(i20.w101)w110+(i11.w110)w110+(i21.w100)w110=w11.0+0+i11+0=w11.i11 ...eq.(3.7.1.2)
Let us now find w01.(max(c120,0))w110, with the 2 cases. We will not consider the case when c120=0.

Case 1: c120<0,

w01.(max(c120,0))w110=w11.0w110 (as,max(c120,0)=0)=0 ...eq.(3.7.2.1)

Case 2: c120>0,

w01.(max(c120,0))w110=w01.(i20.w111+i30.w101+i21.w110+i31.w100)w110 (as,max(c120,0)=c120, putting full value of c120)=w01.(i20.w111)w110+(i30.w101)w110+(i21.w110)w110+(i31.w100)w110=w01.0+0+i21+0=w01.i21 ...eq.(3.7.2.2)
Let us now find w10.(max(c111,0))w110, with the 2 cases. We will not consider the case when c111=0.

Case 1: c111<0,

w10.(max(c111,0))w110=w10.0w110 (as,max(c111,0)=0)=0 ...eq.(3.7.3.1)

Case 2: c111>0,

w10.(max(c111,0))w100=w10.(i11.w111+i21.w101+i12.w110+i22.w100)w110 (as,max(c111,0)=c111, putting full value of c111)=w10.(i11.w111)w110+(i21.w101)w110+(i12.w110)w110+(i22.w100)w110=w10.0+0+i12+0=w10.i12 ...eq.(3.7.3.2)
Finally, let us find w00.(max(c121,0))w110, with the 2 cases. We will not consider the case when c121=0.

Case 1: c121<0,

w00.(max(c121,0))w110=w00.0w110 (as,max(c111,0)=0)=0 ...eq.(3.7.4.1)

Case 2: c121>0,

w00.(max(c121,0))w110=w00.(i21.w111+i31.w101+i22.w110+i32.w100)w110 (as,max(c121,0)=c121, putting full value of c121)=w00.(i21.w111)w110+(i31.w101)w110+(i22.w110)w110+(i32.w100)w110=w00.0+0+i22+0=w00.i22 ...eq.(3.7.4.2)
Putting the values of eqs.(3.7.x.x) in eq. (3.7),
c10w110=w11.(max(c110,0))w110+w01.(max(c120,0))w110+w10.(max(c111,0))w110+w00.(max(c121,0))w110 (from eq.(3.7))c10w110=w11.i11+w01.i21+w10.i12+w00.i22 (from eqs.(3.7.x.x)) ... eq.(3.8)
Note that eq.(3.8) assumes that c110,c120,c111,c121 are all >0. If any of c110,c120,c111,c121 are <0, then the corresponding multiple term (i21,i31,i22,i32) should be made 0.

Generic form of eq.(3.7),
c10w110=w11.(max(c110,0))w110+w01.(max(c120,0))w110+w10.(max(c111,0))w110+w00.(max(c121,0))w110 (from eq.(3.7))=w11.u110w110+w01.u120w110+w10.u111w110+w00.u121w110=w11.f(c110)w110+w01.f(c120)w110+w10.f(c111)w110+w00.f(c121)w110... eq.(3.9)where,f(c1xy) is the activation function (like ReLu etc.)
Going back to eq.(3.1), we can use the same derivation as we used in eq.(2.1.3) as the term Eu10 is the same.
Eu10=(trfc)22u10(from eq.(2.1.3))=12.(trfc)2u10=(trfc).wfc ... eq.(3.1.3)
Putting values of eqs.(3.1.3) and (3.9) in eq.(3.1),
Ew110=Eu10.u10w110=(trfc).wfc.u10w110 (putting value from eq.(3.1.3)) ... eq.(3.10)=(trfc).wfc.(w11.f(c110)w110+w01.f(c120)w110+w10.f(c111)w110+w00.f(c121)w110) ... eq.(3.11) (putting value from eq.(3.9))
We will now come up with a general form based on which we will derive Ew101 and Ew111.

General form:
Listing eqs.(2.11) and (3.11),
Ew100=(trfc).wfc.(w11.f(c110)w100+w01.f(c120)w100+w10.f(c111)w100+w00.f(u121)w100) (from eq.(2.11))Ew110=(trfc).wfc.(w11.f(c110)w110+w01.f(c120)w110+w10.f(c111)w110+w00.f(c121)w110) (from eq.(3.11))
From these equations, it is clear that the only term different between them is f(c1xy)w110. Hence a generic form of eqs.(2.11) and (3.11) can be derived if f(c1xy)w110 is made generic.

Therefore the generic form of eqs.(2.11) and eqs.(3.11) is,
Ew1xy=(trfc).wfc.(w11.f(c110)w1xy+w01.f(c120)w1xy+w10.f(c111)w1xy+w00.f(c121)w1xy) ... eq.(4.1)(where,f(c1xy) is the activation function (like ReLu etc.))
Let us look at eqs.(2.8) and (3.8) to come up with a generic form when the activation function is ReLu,
c10w100=w11.i21+w01.i31+w10.i22+w00.i32 (from eq.(2.8))c10w110=w11.i11+w01.i21+w10.i12+w00.i22 (from eq.(3.8))
Intuitively we can see that the downstream weights w11 through w00 are multiplied with input values that the upstream weights w100 and w101 are multiplied with.

Consider the following CNN diagram which highlights (in brown) all inputs values that will be multiplied with w100 as the 2x2 filter is slided over the 4x4 input with a stride of 1 as the convolution operation is performed. Remember that as shown in part-I, when doing the convolution operation, we have to invert the filter matrix before doing the multiplication (dot product).


The following diagram shows how the same input values (which are multiplied by w100) are multiplied by the downstream filter values to give us c10w100=w11.i21+w01.i31+w10.i22+w00.i32. The matching colors between each element of the downstream filter matrix and the input matrix show which elements are multiplied.


Now consider the following CNN diagram that highlights (in brown) all inputs values that will be multiplied with w110.


The following diagram shows how the same input values (which are multiplied by w110) are multiplied by the downstream filter values to give us c10w110=w11.i11+w01.i21+w10.i12+w00.i22.


Let us extrapolate the pattern to find c10w101

Now consider the following CNN diagram that highlights (in brown) all inputs values that will be multiplied with w101,


The following diagram shows how the same input values (which are multiplied by w101) are multiplied by the downstream filter values to give us c10w101=w11.i20+w01.i30+w10.i21+w00.i31.


Similarly for c10w111, consider the following CNN diagram that highlights (in brown) all inputs values that will be multiplied with w111,


The following diagram shows how the same input values (which are multiplied by w111) are multiplied by the downstream filter values to give us c10w111=w11.i10+w01.i20+w10.i11+w00.i21.


Listing values of c10w100,c10w101,c10w110,c10w111, together,

c10w100=w11.i21+w01.i31+w10.i22+w00.i32c10w110=w11.i11+w01.i21+w10.i12+w00.i22c10w101=w11.i20+w01.i30+w10.i21+w00.i31c10w111=w11.i10+w01.i20+w10.i11+w00.i21
An unpure generic form of the above equations is,
c10w1xy=a=n,b=na=0,b=0wab.ipq ... eq.(4.2)(Here, wab represents all filter values of the downstream layer, and ipq is an element of the input matrix that is composed of elements that will be multiplied by w1xy.n+1 is the size of downstream filter matrix)
Listing eqs.(2.10) and (3.10)
Ew100=(trfc).wfc.u10w100 (from (eq.(2.10))Ew110=(trfc).wfc.u10w110 (from eq.(3.10))
Now since the activation function is ReLu, u10=max(c10,0), and assuming c10>0, eqs.(2.10) and (3.10) can be re-written as,
Ew100=(trfc).wfc.c10w100 (from (eq.(2.10))Ew110=(trfc).wfc.c10w110 (from eq.(3.10))
Or a more generic form,
Ew1xy=(trfc).wfc.c10w1xy ... eq.(4.3)Ew1xy=(trfc).wfc.a=n,b=na=0,b=0wab.ipq (from eq.(4.2)) ... eq.(4.4)
After backpropagation, each weight w1xy will be updated as,
Δw1xy=η.Ew1xy=η.((trfc).wfc.(a=n,b=na=0,b=0wab.ipq)) (from eq.(4.4))=η.((trfc).wfc.(a=n,b=na=0,b=0wab.ipq))w1xy=w1xy+Δw1xy
Similar equation to update weights w00 through w11 is provided in part-II.

In next part, we look at a CNN with multiple convolutional and ReLu layers in parallel.

Love or hate this article ? Have other ideas ? Please leave a comment below !


Comments

  1. This comment has been removed by a blog administrator.

    ReplyDelete

Post a Comment

Popular posts from this blog

Part I: Backpropagation mechanics for a Convolutional Neural Network

Introducing Convolution Neural Networks with a simple architecture

Python: Sockets Programming - Non-blocking Client