Part IV: Backpropagation mechanics for a Convolutional Neural Network
In part-III of this series, we derived the weight update equation for the backpropagation operation of a simple Convolutional Neural Network (CNN) using 3 input matrices representing the red, green and blue channels of an image. In this article, we derive the backpropagation with 2 consecutive convolution and ReLu layers, and subsequently derive a general backpropagation equation for a deep (multi-layered) CNN. To keep our equations simple (and readable), we revert to only 1 input matrix. This CNN is similar to the one used in part-II, except that there are 2 convolution and ReLu layers.
The CNN we use is represented by:
In this CNN, there is one 4x4 input matrix, two 2x2 filter matrices (or kernels), two convolution layers, two ReLu layers, a single pooling layer (which applies the MaxPool function) and a single fully connected (FC) layer. The elements of the filter matrices are equivalent to the unit weights in a standard NN and will be updated during the backpropagation phase.
Assuming a stride of 1 with no padding, the size of the convolution layers is determined by the following equation:
$$N = \frac{W - F + 2P}{S} + 1$$
Here,
N: the dimension (rows and columns) of the new (convolution) square matrix
W: the dimension (rows and columns) of the input square matrix
F: the dimension (rows and columns) of the filter square matrix
P: the number of pixels (cells) of padding added
S: the stride, or the number of cells skipped each time the kernel is slid
Putting these values in the equation for the 1st convolution layer,
$$N = \frac{4 - 2 + 2 \cdot 0}{1} + 1 = 3$$
Hence the Convolution-1 convolution layer will have a matrix of size 3x3, which will serve as an input to the ReLu-1 layer of the same dimension. This 3x3 layer will in turn be the input matrix to the next downstream convolution, whose size is determined by the same equation. Hence for the 2nd convolution layer with a stride of 1,
$$N = \frac{3 - 2 + 2 \cdot 0}{1} + 1 = 2$$
Therefore, the second convolution layer will be of size 2x2, which is also the size of the downstream ReLu layer.
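The size equation above can be checked with a small sketch (Python here, purely illustrative; the function name is mine):

```python
def conv_output_size(w, f, p=0, s=1):
    """Output dimension N = (W - F + 2P) / S + 1 for a square input."""
    return (w - f + 2 * p) // s + 1

# Convolution-1: 4x4 input, 2x2 filter, no padding, stride 1 -> 3x3 output
print(conv_output_size(4, 2))
# Convolution-2: 3x3 ReLu-1 output as input, same 2x2 filter -> 2x2 output
print(conv_output_size(3, 2))
```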
The error E is defined as $E = \frac{(t - r_{fc})^2}{2}$, where t is the given (expected) output in the training data.
We need to find the change in E in relation to the changes in the weights of the FC layer, $w_{fc}$, as well as the weights of the filters, $w^1_{00}$ through $w^1_{11}$ and $w_{00}$ through $w_{11}$.
In part-II, we used the following CNN:
Downstream of the ReLu-1 layer, the architecture of this CNN is the same as the one we use in this article. Therefore the derivations of $\frac{\partial E}{\partial w_{fc}}, \frac{\partial E}{\partial w_{00}}, \frac{\partial E}{\partial w_{10}}, \frac{\partial E}{\partial w_{01}}$ and $\frac{\partial E}{\partial w_{11}}$ derived in part-II can be used here.
In part-II, we assumed that $u_{10}$ is the result of the pooling layer ($r_p = u_{10}$), based on which the following was derived:
$$\begin{aligned}
\frac{\partial E}{\partial w_{fc}} &= (t - r_{fc})\cdot(-r_p) &\text{... eq.(1.1)}\\
\frac{\partial E}{\partial w_{00}} &= (t - r_{fc})\cdot(-w_{fc})\cdot\frac{\partial u_{10}}{\partial w_{00}} &\text{... eq.(1.2)}\\
\frac{\partial E}{\partial w_{10}} &= (t - r_{fc})\cdot(-w_{fc})\cdot\frac{\partial u_{10}}{\partial w_{10}} &\text{... eq.(1.3)}\\
\frac{\partial E}{\partial w_{01}} &= (t - r_{fc})\cdot(-w_{fc})\cdot\frac{\partial u_{10}}{\partial w_{01}} &\text{... eq.(1.4)}\\
\frac{\partial E}{\partial w_{11}} &= (t - r_{fc})\cdot(-w_{fc})\cdot\frac{\partial u_{10}}{\partial w_{11}} &\text{... eq.(1.5)}
\end{aligned}$$
where,
$$\begin{aligned}
u_{10} &= \max(0, c_{10})\\
\text{and, } c_{10} &= u^1_{10}\cdot w_{11} + u^1_{20}\cdot w_{01} + u^1_{11}\cdot w_{10} + u^1_{21}\cdot w_{00}\\
\text{hence, } u_{10} &= \max(0,\; u^1_{10}\cdot w_{11} + u^1_{20}\cdot w_{01} + u^1_{11}\cdot w_{10} + u^1_{21}\cdot w_{00})
\end{aligned}$$
Notice that the values of $c_{10}$ in part-II and here are different, because the input matrix to the 2x2 convolution layer is not the same.
With the values of $\frac{\partial E}{\partial w_{fc}}, \frac{\partial E}{\partial w_{00}}, \frac{\partial E}{\partial w_{10}}, \frac{\partial E}{\partial w_{01}}$ and $\frac{\partial E}{\partial w_{11}}$ known from previous work in part-II, we now have to find the values of $\frac{\partial E}{\partial w^1_{00}}, \frac{\partial E}{\partial w^1_{10}}, \frac{\partial E}{\partial w^1_{01}}$ and $\frac{\partial E}{\partial w^1_{11}}$.
Since $u_{10}$ is the result of the pooling layer, E will only be impacted by values that directly affect the value of $u_{10}$. The following diagram highlights (in purple) all values in the CNN that affect the value of $u_{10}$.
Note that in this CNN, $c_{10} = u^1_{10}\cdot w_{11} + u^1_{20}\cdot w_{01} + u^1_{11}\cdot w_{10} + u^1_{21}\cdot w_{00}$
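The forward step that produces $u_{10}$ can be sketched directly from this definition. All numeric values below are illustrative assumptions, not values from the article:

```python
# Illustrative ReLu-1 outputs feeding c10 (assumed values).
u1_10, u1_20, u1_11, u1_21 = 0.8, 0.0, 0.3, 0.6
# Illustrative second-layer filter weights (assumed values).
w00, w01, w10, w11 = 0.1, 0.2, 0.3, 0.4

# c10 exactly as defined above (note the inverted pairing of the
# filter elements with the ReLu-1 outputs, as in the convolution):
c10 = u1_10 * w11 + u1_20 * w01 + u1_11 * w10 + u1_21 * w00
# ReLu-2 output, which the MaxPool layer then selects as r_p = u10:
u10 = max(0.0, c10)
print(c10, u10)
```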
Let us now compute $\frac{\partial E}{\partial w^1_{00}}$.
Applying the chain rule,
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Here, $z = u_{10}$, $x = E$, and $y = w^1_{00}$, therefore,
$$\frac{\partial E}{\partial w^1_{00}} = \frac{\partial E}{\partial u_{10}}\cdot\frac{\partial u_{10}}{\partial w^1_{00}} \quad\text{... eq.(2.1)}$$
Solving for the right term $\frac{\partial u_{10}}{\partial w^1_{00}}$ in eq.(2.1):
$$\frac{\partial u_{10}}{\partial w^1_{00}} = \frac{\partial \max(0, c_{10})}{\partial w^1_{00}} \quad\text{... eq.(2.2)}$$
Now,
$$\frac{d\max(0,x)}{dx} = \begin{cases}1, & \text{if } x > 0\\ 0, & \text{if } x < 0\\ \text{undetermined}, & \text{if } x = 0\end{cases} \quad\text{... eq.(2.4)}$$
Let us assume that $x \neq 0$. When $x = 0$ there are several ways of dealing with it, including using a slightly modified ReLu function, but that is outside the scope of this article. Let us now look at the two cases with $x = c_{10}$.
Case 1: x<0,
In eq.(2.2),
$$\frac{\partial \max(0, c_{10})}{\partial w^1_{00}} = \frac{\partial 0}{\partial w^1_{00}} \;(\text{from eq.(2.4)}) = 0 \quad\text{... eq.(2.5)}$$
Case 2: x>0,
In eq.(2.2),
$$\frac{\partial \max(0, c_{10})}{\partial w^1_{00}} = \frac{\partial c_{10}}{\partial w^1_{00}} \;(\text{from eq.(2.4)}) \quad\text{... eq.(2.6)}$$
Solving for eq.(2.6),
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{00}} &= \frac{\partial (u^1_{10}\cdot w_{11} + u^1_{20}\cdot w_{01} + u^1_{11}\cdot w_{10} + u^1_{21}\cdot w_{00})}{\partial w^1_{00}} \quad(\text{substituting the full value of } c_{10})\\
&= \frac{\partial (u^1_{10}\cdot w_{11})}{\partial w^1_{00}} + \frac{\partial (u^1_{20}\cdot w_{01})}{\partial w^1_{00}} + \frac{\partial (u^1_{11}\cdot w_{10})}{\partial w^1_{00}} + \frac{\partial (u^1_{21}\cdot w_{00})}{\partial w^1_{00}}\\
&= \frac{\partial (\max(c^1_{10},0)\cdot w_{11})}{\partial w^1_{00}} + \frac{\partial (\max(c^1_{20},0)\cdot w_{01})}{\partial w^1_{00}} + \frac{\partial (\max(c^1_{11},0)\cdot w_{10})}{\partial w^1_{00}} + \frac{\partial (\max(c^1_{21},0)\cdot w_{00})}{\partial w^1_{00}}\\
&= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}} \quad\text{... eq.(2.7)}
\end{aligned}$$
We will now have to expand the values of $c^1_{10}$ through $c^1_{21}$. These values are given as:
$$\begin{aligned}
c^1_{10} &= i_{10}\cdot w^1_{11} + i_{20}\cdot w^1_{01} + i_{11}\cdot w^1_{10} + i_{21}\cdot w^1_{00}\\
c^1_{20} &= i_{20}\cdot w^1_{11} + i_{30}\cdot w^1_{01} + i_{21}\cdot w^1_{10} + i_{31}\cdot w^1_{00}\\
c^1_{11} &= i_{11}\cdot w^1_{11} + i_{21}\cdot w^1_{01} + i_{12}\cdot w^1_{10} + i_{22}\cdot w^1_{00}\\
c^1_{21} &= i_{21}\cdot w^1_{11} + i_{31}\cdot w^1_{01} + i_{22}\cdot w^1_{10} + i_{32}\cdot w^1_{00}
\end{aligned}$$
Let us now find the partial derivative of each individual term in eq.(2.7), starting with $w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}}$. There will be 2 cases depending on the value of $c^1_{10}$. We will not consider the case when $c^1_{10} = 0$.
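These four expansions translate line-by-line into code. The input values $i_{xy}$ and first-layer weights below are illustrative assumptions:

```python
# Illustrative 4x4 input, keyed by the article's (x, y) subscripts (assumed values).
i = {(x, y): 0.1 * (x + 1) + 0.01 * y for x in range(4) for y in range(4)}
# Illustrative first-layer filter weights w1_xy (assumed values).
w1_00, w1_01, w1_10, w1_11 = 0.1, 0.2, 0.3, 0.4

# The four pre-activation values of ReLu-1 that feed c10, as listed above:
c1_10 = i[(1,0)]*w1_11 + i[(2,0)]*w1_01 + i[(1,1)]*w1_10 + i[(2,1)]*w1_00
c1_20 = i[(2,0)]*w1_11 + i[(3,0)]*w1_01 + i[(2,1)]*w1_10 + i[(3,1)]*w1_00
c1_11 = i[(1,1)]*w1_11 + i[(2,1)]*w1_01 + i[(1,2)]*w1_10 + i[(2,2)]*w1_00
c1_21 = i[(2,1)]*w1_11 + i[(3,1)]*w1_01 + i[(2,2)]*w1_10 + i[(3,2)]*w1_00
print(c1_10, c1_20, c1_11, c1_21)
```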
Case 1: $c^1_{10} < 0$,
$$w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}} = w_{11}\cdot\frac{\partial 0}{\partial w^1_{00}} \;(\text{as } \max(c^1_{10},0)=0) = 0 \quad\text{... eq.(2.7.1.1)}$$
Case 2: $c^1_{10} > 0$,
$$\begin{aligned}
w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}} &= w_{11}\cdot\frac{\partial (i_{10}\cdot w^1_{11} + i_{20}\cdot w^1_{01} + i_{11}\cdot w^1_{10} + i_{21}\cdot w^1_{00})}{\partial w^1_{00}} \quad(\text{as } \max(c^1_{10},0)=c^1_{10}\text{, putting the full value of } c^1_{10})\\
&= w_{11}\cdot\left(\frac{\partial (i_{10}\cdot w^1_{11})}{\partial w^1_{00}} + \frac{\partial (i_{20}\cdot w^1_{01})}{\partial w^1_{00}} + \frac{\partial (i_{11}\cdot w^1_{10})}{\partial w^1_{00}} + \frac{\partial (i_{21}\cdot w^1_{00})}{\partial w^1_{00}}\right)\\
&= w_{11}\cdot(0 + 0 + 0 + i_{21}) = w_{11}\cdot i_{21} \quad\text{... eq.(2.7.1.2)}
\end{aligned}$$
Let us now find $w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}}$, with the 2 cases. We will not consider the case when $c^1_{20} = 0$.
Case 1: $c^1_{20} < 0$,
$$w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}} = w_{01}\cdot\frac{\partial 0}{\partial w^1_{00}} \;(\text{as } \max(c^1_{20},0)=0) = 0 \quad\text{... eq.(2.7.2.1)}$$
Case 2: $c^1_{20} > 0$,
$$\begin{aligned}
w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}} &= w_{01}\cdot\frac{\partial (i_{20}\cdot w^1_{11} + i_{30}\cdot w^1_{01} + i_{21}\cdot w^1_{10} + i_{31}\cdot w^1_{00})}{\partial w^1_{00}} \quad(\text{as } \max(c^1_{20},0)=c^1_{20}\text{, putting the full value of } c^1_{20})\\
&= w_{01}\cdot\left(\frac{\partial (i_{20}\cdot w^1_{11})}{\partial w^1_{00}} + \frac{\partial (i_{30}\cdot w^1_{01})}{\partial w^1_{00}} + \frac{\partial (i_{21}\cdot w^1_{10})}{\partial w^1_{00}} + \frac{\partial (i_{31}\cdot w^1_{00})}{\partial w^1_{00}}\right)\\
&= w_{01}\cdot(0 + 0 + 0 + i_{31}) = w_{01}\cdot i_{31} \quad\text{... eq.(2.7.2.2)}
\end{aligned}$$
Let us now find $w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}}$, with the 2 cases. We will not consider the case when $c^1_{11} = 0$.
Case 1: $c^1_{11} < 0$,
$$w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}} = w_{10}\cdot\frac{\partial 0}{\partial w^1_{00}} \;(\text{as } \max(c^1_{11},0)=0) = 0 \quad\text{... eq.(2.7.3.1)}$$
Case 2: $c^1_{11} > 0$,
$$\begin{aligned}
w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}} &= w_{10}\cdot\frac{\partial (i_{11}\cdot w^1_{11} + i_{21}\cdot w^1_{01} + i_{12}\cdot w^1_{10} + i_{22}\cdot w^1_{00})}{\partial w^1_{00}} \quad(\text{as } \max(c^1_{11},0)=c^1_{11}\text{, putting the full value of } c^1_{11})\\
&= w_{10}\cdot\left(\frac{\partial (i_{11}\cdot w^1_{11})}{\partial w^1_{00}} + \frac{\partial (i_{21}\cdot w^1_{01})}{\partial w^1_{00}} + \frac{\partial (i_{12}\cdot w^1_{10})}{\partial w^1_{00}} + \frac{\partial (i_{22}\cdot w^1_{00})}{\partial w^1_{00}}\right)\\
&= w_{10}\cdot(0 + 0 + 0 + i_{22}) = w_{10}\cdot i_{22} \quad\text{... eq.(2.7.3.2)}
\end{aligned}$$
Finally, let us find $w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}}$, with the 2 cases. We will not consider the case when $c^1_{21} = 0$.
Case 1: $c^1_{21} < 0$,
$$w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}} = w_{00}\cdot\frac{\partial 0}{\partial w^1_{00}} \;(\text{as } \max(c^1_{21},0)=0) = 0 \quad\text{... eq.(2.7.4.1)}$$
Case 2: $c^1_{21} > 0$,
$$\begin{aligned}
w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}} &= w_{00}\cdot\frac{\partial (i_{21}\cdot w^1_{11} + i_{31}\cdot w^1_{01} + i_{22}\cdot w^1_{10} + i_{32}\cdot w^1_{00})}{\partial w^1_{00}} \quad(\text{as } \max(c^1_{21},0)=c^1_{21}\text{, putting the full value of } c^1_{21})\\
&= w_{00}\cdot\left(\frac{\partial (i_{21}\cdot w^1_{11})}{\partial w^1_{00}} + \frac{\partial (i_{31}\cdot w^1_{01})}{\partial w^1_{00}} + \frac{\partial (i_{22}\cdot w^1_{10})}{\partial w^1_{00}} + \frac{\partial (i_{32}\cdot w^1_{00})}{\partial w^1_{00}}\right)\\
&= w_{00}\cdot(0 + 0 + 0 + i_{32}) = w_{00}\cdot i_{32} \quad\text{... eq.(2.7.4.2)}
\end{aligned}$$
Putting the values of eqs.(2.7.x.x) in eq.(2.7),
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{00}} &= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}} \quad\text{(from eq.(2.7))}\\
\frac{\partial c_{10}}{\partial w^1_{00}} &= w_{11}\cdot i_{21} + w_{01}\cdot i_{31} + w_{10}\cdot i_{22} + w_{00}\cdot i_{32} \quad\text{(from eqs.(2.7.x.x))} \quad\text{... eq.(2.8)}
\end{aligned}$$
Note that eq.(2.8) assumes that $c^1_{10}, c^1_{20}, c^1_{11}, c^1_{21}$ are all > 0. If any of them is < 0, then the corresponding multiplied input term ($i_{21}, i_{31}, i_{22}, i_{32}$ respectively) should be set to 0.
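Eq.(2.8) together with this note can be sketched as a gated sum, where each input term is kept only when its ReLu-1 pre-activation is positive. All numbers are illustrative assumptions:

```python
# Illustrative downstream filter weights and input values (assumed).
w00, w01, w10, w11 = 0.1, 0.2, 0.3, 0.4
i21, i31, i22, i32 = 0.7, 0.6, 0.8, 0.9
# Illustrative ReLu-1 pre-activations; c1_20 < 0 will zero its term.
c1_10, c1_20, c1_11, c1_21 = 0.28, -0.10, 0.50, 0.20

def relu_gate(c):
    # Derivative factor of ReLu: 1 for a positive pre-activation, else 0.
    return 1.0 if c > 0 else 0.0

# eq.(2.8), with each input term gated by the sign of its c1 value:
dc10_dw1_00 = (w11 * i21 * relu_gate(c1_10)
               + w01 * i31 * relu_gate(c1_20)
               + w10 * i22 * relu_gate(c1_11)
               + w00 * i32 * relu_gate(c1_21))
print(dc10_dw1_00)
```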
Generic form of eq.(2.7):
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{00}} &= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{00}} \quad\text{(from eq.(2.7))}\\
&= w_{11}\cdot\frac{\partial u^1_{10}}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial u^1_{20}}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial u^1_{11}}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial u^1_{21}}{\partial w^1_{00}}\\
&= w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{00}} \quad\text{... eq.(2.9)}
\end{aligned}$$
where $f(c^1_{xy})$ is the activation function (like ReLu etc.)
Going back to eq.(2.1),
$$\frac{\partial E}{\partial w^1_{00}} = \frac{\partial E}{\partial u_{10}}\cdot\frac{\partial u_{10}}{\partial w^1_{00}} \quad\text{... eq.(2.1)}$$
Let us now solve for the left term of eq.(2.1), which is $\frac{\partial E}{\partial u_{10}}$:
$$\frac{\partial E}{\partial u_{10}} = \frac{\partial \frac{(t-r_{fc})^2}{2}}{\partial u_{10}} \;(\text{putting the value of } E) = \frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial u_{10}}$$
Applying the chain rule to the above equation:
Here, $z = t - r_{fc}$, $x = z^2$, and $y = u_{10}$, therefore,
$$\begin{aligned}
\frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial u_{10}} &= \frac{1}{2}\cdot\frac{\partial x}{\partial y} = \frac{1}{2}\cdot\frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y} = \frac{1}{2}\cdot\frac{\partial z^2}{\partial z}\cdot\frac{\partial z}{\partial y}\\
&= \frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial (t-r_{fc})}\cdot\frac{\partial (t-r_{fc})}{\partial u_{10}}\\
&= \frac{1}{2}\cdot 2(t-r_{fc})\cdot\frac{\partial (t-r_{fc})}{\partial u_{10}} \quad\text{... eq.(2.1.1)}
\end{aligned}$$
Further,
$$\begin{aligned}
\frac{\partial (t-r_{fc})}{\partial u_{10}} &= \frac{\partial (t - w_{fc}\cdot r_p)}{\partial u_{10}} = \frac{\partial t}{\partial u_{10}} - \frac{\partial (w_{fc}\cdot r_p)}{\partial u_{10}} = \frac{\partial t}{\partial u_{10}} - w_{fc}\cdot\frac{\partial r_p}{\partial u_{10}}\\
&= \frac{\partial t}{\partial u_{10}} - w_{fc}\cdot\frac{\partial u_{10}}{\partial u_{10}} \quad(\text{as } r_p = u_{10})\\
&= 0 - w_{fc} \quad\left(\text{as } \tfrac{\partial C}{\partial x} = 0 \text{ and } \tfrac{\partial x}{\partial x} = 1\text{, where } C \text{ is a constant}\right) \quad\text{... eq.(2.1.2)}
\end{aligned}$$
Applying eq.(2.1.2) in eq.(2.1.1),
$$\frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial u_{10}} = \frac{1}{2}\cdot 2(t-r_{fc})\cdot\frac{\partial (t-r_{fc})}{\partial u_{10}} = \frac{1}{2}\cdot 2(t-r_{fc})\cdot(0 - w_{fc}) = (t-r_{fc})\cdot(-w_{fc}) = -(t-r_{fc})\cdot w_{fc}$$
Therefore,
$$\frac{\partial E}{\partial u_{10}} = \frac{\partial \frac{(t-r_{fc})^2}{2}}{\partial u_{10}} = \frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial u_{10}} = -(t-r_{fc})\cdot w_{fc} \quad\text{... eq.(2.1.3)}$$
Putting values of eqs.(2.1.3) and (2.9) in eq.(2.1),
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{00}} &= \frac{\partial E}{\partial u_{10}}\cdot\frac{\partial u_{10}}{\partial w^1_{00}} \quad\text{(from eq.(2.1))}\\
&= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial u_{10}}{\partial w^1_{00}} \quad\text{(putting the value from eq.(2.1.3))} \quad\text{... eq.(2.10)}\\
&= -(t-r_{fc})\cdot w_{fc}\cdot\left(w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{00}}\right) \quad\text{... eq.(2.11)}
\end{aligned}$$
(as $\frac{\partial u_{10}}{\partial w^1_{00}} = \frac{\partial c_{10}}{\partial w^1_{00}}$ if $c_{10} > 0$, and putting the value of $\frac{\partial c_{10}}{\partial w^1_{00}}$ from eq.(2.9))
Let us now compute $\frac{\partial E}{\partial w^1_{10}}$, and after that we will come up with a generic form to derive $\frac{\partial E}{\partial w^1_{01}}$ and $\frac{\partial E}{\partial w^1_{11}}$.
Applying the chain rule,
$$\frac{\partial x}{\partial y} = \frac{\partial x}{\partial z}\cdot\frac{\partial z}{\partial y}$$
Here, $z = u_{10}$, $x = E$, and $y = w^1_{10}$, therefore,
$$\frac{\partial E}{\partial w^1_{10}} = \frac{\partial E}{\partial u_{10}}\cdot\frac{\partial u_{10}}{\partial w^1_{10}} \quad\text{... eq.(3.1)}$$
Solving for the right term $\frac{\partial u_{10}}{\partial w^1_{10}}$ in eq.(3.1):
$$\frac{\partial u_{10}}{\partial w^1_{10}} = \frac{\partial \max(0, c_{10})}{\partial w^1_{10}} \quad\text{... eq.(3.2)}$$
Now,
$$\frac{d\max(0,x)}{dx} = \begin{cases}1, & \text{if } x > 0\\ 0, & \text{if } x < 0\\ \text{undetermined}, & \text{if } x = 0\end{cases} \quad\text{... eq.(3.4)}$$
Let us assume that $x \neq 0$. When $x = 0$ there are several ways of dealing with it, including using a slightly modified ReLu function, but that is outside the scope of this article. Let us now look at the two cases with $x = c_{10}$.
Case 1: x<0,
In eq.(3.2),
$$\frac{\partial \max(0, c_{10})}{\partial w^1_{10}} = \frac{\partial 0}{\partial w^1_{10}} \;(\text{from eq.(3.4)}) = 0 \quad\text{... eq.(3.5)}$$
Case 2: x>0,
In eq.(3.2),
$$\frac{\partial \max(0, c_{10})}{\partial w^1_{10}} = \frac{\partial c_{10}}{\partial w^1_{10}} \;(\text{from eq.(3.4)}) \quad\text{... eq.(3.6)}$$
Solving for eq.(3.6),
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{10}} &= \frac{\partial (u^1_{10}\cdot w_{11} + u^1_{20}\cdot w_{01} + u^1_{11}\cdot w_{10} + u^1_{21}\cdot w_{00})}{\partial w^1_{10}} \quad(\text{substituting the full value of } c_{10})\\
&= \frac{\partial (u^1_{10}\cdot w_{11})}{\partial w^1_{10}} + \frac{\partial (u^1_{20}\cdot w_{01})}{\partial w^1_{10}} + \frac{\partial (u^1_{11}\cdot w_{10})}{\partial w^1_{10}} + \frac{\partial (u^1_{21}\cdot w_{00})}{\partial w^1_{10}}\\
&= \frac{\partial (\max(c^1_{10},0)\cdot w_{11})}{\partial w^1_{10}} + \frac{\partial (\max(c^1_{20},0)\cdot w_{01})}{\partial w^1_{10}} + \frac{\partial (\max(c^1_{11},0)\cdot w_{10})}{\partial w^1_{10}} + \frac{\partial (\max(c^1_{21},0)\cdot w_{00})}{\partial w^1_{10}}\\
&= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}} \quad\text{... eq.(3.7)}
\end{aligned}$$
We already know the expanded values of $c^1_{10}$ through $c^1_{21}$. Let us now find the partial derivative of each individual term in eq.(3.7), starting with $w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}}$. There will be 2 cases depending on the value of $c^1_{10}$. We will not consider the case when $c^1_{10} = 0$.
Case 1: $c^1_{10} < 0$,
$$w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}} = w_{11}\cdot\frac{\partial 0}{\partial w^1_{10}} \;(\text{as } \max(c^1_{10},0)=0) = 0 \quad\text{... eq.(3.7.1.1)}$$
Case 2: $c^1_{10} > 0$,
$$\begin{aligned}
w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}} &= w_{11}\cdot\frac{\partial (i_{10}\cdot w^1_{11} + i_{20}\cdot w^1_{01} + i_{11}\cdot w^1_{10} + i_{21}\cdot w^1_{00})}{\partial w^1_{10}} \quad(\text{as } \max(c^1_{10},0)=c^1_{10}\text{, putting the full value of } c^1_{10})\\
&= w_{11}\cdot\left(\frac{\partial (i_{10}\cdot w^1_{11})}{\partial w^1_{10}} + \frac{\partial (i_{20}\cdot w^1_{01})}{\partial w^1_{10}} + \frac{\partial (i_{11}\cdot w^1_{10})}{\partial w^1_{10}} + \frac{\partial (i_{21}\cdot w^1_{00})}{\partial w^1_{10}}\right)\\
&= w_{11}\cdot(0 + 0 + i_{11} + 0) = w_{11}\cdot i_{11} \quad\text{... eq.(3.7.1.2)}
\end{aligned}$$
Let us now find $w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}}$, with the 2 cases. We will not consider the case when $c^1_{20} = 0$.
Case 1: $c^1_{20} < 0$,
$$w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}} = w_{01}\cdot\frac{\partial 0}{\partial w^1_{10}} \;(\text{as } \max(c^1_{20},0)=0) = 0 \quad\text{... eq.(3.7.2.1)}$$
Case 2: $c^1_{20} > 0$,
$$\begin{aligned}
w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}} &= w_{01}\cdot\frac{\partial (i_{20}\cdot w^1_{11} + i_{30}\cdot w^1_{01} + i_{21}\cdot w^1_{10} + i_{31}\cdot w^1_{00})}{\partial w^1_{10}} \quad(\text{as } \max(c^1_{20},0)=c^1_{20}\text{, putting the full value of } c^1_{20})\\
&= w_{01}\cdot\left(\frac{\partial (i_{20}\cdot w^1_{11})}{\partial w^1_{10}} + \frac{\partial (i_{30}\cdot w^1_{01})}{\partial w^1_{10}} + \frac{\partial (i_{21}\cdot w^1_{10})}{\partial w^1_{10}} + \frac{\partial (i_{31}\cdot w^1_{00})}{\partial w^1_{10}}\right)\\
&= w_{01}\cdot(0 + 0 + i_{21} + 0) = w_{01}\cdot i_{21} \quad\text{... eq.(3.7.2.2)}
\end{aligned}$$
Let us now find $w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}}$, with the 2 cases. We will not consider the case when $c^1_{11} = 0$.
Case 1: $c^1_{11} < 0$,
$$w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}} = w_{10}\cdot\frac{\partial 0}{\partial w^1_{10}} \;(\text{as } \max(c^1_{11},0)=0) = 0 \quad\text{... eq.(3.7.3.1)}$$
Case 2: $c^1_{11} > 0$,
$$\begin{aligned}
w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}} &= w_{10}\cdot\frac{\partial (i_{11}\cdot w^1_{11} + i_{21}\cdot w^1_{01} + i_{12}\cdot w^1_{10} + i_{22}\cdot w^1_{00})}{\partial w^1_{10}} \quad(\text{as } \max(c^1_{11},0)=c^1_{11}\text{, putting the full value of } c^1_{11})\\
&= w_{10}\cdot\left(\frac{\partial (i_{11}\cdot w^1_{11})}{\partial w^1_{10}} + \frac{\partial (i_{21}\cdot w^1_{01})}{\partial w^1_{10}} + \frac{\partial (i_{12}\cdot w^1_{10})}{\partial w^1_{10}} + \frac{\partial (i_{22}\cdot w^1_{00})}{\partial w^1_{10}}\right)\\
&= w_{10}\cdot(0 + 0 + i_{12} + 0) = w_{10}\cdot i_{12} \quad\text{... eq.(3.7.3.2)}
\end{aligned}$$
Finally, let us find $w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}}$, with the 2 cases. We will not consider the case when $c^1_{21} = 0$.
Case 1: $c^1_{21} < 0$,
$$w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}} = w_{00}\cdot\frac{\partial 0}{\partial w^1_{10}} \;(\text{as } \max(c^1_{21},0)=0) = 0 \quad\text{... eq.(3.7.4.1)}$$
Case 2: $c^1_{21} > 0$,
$$\begin{aligned}
w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}} &= w_{00}\cdot\frac{\partial (i_{21}\cdot w^1_{11} + i_{31}\cdot w^1_{01} + i_{22}\cdot w^1_{10} + i_{32}\cdot w^1_{00})}{\partial w^1_{10}} \quad(\text{as } \max(c^1_{21},0)=c^1_{21}\text{, putting the full value of } c^1_{21})\\
&= w_{00}\cdot\left(\frac{\partial (i_{21}\cdot w^1_{11})}{\partial w^1_{10}} + \frac{\partial (i_{31}\cdot w^1_{01})}{\partial w^1_{10}} + \frac{\partial (i_{22}\cdot w^1_{10})}{\partial w^1_{10}} + \frac{\partial (i_{32}\cdot w^1_{00})}{\partial w^1_{10}}\right)\\
&= w_{00}\cdot(0 + 0 + i_{22} + 0) = w_{00}\cdot i_{22} \quad\text{... eq.(3.7.4.2)}
\end{aligned}$$
Putting the values of eqs.(3.7.x.x) in eq.(3.7),
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{10}} &= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}} \quad\text{(from eq.(3.7))}\\
\frac{\partial c_{10}}{\partial w^1_{10}} &= w_{11}\cdot i_{11} + w_{01}\cdot i_{21} + w_{10}\cdot i_{12} + w_{00}\cdot i_{22} \quad\text{(from eqs.(3.7.x.x))} \quad\text{... eq.(3.8)}
\end{aligned}$$
Note that eq.(3.8) assumes that $c^1_{10}, c^1_{20}, c^1_{11}, c^1_{21}$ are all > 0. If any of them is < 0, then the corresponding multiplied input term ($i_{11}, i_{21}, i_{12}, i_{22}$ respectively) should be set to 0.
Generic form of eq.(3.7):
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{10}} &= w_{11}\cdot\frac{\partial \max(c^1_{10},0)}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial \max(c^1_{20},0)}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial \max(c^1_{11},0)}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial \max(c^1_{21},0)}{\partial w^1_{10}} \quad\text{(from eq.(3.7))}\\
&= w_{11}\cdot\frac{\partial u^1_{10}}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial u^1_{20}}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial u^1_{11}}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial u^1_{21}}{\partial w^1_{10}}\\
&= w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{10}} \quad\text{... eq.(3.9)}
\end{aligned}$$
where $f(c^1_{xy})$ is the activation function (like ReLu etc.)
Going back to eq.(3.1), we can use the same derivation as we used in eq.(2.1.3), as the term $\frac{\partial E}{\partial u_{10}}$ is the same.
$$\frac{\partial E}{\partial u_{10}} = \frac{\partial \frac{(t-r_{fc})^2}{2}}{\partial u_{10}} \;\text{(from eq.(2.1.3))} = \frac{1}{2}\cdot\frac{\partial (t-r_{fc})^2}{\partial u_{10}} = -(t-r_{fc})\cdot w_{fc} \quad\text{... eq.(3.1.3)}$$
Putting the values of eqs.(3.1.3) and (3.9) in eq.(3.1),
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{10}} &= \frac{\partial E}{\partial u_{10}}\cdot\frac{\partial u_{10}}{\partial w^1_{10}}\\
&= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial u_{10}}{\partial w^1_{10}} \quad\text{(putting the value from eq.(3.1.3))} \quad\text{... eq.(3.10)}\\
&= -(t-r_{fc})\cdot w_{fc}\cdot\left(w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{10}}\right) \quad\text{... eq.(3.11)} \quad\text{(putting the value from eq.(3.9))}
\end{aligned}$$
We will now come up with a general form, based on which we will derive $\frac{\partial E}{\partial w^1_{01}}$ and $\frac{\partial E}{\partial w^1_{11}}$.
General form:
Listing eqs.(2.11) and (3.11),
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{00}} &= -(t-r_{fc})\cdot w_{fc}\cdot\left(w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{00}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{00}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{00}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{00}}\right) \quad\text{(from eq.(2.11))}\\
\frac{\partial E}{\partial w^1_{10}} &= -(t-r_{fc})\cdot w_{fc}\cdot\left(w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{10}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{10}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{10}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{10}}\right) \quad\text{(from eq.(3.11))}
\end{aligned}$$
From these equations, it is clear that the only difference between them is the weight with respect to which the partial derivatives $\frac{\partial f(c^1_{xy})}{\partial w^1_{..}}$ are taken. Hence a generic form of eqs.(2.11) and (3.11) can be derived if that weight is made generic.
Therefore the generic form of eqs.(2.11) and (3.11) is,
$$\frac{\partial E}{\partial w^1_{xy}} = -(t-r_{fc})\cdot w_{fc}\cdot\left(w_{11}\cdot\frac{\partial f(c^1_{10})}{\partial w^1_{xy}} + w_{01}\cdot\frac{\partial f(c^1_{20})}{\partial w^1_{xy}} + w_{10}\cdot\frac{\partial f(c^1_{11})}{\partial w^1_{xy}} + w_{00}\cdot\frac{\partial f(c^1_{21})}{\partial w^1_{xy}}\right) \quad\text{... eq.(4.1)}$$
(where $f(c^1_{xy})$ is the activation function (like ReLu etc.))
Let us look at eqs.(2.8) and (3.8) to come up with a generic form when the activation function is ReLu,
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{00}} &= w_{11}\cdot i_{21} + w_{01}\cdot i_{31} + w_{10}\cdot i_{22} + w_{00}\cdot i_{32} \quad\text{(from eq.(2.8))}\\
\frac{\partial c_{10}}{\partial w^1_{10}} &= w_{11}\cdot i_{11} + w_{01}\cdot i_{21} + w_{10}\cdot i_{12} + w_{00}\cdot i_{22} \quad\text{(from eq.(3.8))}
\end{aligned}$$
Intuitively we can see that the downstream weights $w_{11}$ through $w_{00}$ are multiplied with the input values that the upstream weights $w^1_{00}$ and $w^1_{10}$ are multiplied with.
Consider the following CNN diagram, which highlights (in brown) all input values that will be multiplied with $w^1_{00}$ as the 2x2 filter is slid over the 4x4 input with a stride of 1 as the convolution operation is performed. Remember that, as shown in part-I, when doing the convolution operation we have to invert the filter matrix before doing the multiplication (dot product).
The following diagram shows how the same input values (which are multiplied by $w^1_{00}$) are multiplied by the downstream filter values to give us $\frac{\partial c_{10}}{\partial w^1_{00}} = w_{11}\cdot i_{21} + w_{01}\cdot i_{31} + w_{10}\cdot i_{22} + w_{00}\cdot i_{32}$. The matching colors between each element of the downstream filter matrix and the input matrix show which elements are multiplied.
Now consider the following CNN diagram, which highlights (in brown) all input values that will be multiplied with $w^1_{10}$.
The following diagram shows how the same input values (which are multiplied by $w^1_{10}$) are multiplied by the downstream filter values to give us $\frac{\partial c_{10}}{\partial w^1_{10}} = w_{11}\cdot i_{11} + w_{01}\cdot i_{21} + w_{10}\cdot i_{12} + w_{00}\cdot i_{22}$.
Let us extrapolate the pattern to find $\frac{\partial c_{10}}{\partial w^1_{01}}$.
Now consider the following CNN diagram, which highlights (in brown) all input values that will be multiplied with $w^1_{01}$.
The following diagram shows how the same input values (which are multiplied by $w^1_{01}$) are multiplied by the downstream filter values to give us $\frac{\partial c_{10}}{\partial w^1_{01}} = w_{11}\cdot i_{20} + w_{01}\cdot i_{30} + w_{10}\cdot i_{21} + w_{00}\cdot i_{31}$.
Similarly for $\frac{\partial c_{10}}{\partial w^1_{11}}$, consider the following CNN diagram, which highlights (in brown) all input values that will be multiplied with $w^1_{11}$.
The following diagram shows how the same input values (which are multiplied by $w^1_{11}$) are multiplied by the downstream filter values to give us $\frac{\partial c_{10}}{\partial w^1_{11}} = w_{11}\cdot i_{10} + w_{01}\cdot i_{20} + w_{10}\cdot i_{11} + w_{00}\cdot i_{21}$.
Listing the values of $\frac{\partial c_{10}}{\partial w^1_{00}}, \frac{\partial c_{10}}{\partial w^1_{10}}, \frac{\partial c_{10}}{\partial w^1_{01}}, \frac{\partial c_{10}}{\partial w^1_{11}}$ together,
$$\begin{aligned}
\frac{\partial c_{10}}{\partial w^1_{00}} &= w_{11}\cdot i_{21} + w_{01}\cdot i_{31} + w_{10}\cdot i_{22} + w_{00}\cdot i_{32}\\
\frac{\partial c_{10}}{\partial w^1_{10}} &= w_{11}\cdot i_{11} + w_{01}\cdot i_{21} + w_{10}\cdot i_{12} + w_{00}\cdot i_{22}\\
\frac{\partial c_{10}}{\partial w^1_{01}} &= w_{11}\cdot i_{20} + w_{01}\cdot i_{30} + w_{10}\cdot i_{21} + w_{00}\cdot i_{31}\\
\frac{\partial c_{10}}{\partial w^1_{11}} &= w_{11}\cdot i_{10} + w_{01}\cdot i_{20} + w_{10}\cdot i_{11} + w_{00}\cdot i_{21}
\end{aligned}$$
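The four listings share one sliding pattern: for $w^1_{xy}$, the downstream weights $w_{11}, w_{01}, w_{10}, w_{00}$ pair with the inputs $i_{2-x,1-y}, i_{3-x,1-y}, i_{2-x,2-y}, i_{3-x,2-y}$. A sketch of that pattern follows; the index arithmetic is my reading of the four equations, all ReLu-1 pre-activations are assumed positive, and the numeric values are illustrative:

```python
# Illustrative downstream 2x2 filter w_ab and 4x4 input i_xy (assumed values),
# both keyed by the article's subscripts (first index = x, second = y).
w = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
i = {(x, y): x + 0.1 * y for x in range(4) for y in range(4)}

def dc10_dw1(x, y):
    # Partial derivative of c10 w.r.t. the first-layer weight w1_xy,
    # assuming every ReLu-1 pre-activation c1 is positive.
    return (w[(1, 1)] * i[(2 - x, 1 - y)]
            + w[(0, 1)] * i[(3 - x, 1 - y)]
            + w[(1, 0)] * i[(2 - x, 2 - y)]
            + w[(0, 0)] * i[(3 - x, 2 - y)])

for (x, y) in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((x, y), dc10_dw1(x, y))
```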
A loosely generic form of the above equations is,
$$\frac{\partial c_{10}}{\partial w^1_{xy}} = \sum_{a=0,\,b=0}^{a=n,\,b=n} w_{ab}\cdot i_{pq} \quad\text{... eq.(4.2)}$$
(Here, $w_{ab}$ ranges over all filter values of the downstream layer, and $i_{pq}$ is the corresponding element, among those multiplied by $w^1_{xy}$, of the input matrix; $n+1$ is the size of the downstream filter matrix.)
Listing eqs.(2.10) and (3.10),
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{00}} &= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial u_{10}}{\partial w^1_{00}} \quad\text{(from eq.(2.10))}\\
\frac{\partial E}{\partial w^1_{10}} &= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial u_{10}}{\partial w^1_{10}} \quad\text{(from eq.(3.10))}
\end{aligned}$$
Now since the activation function is ReLu, $u_{10} = \max(c_{10}, 0)$, and assuming $c_{10} > 0$, eqs.(2.10) and (3.10) can be re-written as,
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{00}} &= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial c_{10}}{\partial w^1_{00}} \quad\text{(from eq.(2.10))}\\
\frac{\partial E}{\partial w^1_{10}} &= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial c_{10}}{\partial w^1_{10}} \quad\text{(from eq.(3.10))}
\end{aligned}$$
Or in a more generic form,
$$\begin{aligned}
\frac{\partial E}{\partial w^1_{xy}} &= -(t-r_{fc})\cdot w_{fc}\cdot\frac{\partial c_{10}}{\partial w^1_{xy}} \quad\text{... eq.(4.3)}\\
\frac{\partial E}{\partial w^1_{xy}} &= -(t-r_{fc})\cdot w_{fc}\cdot\sum_{a=0,\,b=0}^{a=n,\,b=n} w_{ab}\cdot i_{pq} \quad\text{(from eq.(4.2))} \quad\text{... eq.(4.4)}
\end{aligned}$$
After backpropagation, each weight $w^1_{xy}$ will be updated as,
$$\begin{aligned}
\Delta w^1_{xy} &= -\eta\cdot\frac{\partial E}{\partial w^1_{xy}}\\
&= -\eta\cdot\left(-(t-r_{fc})\cdot w_{fc}\cdot\left(\sum_{a=0,\,b=0}^{a=n,\,b=n} w_{ab}\cdot i_{pq}\right)\right) \quad\text{(from eq.(4.4))}\\
&= \eta\cdot(t-r_{fc})\cdot w_{fc}\cdot\left(\sum_{a=0,\,b=0}^{a=n,\,b=n} w_{ab}\cdot i_{pq}\right)\\
\Rightarrow w^1_{xy} &= w^1_{xy} + \Delta w^1_{xy}
\end{aligned}$$
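One weight update per the equations above might look like this; the learning rate and every other numeric value are illustrative assumptions:

```python
eta = 0.01            # learning rate (assumed)
t, r_fc = 1.0, 0.6    # target and FC-layer output (assumed)
w_fc = 0.5            # FC weight (assumed)
grad_sum = 0.61       # the summation term sum(w_ab * i_pq) for this w1_xy (assumed)

w1_xy = 0.25                                 # current weight value (assumed)
dE_dw1_xy = -(t - r_fc) * w_fc * grad_sum    # eq.(4.4)
delta = -eta * dE_dw1_xy                     # delta_w1_xy = -eta * dE/dw1_xy
w1_xy = w1_xy + delta                        # the update step
print(w1_xy)
```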
A similar equation to update the weights $w_{00}$ through $w_{11}$ is provided in part-II.
In the next part, we look at a CNN with multiple convolution and ReLu layers in parallel.
Love or hate this article ? Have other ideas ? Please leave a comment below !