Part III: Backpropagation mechanics for a Convolutional Neural Network
In part-II of this article, we derived the weight update equation for the backpropagation operation of a simple Convolutional Neural Network (CNN). The input for the CNN considered in part-II is a grayscale image, hence, the input is in the form of a single 4x4 matrix. CNNs used in practice however, use color images where each of the Red, Green and Blue (RGB) color spectrums serve as input. Hence, a CNN with the inputs color coded with their respective spectrums look like:
In this article we expand upon the original CNN where the input is represented by 3, 4x4 matrices each presenting the pixel intensities for the RGB color spectrum. Hence is we represent each of the pixels with the CNN given is represented by:
In this CNN, there are 3, 4x4 input matrices, one 2x2 filter matrix (also known as kernel), a single convolution layer with 1 unit, a single ReLu layer, a single pooling layer (which applies the MaxPool function) and a single fully connected (FC) layer. The elements of the filter matrix are equivalent to the unit weights in a standard NN and will be updated during the backpropagation phase.
Assuming a stride of 2 with no padding, the size of the convolution layer is determined by the following equation:
N=(W−F+2.P)S+1
Here,
N: is the dimension (rows and columns) of the new(convolution) square matrix
W: is the dimension (rows and columns) of the input square matrix
F: is the dimension (rows and columns) of the filter square matrix
P: is the number of pixels(cells) of padding added
S: is the stride, or the number of cells skipped each time the kernel is slided
Putting these values in the equation,
N=(4−2+2.0)2+1=2
Hence the convolution layer will have a matrix of size 2x2.
It is interesting to note that downstream of the convolutional layer, the CNN has the same architecture as the one we used in part-II. Hence, the backpropagation equations will be same, except values of c00,c10,c10,c11 will be different, as they now include the values of all the 3 matrices. These values are given below:
c00=(r00.w11+r10.w01+r01.w10+r11.w00)+(g00.w11+g10.w01+g01.w10+g11.w00)+(b00.w11+b10.w01+b01.w10+b11.w00)c10=(r20.w11+r30.w01+r21.w10+r31.w00)+(g20.w11+g30.w01+g21.w10+g31.w00)+(b20.w11+b30.w01+b21.w10+b31.w00)c01=(r02.w11+r12.w01+r03.w10+r13.w00)+(g02.w11+g12.w01+g03.w10+g13.w00)+(b02.w11+b12.w01+b03.w10+b13.w00)c11=(r22.w11+r32.w01+r23.w10+r33.w00)+(g22.w11+g32.w01+i23.w10+g33.w00)+(b22.w11+b32.w01+b23.w10+b33.w00)
Let us now derive the backpropagation.
The Error E is defined as E=(t−rfc)22, where t is given(expected) output in the training data.
We need find out change in E in relation to the changes in values of weights wfc and w00 through w11 given in the filter.
Let us first find out how E changes in relation to wfc or the rate of change of E w.r.t. wfc. This is also called the derivative of E w.r.t wfc or ∂E∂wfc
E=(t−rfc)22∂E∂wfc=∂(t−rfc)22∂wfc=12.∂(t−rfc)2∂wfc ... eq.(1.1)
As per the chain rule,:
dxdy=dxdz.dzdy
For a partial derivative:
∂x∂y=∂x∂z.∂z∂y
Applying the chain rule to right term of eq.(1.1):
Here, z=t−rfc,x=z2, and y=wfc ,therefore, 12.∂(t−rfc)2∂wfc=12.∂x∂y=12.∂x∂z.∂z∂y=12.∂z2∂z.∂z∂y=12.∂(t−rfc)2∂(t−rfc).∂(t−rfc)∂wfc=12.2(t−rfc).∂(t−rfc)∂wfc ... eq.(1.2)
Further,
∂(t−rfc)∂wfc=∂(t−wfc.rp)∂wfc=∂t∂wfc−∂wfc.rp∂wfc=∂t∂wfc−rp.∂wfc∂wfc=0−rp (as,∂C∂x=0 and ∂x∂x=1,where C is a constant),
Based on the above, in eq.(1.2),
12.∂(t−rfc)2∂wfc=12.2(t−rfc).∂(t−rfc)∂wfc=12.2(t−rfc).(0−rp)=(t−rfc).(−rp)
Therefore,
∂E∂wfc=∂(t−rfc)22∂wfc=12.∂(t−rfc)2∂wfc=(t−rfc).(−rp) ... eq.(1.3)
Further upstream the only variables that control the value of rfc are in the filter F. Hence, we need to find out how E changes in relation to F or ∂E∂F.
Since the pooling layer applies the MaxPool function, output rp of pooling layer is given as:
Hence, the values of u00,u10,u10,u11 are given as (see CNN article for more details):
u00=max(0,c00)=max(0,(r00.w11+r10.w01+r01.w10+r11.w00)+(g00.w11+g10.w01+g01.w10+g11.w00)+(b00.w11+b10.w01+b01.w10+b11.w00))u10=max(0,c10)=max(0,(r20.w11+r30.w01+r21.w10+r31.w00)+(g20.w11+g30.w01+g21.w10+g31.w00)+(b20.w11+b30.w01+b21.w10+b31.w00))u01=max(0,c01)=max(0,(r02.w11+r12.w01+r03.w10+r13.w00)+(g02.w11+g12.w01+g03.w10+g13.w00)+(b02.w11+b12.w01+b03.w10+b13.w00))u11=max(0,c11)=max(0,(r22.w11+r32.w01+r23.w10+r33.w00)+(g22.w11+g32.w01+i23.w10+g33.w00)+(b22.w11+b32.w01+b23.w10+b33.w00))
Note that for the sake of simplicity, we do not normalize values of c00,c10,c10,c11.
When you implement a CNN, the pooling layer has to remember the indices of the convolution layer filter maps that were used as input to the pooling layer. This is the approach that has been used in the Dasmic machine learning library. In this case, let us just assume that u10 was the output of the pooling layer, or:
u10=max{u00,u10,u10,u11}⇒rp=u10 ... eq.(1.4)
To compute ∂E∂F, we will need to find out how E changes in relation to each element of matrix F, or we need to find,∂Ew00,∂Ew10,∂Ew01,∂Ew11.
Let us now compute ∂E∂w00,
Applying the chain rule,
∂x∂y=∂x∂z.∂z∂yHere, z=u10,x=E, and y=w00 ,therefore, ∂E∂w00=∂E∂u10.∂u10∂w00 ... eq.(2.1)
Solve for the right term ∂u10∂w00 in eq.(2.1):
∂u10∂w00=∂(max(0,c10))∂w00 ... eq.(2.2)
Applying the chain rule to eq.(2.2),
∂x∂y=∂x∂z.∂z∂yHere, z=c10,x=max(0,c10), and y=w00 ,therefore, ∂(max(0,c10))∂w00=∂max(0,c10)∂c10.∂c10∂w00 ... eq.(2.3)
Now,
dmax(0,x)dx={x,if x > 00,if x < 0undetermined,if x = 0 ... eq.(2.4) Let us assume that x≠0. When x=0 there are several ways of dealing with it including using a slightly modified ReLu function, but that is outside the scope of this article. Let's now look at the two cases when x=c10:
Case 1: x<0,
In eq.(2.3),∂(max(0,c10))∂w00=∂max(0,c10)∂c10.∂c10∂w00=∂0∂c10.∂c10∂w00 (applying value of max(0,c10) from eq.(2.4))=0.∂c10∂w00=0 ... eq.(2.5)
Putting value from eq.(2.5) in eq.(2.2),
∂u10∂w00=∂(max(0,c10))∂w00=0
Putting value of ∂u10∂w00 in eq.(2.1),
∂E∂w00=∂E∂u10.∂u10∂w00=∂E∂u10.0=0
Case 2: x>0,
In eq.(2.3),
∂(max(0,c10))∂w00=∂max(0,c10)∂c10.∂c10∂w00=∂c10∂c10.∂c10∂w00 (applying value of max(0,c10)=c10 from eq.(2.3))=1.∂c10∂w00=∂c10∂w00 ... eq.(2.6)
Solving for eq.(2.6),
∂c10∂w00=∂((r20.w11+r30.w01+r21.w10+r31.w00)+(g20.w11+g30.w01+g21.w10+g31.w00)+(b20.w11+b30.w01+b21.w10+b31.w00))∂w00 (substituting full value of c10)=∂(r20.w11)∂w00+∂(r30.w01)∂w00+∂(r21.w10)∂w00+∂(r31.w00)∂w00+∂(g20.w11)∂w00+∂(g30.w01)∂w00+∂(g21.w10)∂w00+∂(g31.w00)∂w00+∂(b20.w11)∂w00+∂(b30.w01)∂w00+∂(b21.w10)∂w00+∂(b31.w00)∂w00=∂(r20.w11)∂w00+∂(r30.w01)∂w00+∂(r21.w10)∂w00+r31.∂w00∂w00+∂(g20.w11)∂w00+∂(g30.w01)∂w00+∂(g21.w10)∂w00+g31.∂w00∂w00+∂(b20.w11)∂w00+∂(b30.w01)∂w00+∂(b21.w10)∂w00+b31.∂w00∂w00 ... eq.(2.7)
Note that the terms (r20.w11),(r30.w01),(r21.w10),(g20.w11),(g30.w01),(g21.w10),(b20.w11),(b30.w01) and (b21.w10) will not change w.r.t w00 hence, can be treated as a constant C
As ∂C∂x=0, in eq.(2.7),
∂c10∂w00=∂(r20.w11)∂w00+∂(r30.w01)∂w00+∂(r21.w10)∂w00+r31.∂w00∂w00+∂(g20.w11)∂w00+∂(g30.w01)∂w00+∂(g21.w10)∂w00+g31.∂w00∂w00+∂(b20.w11)∂w00+∂(b30.w01)∂w00+∂(b21.w10)∂w00+b31.∂w00∂w00=0+0+0+r31.∂w00∂w00+0+0+0+g31.∂w00∂w00+0+0+0+b31.∂w00∂w00=0+0+0+r31+0+0+0+g31+0+0+0+b31(as, ∂x∂x=0)=r31+g31+b31 ... eq.(2.8)
Putting value from eq.(2.8) in eq.(2.2),
∂u10∂w00=∂(max(0,c10))∂w00=∂c10∂w00 (from eq.(2.6))=r31+g31+b31 (from eq.(2.8)) ... eq.(2.10)
Let us now find the left term of eq. (2.1) ∂E∂u10. As,
E=(t−rfc)22, where t is given(expected) output in the training data,
∂E∂u10=∂(t−rfc)22∂u10=12.∂(t−rfc)2∂u10 ... eq.(2.11)
For a partial derivative, the chain rule is:
∂x∂y=∂x∂z.∂z∂y
Applying the chain rule to right term of eq.(2.11):
Here, z=t−rfc,x=z2, and y=u10 ,therefore, 12.∂(t−rfc)2∂u10=12.∂x∂y=12.∂x∂z.∂z∂y=12.∂z2∂z.∂z∂y=12.∂(t−rfc)2∂(t−rfc).∂(t−rfc)∂u10=12.2(t−rfc).∂(t−rfc)∂u10 ... eq.(2.12)
Further,
∂(t−rfc)∂u10=∂(t−wfc.rp)∂u10=∂t∂u10−∂wfc.rp∂u10 from eq.(1.4) we know rp=u10,therefore,∂(t−rfc)∂u10=∂t∂u10−wfc.∂u10∂u10=0−wfc (as,∂C∂x=0 and ∂x∂x=1,where C is a constant),
Applying value of ∂(t−rfc)∂u10 in eq(2.12),
12.∂(t−rfc)2∂u10=12.2(t−rfc).∂(t−rfc)∂u10=12.2(t−rfc).(0−wfc)=(t−rfc).(0−wfc)
Applying value of 12.∂(t−rfc)2∂u11 in eq.(2.11),
∂E∂u10=∂(t−rfc)22∂u10=(t−rfc).(0−wfc)=(t−rfc).(−wfc) ... eq.(2.13)
Applying values from eq.(2.8) and eq.(2.13) in eq. (2.1),
∂E∂w00=∂E∂u10.∂u10∂w00=∂E∂u10.(i31) (from eq.(2.10))=(t−rfc).(−wfc).(r31+g31+b31) (from eq.(2.5)) ... eq. (2.14)
Eq.(2.14) can also be converted to a general equation that covers both cases 1 and 2. This general equation is also useful if we use a different activation function other than ReLu. This general form of eq. (2.14) is:
∂E∂w00=∂E∂u10.∂u10∂w00=(t−rfc).(−wfc).∂u10∂w00=(t−rfc).(−wfc).∂f(c10)∂w00 ... eq.(2.15)
Here, f(x) is the activation function.
Using similar steps as for ∂E∂w00 let us now compute ∂E∂w10,
Applying the chain rule as before,
∂x∂y=∂x∂z.∂z∂yHere, z=u10,x=E, and y=w10 ,therefore, ∂E∂w10=∂E∂u10.∂u10∂w10 ... eq.(3.1)
Solve for the right term ∂u10∂w10 in eq.(3.1):
∂u10∂w10=∂(max(0,c10))∂w10 ... eq.(3.2)
Applying the chain rule to eq.(3.2),
∂x∂y=∂x∂z.∂z∂yHere, z=c10,x=max(0,c10), and y=w10 ,therefore, ∂(max(0,c10))∂w10=∂max(0,c10)∂c10.∂c10∂w10 ... eq.(3.3)
We already know,
dmax(0,x)dx={1,if x > 00,if x < 0undetermined,if x = 0 ... eq.(3.4)
Case 1: x<0,
In eq.(3.3),
∂(max(0,c10))∂w01=∂max(0,c10)∂c10.∂c10∂w10=∂0∂c10.∂c10∂w10 (applying value of max(0,c10) from eq.(3.4))=0.∂c10∂w10=0 ... eq.(3.5)
Putting value from eq.(3.5) in eq.(3.2),
∂u10∂w10=∂(max(0,c10))∂w10=0
Putting value of ∂u10∂w10 in eq.(3.1),
∂E∂w10=∂E∂u10.∂u10∂w10=∂E∂u10.0=0
Case 2: x>0,
In eq.(3.3),
∂(max(0,c10))∂w10=∂max(0,c10)∂c10.∂c10∂w10=∂c10∂c10.∂c10∂w10 (applying value of max(0,c10)=c10 from eq.(3.3))=1.∂c10∂w10=∂c10∂w10 ... eq.(3.6)
Solving for eq.(3.6),
∂c10∂w10=∂((r20.w11+r30.w01+r21.w10+r31.w00)+(g20.w11+g30.w01+g21.w10+g31.w00)+(b20.w11+b30.w01+b21.w10+b31.w00))∂w10 (substituting full value of c10)=∂(r20.w11)∂w10+∂(r30.w01)∂w10+∂(r21.w10)∂w10+∂(r31.w00)∂w10+∂(g20.w11)∂w10+∂(g30.w01)∂w10+∂(g21.w10)∂w10+∂(g31.w00)∂w10+∂(b20.w11)∂w10+∂(b30.w01)∂w10+∂(b21.w10)∂w10+∂(b31.w00)∂w10=∂(r20.w11)∂w10+∂(r30.w01)∂w10+r21.∂w10∂w10+∂(r31.w00)∂w10+∂(g20.w11)∂w10+∂(g30.w01)∂w10+g21.∂w10∂w10+∂(g31.w00)∂w10+∂(b20.w11)∂w10+∂(b30.w01)∂w10+b21.∂w10∂w10+∂(b31.w00)∂w10 ... eq.(2.7)
Note that the terms (r20.w11),(r30.w01),(r31.w10),(g20.w11),(g30.w01),(g31.w10),(b20.w11),(b30.w01) and (b31.w10) will not change w.r.t w00 hence, can be treated as a constant C
As ∂C∂x=0, in eq.(3.7),
∂c10∂w10=∂(r20.w11)∂w10+∂(r30.w01)∂w10+r21.∂w10∂w10+∂(r31.w00)∂w10+∂(g20.w11)∂w10+∂(g30.w01)∂w10+g21.∂w10∂w10+∂(g31.w00)∂w10+∂(b20.w11)∂w10+∂(b30.w01)∂w10+b21.∂w10∂w10+∂(b31.w00)∂w10=0+0+r21.∂w10∂w10+0+0+0+g21.∂w10∂w10+0+0+0+b21.∂w10∂w10+0=0+0+r21+0+0+0+g21+0+0+0+b21+0(as, ∂x∂x=0)=r21+g21+b21 ... eq.(3.8)
Putting value from eq.(3.8) in eq.(3.2),
∂u10∂w10=∂(max(0,c10))∂w10=∂c10∂w10 (from eq.(3.6))=i21 (from eq.(3.8)) ... eq.(3.10)
Let us now find the left term of eq. (3.1) ∂E∂u10. This has already been computed in eq.(2.13) and given as:
∂E∂u10=(t−rfc).(−wfc)
Applying values from eq.(3.8) and eq.(2.13) in eq. (3.1),
∂E∂w10=∂E∂u10.∂u10∂w10=∂E∂u10.(i31) (from eq.(3.10))=(t−rfc).(−wfc).(r21+g21+b21) (from eq.(2.13)) ... eq. (3.11)
Eq.(3.11) can also be converted to a general equation that covers both cases 1 and 2. This general equation is also useful if we use a different activation function other than ReLu. This general form of eq. (3.11 ) is:
∂E∂w10=∂E∂u10.∂u10∂w10=(t−rfc).(−wfc).∂u10∂w10=(t−rfc).(−wfc).∂f(c10)∂w10 ... eq.(3.12)
Here, f(x) is the activation function.
It will be clear by now that any ∂E∂wxx is the term (t−rfc).(−wfc) multiplied by the element in the input matrix wxx. Therefore, for the remaining weights w01 and w11:
∂E∂w01=(t−rfc).(−wfc).(r30+g30+b30) ... eq.(4.1)∂E∂w11=(t−rfc).(−wfc).(r20+g20+b20) ... eq.(5.1)
and their general form is:
∂E∂w01=(t−rfc).(−wfc).∂f(c10)∂w10 ... eq.(4.2)∂E∂w11=(t−rfc).(−wfc).∂f(c10)∂w11 ... eq.(5.2)
A further general form to represent eqs.(2.15),(3.12), (4.2) and (5.2) for any weight wxy is:
∂E∂wxy=(t−rfc).(−wfc).∂f(cij)∂wxy ... eq.(6.1)
where, cij is the output from the pooling function and wxy is the filter weight.
Weight update rule:
Once the direction of the weight update is determined using ∂E∂wxy for a weight wxy, we update the weights based on the same rule as for a standard NN. Therefore,
If η is the learning rate and controls the numeric value by which weights are changed:
The weights for FC layer (wfc) layer will be updated as:
Δwfc=−η.∂E∂wfc=−η.(t−rfc).(−rp) (from eq.(1.3))=η.(t−rfc).(rp)⇒wfc=wfc+Δwfc
The weights for filter (w00,w10,w01,w11) will be updated as:
Δw00=−η.∂E∂w00=−η.(t−rfc).(−wfc).(∂f(c10)∂w00) (from eq.(2.15))=η.(t−rfc).(wfc).(∂f(c10)∂w00)⇒w00=w00+Δw00
Δw10=−η.∂E∂w10=−η.(t−rfc).(−wfc).(∂f(c10)∂w10) (from eq.(3.6))=η.(t−rfc).(wfc).(∂f(c10)∂w10)⇒w10=w10+Δw10
Δw01=−η.∂E∂w01=−η.(t−rfc).(−wfc).(∂f(c10)∂w01) (from eq.(4.2))=η.(t−rfc).(wfc).(∂f(c10)∂w01)⇒w01=w01+Δw01
Δw11=−η.∂E∂w11=−η.(t−rfc).(−wfc).(∂f(c10)∂w11) (from eq.(5.2))=η.(t−rfc).(wfc).(∂f(c10)∂w11)⇒w11=w11+Δw11
Note that if the output of the pooling layer is different than c10, you have to change the input values (ixy) used in the weight update rule.
This concludes the derivation of backpropagation for a CNN with 3 input matrices. In part-IV, we look how to derive backpropagation in presence of 2 sequential convolutional and ReLu layers. This is a step towards deriving backpropagation for a deep-learning (multi-layered) CNN.
Comments
Post a Comment