Imagine a 3×3×2 input image, where the last number is the number of channels. This simplified setup helps me reason about the next step. Consider a conv layer with a single kernel of size 2×2×2, i.e. two 2×2 “subkernels”, one per input channel. I slide each subkernel over its corresponding input channel at stride 1 (no padding), which produces 2 feature maps of size 2×2 each. At the very end I take the element-wise sum of these 2 feature maps to get a single 2×2 output.
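To make the setup concrete, here is a minimal NumPy sketch of the operation as I understand it. The values are random, and the helper name `conv2d_single_channel` is my own, not from any library; it assumes stride 1 and no padding, as described above. The last loop is only a sanity check that the per-channel sum matches sliding the full 2×2×2 kernel over the 3-D input window.

```python
import numpy as np

def conv2d_single_channel(x, k):
    """Slide a 2-D kernel k over a 2-D array x (stride 1, no padding)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the window by the kernel element-wise and add everything up
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 3, 2))   # height x width x channels
kernel = rng.standard_normal((2, 2, 2))  # one kernel = two 2x2 "subkernels"

# Step 1: one 2x2 feature map per input channel
per_channel = [
    conv2d_single_channel(image[:, :, c], kernel[:, :, c]) for c in range(2)
]

# Step 2: the element-wise sum I am asking about
summed = per_channel[0] + per_channel[1]

# Sanity check: slide the whole 2x2x2 kernel over each 2x2x2 input window
direct = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        direct[i, j] = np.sum(image[i:i + 2, j:j + 2, :] * kernel)

print(summed.shape)                 # (2, 2)
print(np.allclose(summed, direct))  # True
```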
My question is: why do I perform this final summation? What information does it extract? Most literature explains the motive for using feature maps, but not the reason behind this particular operation.