My goal is to understand torch.nn.Conv2d. Right now, I am confused by kernel_size.
Let’s say there is an image with height 5px and width 4px. Applying a 2D convolution with out_channels=4, kernel_size=(3,2), and stride=1 over that input should produce an output of shape (4, 3, 3), where the dimensions are (channels, height, width). stride=1 moves the kernel by 1 pixel, so an image with width 4px and height 5px yields a new feature map with width 3px and height 3px.
Everything checks out at this point.
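As a sanity check of that size arithmetic, here is a minimal sketch (using a random tensor instead of the real image; the shapes are assumed from the description above):

import torch
import torch.nn as nn

# Assumed input: 3 RGB channels, height 5, width 4, no batch dimension
x = torch.randn(3, 5, 4)
m = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=(3, 2), stride=1)
# Output spatial size: H_out = (5 - 3) // 1 + 1 = 3, W_out = (4 - 2) // 1 + 1 = 3
print(m(x).shape)  # torch.Size([4, 3, 3])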
However, when I run m.weight.shape, it returns (4, 3, 3, 2). Shouldn’t it return (4, 3, 2)?
My assumption:
- Each channel has its own kernel. Is this correct? (See the sketch below.)
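Here is a minimal sketch to probe that assumption (again with random input instead of the JPG, and shapes assumed from the description above). It prints the weight shape and reproduces one output value by hand from m.weight and m.bias:

import torch
import torch.nn as nn

m = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=(3, 2), stride=1)
print(m.weight.shape)  # torch.Size([4, 3, 3, 2])

x = torch.randn(3, 5, 4)  # assumed (in_channels, height, width)
out = m(x)

# m.weight[0] is the (3, 3, 2) filter for output channel 0: one (3, 2) slice
# per input channel. Multiply it with the matching input patch, sum over all
# input channels, add the bias, and compare with the layer's own output.
patch = x[:, 0:3, 0:2]
manual = (m.weight[0] * patch).sum() + m.bias[0]
print(torch.allclose(manual, out[0, 0, 0]))  # expected: True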
The code:
import torch.nn as nn
from torchvision.io import read_image
# Define the Conv2d layer
m = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=(3,2))
# Read a JPG image into a 3-dimensional RGB tensor (C, H, W); values are uint8 in [0, 255]
input = read_image("2-rgb4x5.jpg")
print("input[0] with [0, 255] range")
print(input[0])
# Normalize the values to [0, 1] range
input = input.float() / 255.0
# Perform the convolution
# Note: recent PyTorch versions accept an unbatched (C, H, W) input here;
# older versions require a batch dimension, e.g. m(input.unsqueeze(0))
output = m(input)
# Print the weight (kernel) shape, one kernel slice, and the input/output shapes
print("m.weight.shape or kernel")
print(m.weight.shape)
print("m.weight[0][0]")
print(m.weight[0][0])
print("input.shape")
print(input.shape)
print("output.shape")
print(output.shape)
print(output)