Implement DepthWiseConv1d in CUDA
I am trying to implement a depthwise nn.Conv1d in CUDA, but I cannot get the correct result. It would be great if anyone could help me debug it. Currently, the dimensions of the input and output are correct, and the input x has shape (B, L, D) (batch, length, dimension).
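One way to narrow down a wrong-values-but-right-shape bug is to compare the kernel output against a slow CPU reference on a tiny input. Below is a minimal pure-Python sketch of what a depthwise 1D convolution on a (B, L, D) input is expected to compute. It assumes stride 1, dilation 1, zero padding, and one filter of length K per channel (the groups = D case of nn.Conv1d). The names `depthwise_conv1d_ref` and `flat_index` and the nested-list layout are illustrative, not taken from the original code. Keep in mind that nn.Conv1d itself expects input of shape (B, D, L) and computes a cross-correlation (the filter is not flipped), so a (B, L, D) tensor has to be transposed before it is passed to the PyTorch module for comparison.

```python
def flat_index(b, t, d, L, D):
    """Offset of element (b, t, d) in a contiguous row-major (B, L, D) buffer."""
    return (b * L + t) * D + d


def depthwise_conv1d_ref(x, weight, bias=None, padding=0):
    """Slow reference for a depthwise 1D convolution (stride 1, dilation 1).

    x:      nested list indexed as x[b][t][d], shape (B, L, D)
    weight: nested list indexed as weight[d][k], shape (D, K)
    bias:   list of length D, or None
    Returns a nested list of shape (B, L_out, D),
    where L_out = L + 2 * padding - K + 1.
    """
    B, L, D = len(x), len(x[0]), len(x[0][0])
    K = len(weight[0])
    L_out = L + 2 * padding - K + 1
    out = [[[0.0] * D for _ in range(L_out)] for _ in range(B)]
    for b in range(B):
        for t in range(L_out):
            for d in range(D):
                acc = bias[d] if bias is not None else 0.0
                for k in range(K):
                    src = t + k - padding  # input position for this tap
                    if 0 <= src < L:       # positions outside the input are zero padding
                        # Cross-correlation: no filter flip, channel d only reads channel d.
                        acc += weight[d][k] * x[b][src][d]
                out[b][t][d] = acc
    return out


# Tiny example: B=1, L=4, D=2, K=2
x = [[[1, 10], [2, 20], [3, 30], [4, 40]]]
weight = [[1, 1], [1, -1]]
print(depthwise_conv1d_ref(x, weight))
```

Typical causes of a mismatch with this kind of layout are indexing the buffer as if it were (B, D, L), flipping the filter, or mixing values across channels. In a kernel that works on a flat pointer, the offset computed by `flat_index` is the one that corresponds to `x[b][t][d]` for a contiguous (B, L, D) buffer.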