Problem
With the toy script below, I am getting much worse performance on the GPU than on the CPU. I am fairly new to GPU programming, so I'm not sure how to even start debugging. From my limited research, convolution is an operation that should benefit greatly from a GPU, so I'm guessing I'm doing something wrong. Is the bottleneck the data being sent to and from the GPU?
import timeit

import numpy as np
import torch


def to_torch(x, device):
    # MPS does not support float64, so downcast before moving to the device
    if x.dtype == 'float64':
        x = x.astype('float32')
    return torch.from_numpy(x).to(device)


def min_max_normalize(x):
    min_val = torch.min(x)
    max_val = torch.max(x)
    return (x - min_val) / (max_val - min_val)


def test_device(device):
    conv1_weights = to_torch(np.random.randn(3, 3), device)
    conv1_bias = to_torch(np.zeros((415, 415)), device)

    def work():
        data = to_torch(np.random.randn(397, 397), device)
        data = min_max_normalize(data)
        # Valid convolution: slide the 3x3 kernel over the input one element at a time
        output_height = data.shape[0] - conv1_weights.shape[0] + 1
        output_width = data.shape[1] - conv1_weights.shape[1] + 1
        output = to_torch(np.zeros((output_height, output_width)), device)
        for i in range(output_height):
            for j in range(output_width):
                output[i, j] = torch.sum(data[i:i+conv1_weights.shape[0], j:j+conv1_weights.shape[1]] * conv1_weights)
        output += conv1_bias[:output_height, :output_width]

    return timeit.timeit(work, number=5)


print(test_device("cpu"))
print(test_device("mps"))
which outputs:
4.249846041202545
51.15217095799744
As you can see, the GPU computation is over 10 times slower. Am I using the torch tensors incorrectly?
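Regarding the transfer question: I'm also not sure my timing is even fair, since MPS work is launched asynchronously. To isolate the host-to-device copy from the compute, I was planning to time something like the sketch below (assuming torch.mps.synchronize() is available in my PyTorch build):

import timeit
import numpy as np
import torch

def time_transfer_only(device, number=5):
    # Measure only the host-to-device copy, waiting for the device to finish each one
    def work():
        _ = torch.from_numpy(np.random.randn(397, 397).astype('float32')).to(device)
        if device == "mps":
            torch.mps.synchronize()  # block until the queued copy has actually completed
    return timeit.timeit(work, number=number)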
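My other suspicion is the per-element loop itself, since every output[i, j] assignment seems to launch its own tiny GPU operation. Below is my guess at a vectorized equivalent of work() using torch.nn.functional.conv2d (the loop doesn't flip the kernel, so it is really a cross-correlation, which is what conv2d computes); I haven't verified it gives identical numbers, so treat it as a sketch:

import torch
import torch.nn.functional as F

def work_vectorized(data, conv1_weights, conv1_bias):
    # conv2d expects (batch, channels, H, W) input and (out_ch, in_ch, kH, kW) weights
    x = data.unsqueeze(0).unsqueeze(0)            # (1, 1, 397, 397)
    w = conv1_weights.unsqueeze(0).unsqueeze(0)   # (1, 1, 3, 3)
    out = F.conv2d(x, w).squeeze(0).squeeze(0)    # "valid" convolution -> (395, 395)
    return out + conv1_bias[:out.shape[0], :out.shape[1]]

Would replacing the Python loop with something like this be the expected way to get a speedup on the GPU, or is the data transfer still the dominant cost?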