Similar to this question and this issue, I am seeing different results depending on whether an operation is performed on a batch or on a single sample. The difference is that in those posts the errors are in the range of 1e-5 to 1e-6, while for me they are in the range of 0.2 to 0.7.
Running the example below prints the largest absolute difference between the activations for the first sample when it is passed alone versus as part of the full batch. The result is some value ranging from 0.2 to 0.7, depending on the seed. Tested on both CUDA and CPU.
import torch
import torch.nn as nn
num_samples = 100
num_neurons = 200
data_size = 150_000
dataset = torch.randn(num_samples, data_size)
linear_layer = nn.Linear(data_size, num_neurons)
with torch.no_grad():
    # overwrite the first num_samples weight rows with the data samples
    for i in range(num_samples):
        linear_layer.weight[i].copy_(dataset[i])
# compare the first sample passed alone vs. as part of the batch
print(torch.max(torch.abs(linear_layer(dataset[0]) - linear_layer(dataset)[0])))
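I would expect that if this is just float32 reduction-order error, the same experiment in float64 should show a much smaller gap. Below is a minimal sketch of that check (the seed value is arbitrary, and the expectation that the gap shrinks is my assumption):

import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for reproducibility

num_samples = 100
num_neurons = 200
data_size = 150_000

# same setup as above, but in double precision
dataset = torch.randn(num_samples, data_size, dtype=torch.float64)
linear_layer = nn.Linear(data_size, num_neurons).double()
with torch.no_grad():
    for i in range(num_samples):
        linear_layer.weight[i].copy_(dataset[i])
# if the discrepancy comes from float32 accumulation, this value should be
# orders of magnitude smaller than the float32 result (my assumption)
print(torch.max(torch.abs(linear_layer(dataset[0]) - linear_layer(dataset)[0])))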