Using AVX512 instructions, I can use an index vector to gather 16 single-precision values from an array. However, such gather operations are not very efficient: on my machine they execute at a rate of only 2 scalar loads/cycle.
In one of my applications, I always need to gather four contiguous float elements per index. In scalar pseudocode:
for (int i = 0; i < 16; ++i) {
    result.x[i] = source[offset[i]*4 + 0];
    result.y[i] = source[offset[i]*4 + 1];
    result.z[i] = source[offset[i]*4 + 2];
    result.w[i] = source[offset[i]*4 + 3];
}
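In case it helps to check candidate sequences against something concrete, here is a minimal, self-contained C sketch of the same loop. The names Result16 and gather4_ref, the 32-bit signed index type, and the assumption that offsets are non-negative are all my own choices for illustration; nothing in the application fixes them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure-of-arrays result: 16 lanes each of x, y, z, w. */
typedef struct {
    float x[16];
    float y[16];
    float z[16];
    float w[16];
} Result16;

/* Scalar reference: for each of the 16 lanes, read the four contiguous
 * floats starting at source[offset[i] * 4] and scatter them into the
 * x/y/z/w arrays. Assumes every offset[i] is non-negative and that
 * source holds at least (max offset + 1) * 4 floats. */
static void gather4_ref(Result16 *result, const float *source,
                        const int32_t *offset)
{
    for (int i = 0; i < 16; ++i) {
        const float *p = source + (size_t)offset[i] * 4;
        result->x[i] = p[0];
        result->y[i] = p[1];
        result->z[i] = p[2];
        result->w[i] = p[3];
    }
}
```

Any vectorized variant should produce bit-identical output to gather4_ref for the same source and offset arrays, since no arithmetic is performed on the loaded values.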
NVIDIA GPUs can do this sort of thing with a single ld.global.v4.f32 instruction. On the CPU, it also seems that one should be able to exploit this contiguity to do better than four 16-wide gathers. Does anybody here know a faster AVX512 instruction sequence that would improve on the naive strategy? It’s fine to assume arbitrary alignment.