Optimal instruction sequence for AVX512 gather of 4D vectors
Using AVX512 instructions, I can gather 16 single-precision values from an array with a single index vector. However, such gather operations are not very efficient: on my machine they issue at a rate of only 2 scalar loads per cycle.
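For reference, here is a minimal sketch of the kind of gather I mean. The helper name `gather16` is made up for illustration, and I am assuming 32-bit indices that count floats (not bytes). The scalar branch is only there so the snippet also builds on compilers or machines without AVX512F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* Gather 16 floats: out[i] = base[idx[i]] for i = 0..15.
   gather16 is a hypothetical helper name used only for this sketch. */
static void gather16(const float *base, const int32_t *idx, float *out)
{
    assert(base != NULL && idx != NULL && out != NULL);
#ifdef __AVX512F__
    /* Load the 16 32-bit indices (unaligned load, no alignment assumed). */
    __m512i vidx = _mm512_loadu_si512((const void *)idx);
    /* Scale of 4: indices count floats, the instruction needs byte offsets. */
    __m512 v = _mm512_i32gather_ps(vidx, base, 4);
    _mm512_storeu_ps(out, v);
#else
    /* Scalar fallback with the same semantics as the gather above. */
    for (int i = 0; i < 16; ++i)
        out[i] = base[idx[i]];
#endif
}
```

Build with something like `-mavx512f` (or `-march=native` on a machine that supports it) to get the vector path. Both branches produce the same result, so the fallback can also serve as a correctness check for any faster sequence.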