I would like to load several consecutive SIMD vectors and store them in a container whose size is known at compile time. This lets me write portable code for architectures with different register counts: with AVX2 I load 3 vectors, with Arm Neon 4.
Looking at the generated AVX2 assembly, I am surprised by the number of vmovaps instructions.
Consider the following code:
#include <array>
#include <immintrin.h>

using InpT = const float* __restrict__;

template <size_t unroll_cout>
auto load(InpT data, size_t index) {
    using vecs = std::array<__m256, unroll_cout>;
    alignas(32) vecs loaded;
    for (size_t i = 0; i < unroll_cout; ++i) {
        loaded[i] = _mm256_load_ps(data + index);
        index += 8;
    }
    return loaded;
}

template auto load<3ul>(InpT data, size_t index);

auto loadSingle(InpT data, size_t index) {
    auto vec = _mm256_load_ps(data + index);
    return vec;
}
For the general load operation, I see:
auto load<3ul>(float const*, unsigned long):
        mov     rax, rdi
        vmovaps ymm0, ymmword ptr [rsi + 4*rdx]
        vmovaps ymm1, ymmword ptr [rsi + 4*rdx + 32]
        vmovaps ymm2, ymmword ptr [rsi + 4*rdx + 64]
        vmovaps ymmword ptr [rdi + 64], ymm2
        vmovaps ymmword ptr [rdi + 32], ymm1
        vmovaps ymmword ptr [rdi], ymm0
        vzeroupper
        ret
Loading just one vector, on the other hand, compiles to what I expect:
loadSingle(float const*, unsigned long):
        vmovaps ymm0, ymmword ptr [rdi + 4*rsi]
        ret
Loading three consecutive vectors seems to copy the data from one memory region to another: each vector is loaded into a register and then immediately stored again through the pointer in rdi.
Is the code for load<3> optimal? If not, how can I improve it?
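One thing that may matter here: in the listing above the array is written through rdi, which is then returned in rax. That is how a by-value aggregate too large for registers is returned on x86-64 System V, so the three stores may only exist because load<3> was compiled as a standalone function. A minimal sketch to check this, assuming GCC or Clang on x86-64: sum3 and the AVX_FN macro are hypothetical names introduced only for this test, and the target attribute is there just so the snippet builds without -mavx2.

```cpp
#include <array>
#include <cstddef>
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The attribute enables AVX per function.
#define AVX_FN __attribute__((target("avx")))

// Same loader as above, marked inline so the caller can absorb it.
template <std::size_t unroll_count>
AVX_FN inline std::array<__m256, unroll_count> load(const float* data,
                                                    std::size_t index) {
    std::array<__m256, unroll_count> loaded;
    for (std::size_t i = 0; i < unroll_count; ++i) {
        // data + index + 8 * i must be 32-byte aligned for _mm256_load_ps.
        loaded[i] = _mm256_load_ps(data + index + 8 * i);
    }
    return loaded;
}

// Hypothetical caller: adds three consecutive vectors and stores one result.
// out must be 32-byte aligned.
AVX_FN void sum3(const float* data, std::size_t index, float* out) {
    const auto v = load<3>(data, index);
    const __m256 sum = _mm256_add_ps(_mm256_add_ps(v[0], v[1]), v[2]);
    _mm256_store_ps(out, sum);
}
```

With optimization enabled I would expect the intermediate array to disappear from sum3 once load<3> is inlined, leaving only the loads, the additions and the final store, but I have not confirmed this for every compiler and flag combination.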