The MSVC compiler has a “__restrict” keyword. It works when applied to raw pointers directly, but apparently not when the pointers are wrapped in structs.
I tried the following code:
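#include <memory>
#include <intrin.h>

// Dot product of two 4-float vectors (the horizontal adds need SSE3).
float dot(__m128 a, __m128 b) {
    __m128 tmp = _mm_mul_ps(a, b);  // element-wise products (a dot product multiplies, then sums)
    tmp = _mm_hadd_ps(tmp, tmp);
    return _mm_cvtss_f32(_mm_hadd_ps(tmp, tmp));
}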
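// Non-owning view of four values spaced Stride elements apart.
template<typename T, int Stride = 1> struct v_span {
    T *data;
    operator __m128() const {
        if constexpr (Stride == 1) {
            return _mm_load_ps(data);  // requires data to be 16-byte aligned
        } else {
            // _mm_setr_ps puts data[0] in lane 0, the same order as _mm_load_ps
            return _mm_setr_ps(data[0], data[Stride], data[Stride*2], data[Stride*3]);
        }
    }
    T &operator[](int i) const { return data[i*Stride]; }
};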
// Non-owning view of a row-major 4x4 matrix.
template<typename T> struct m_span {
    T *data;
    auto row(int i) const { return v_span<T>{data + i*4}; }
    auto col(int i) const { return v_span<T, 4>{data + i}; }
};
void multiply1(m_span<float> out, m_span<const float> a, m_span<const float> b) {
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            out.row(r)[c] = dot(a.row(r), b.col(c));
        }
    }
}

void multiply2(float* __restrict out, m_span<const float> a, m_span<const float> b) {
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            out[r*4 + c] = dot(a.row(r), b.col(c));
        }
    }
}

void multiply3(float* __restrict _out, m_span<const float> a, m_span<const float> b) {
    m_span<float> out{_out};
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            out.row(r)[c] = dot(a.row(r), b.col(c));
        }
    }
}
I compiled it with “/O2 /std:c++20” on Godbolt and looked at the assembly. multiply2 looks good: b.col(c) gets loaded from memory once per iteration of the outer loop. But multiply3 generates the same assembly as multiply1, where b.col(c) gets reloaded on every iteration of the inner loop. By contrast, Clang produces the same assembly for multiply2 and multiply3.
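The reload in multiply1 and multiply3 is what the language rules require whenever the compiler cannot prove that a store through out leaves b untouched. A scalar sketch of that effect follows; the function names scale_reload and scale_hoisted are hypothetical and are not part of the code above:

```cpp
// Without __restrict, a store to out[i] might change *factor, so the
// generated code has to behave as if *factor is re-read on every iteration.
void scale_reload(float* out, const float* factor, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = out[i] * *factor;
    }
}

// Reading *factor into a local once is only equivalent to scale_reload
// when out and factor do not overlap. That is the promise __restrict makes.
void scale_hoisted(float* out, const float* factor, int n) {
    const float f = *factor;
    for (int i = 0; i < n; ++i) {
        out[i] = out[i] * f;
    }
}
```

If factor points into out, the two functions produce different results, which is why the compiler may not turn the first into the second on its own.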
I have also tried adding the __restrict keyword to the pointers inside v_span and m_span, to no avail. I even tried creating a struct that wraps “float” with conversion operators, so that “out” would contain a pointer to a different type, only to learn that MSVC allows aliasing between different types, which Clang and GCC do not allow by default.
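A minimal sketch of such a wrapper, for illustration only. The names wrapped_float and scale_into are hypothetical and this is not the exact code that was tested:

```cpp
// Hypothetical wrapper: a distinct type holding one float, implicitly
// convertible to and from float.
struct wrapped_float {
    float value;
    wrapped_float() = default;
    wrapped_float(float v) : value(v) {}
    operator float() const { return value; }
};

// The output goes through wrapped_float*, the inputs through const float*,
// so the stores and the loads use pointers to different types.
void scale_into(wrapped_float* out, const float* in, const float* factor, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * *factor;
    }
}
```

The idea is that a compiler using type-based alias analysis could keep *factor in a register across the stores to out. As noted above, MSVC does not make that assumption.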
I know I can simply store b.col(c) in a variable, and that there is a faster way to do this calculation anyway, but this is just a simplified example. I have far more complicated calculations to do, and my real functions are templates that accept both fixed-size and variable-size vectors and matrices.
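For this simplified case, the manual workaround would look roughly like the sketch below. It uses plain C++ instead of the intrinsics so that it compiles anywhere. The names vec4, load_col, dot4 and multiply_hoisted are hypothetical stand-ins, and it assumes out does not overlap b:

```cpp
#include <array>

// Hypothetical stand-in for the __m128 value that b.col(c) converts to.
using vec4 = std::array<float, 4>;

// Copy column c of a row-major 4x4 matrix into a local value.
inline vec4 load_col(const float* m, int c) {
    return {m[c], m[4 + c], m[8 + c], m[12 + c]};
}

// Dot product of one matrix row (four contiguous floats) with a column copy.
inline float dot4(const float* row, const vec4& col) {
    return row[0]*col[0] + row[1]*col[1] + row[2]*col[2] + row[3]*col[3];
}

// out, a and b are row-major 4x4 matrices. The column is read once per
// outer iteration, so stores to out cannot force it to be reloaded.
void multiply_hoisted(float* out, const float* a, const float* b) {
    for (int c = 0; c < 4; ++c) {
        const vec4 col = load_col(b, c);
        for (int r = 0; r < 4; ++r) {
            out[r*4 + c] = dot4(a + r*4, col);
        }
    }
}
```

This does not depend on the optimizer at all, but it has to be repeated by hand at every call site, which is what becomes impractical in the templated code.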
I know that there are other areas where MSVC falls behind Clang and GCC in optimization. MSVC is not my compiler of choice, but I want anyone to be able to download and compile my project with as little hassle as possible.