I have two arrays, a and b. Each contains 16 bytes, and I would like to add each b[i] to its corresponding a[i]. The arrays do not overlap, and I know that each resulting sum always fits in a byte (important!).
void add16_reference (uint8_t *as, uint8_t *bs) {
    for (auto i = 0; i < 16; i++) {
        as[i] += bs[i];
    }
}
I tried reimplementing this function in a bunch of ways, and the best I could come up with was:
typedef unsigned __int128 uint128_t;
void add16_v3 (uint8_t *as, uint8_t *bs) {
    uint128_t a, b, s;
    std::memcpy(&a, as, 16);
    std::memcpy(&b, bs, 16);
    s = a + b;
    std::memcpy(as, &s, 16);
}
Both GCC and Clang will happily compile this to 2 movs and 2 adds which is great but I can’t help but wonder if there are faster ways I’m just not aware of.
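I reasoned that I can use a single wide addition: because I know each individual sum fits in a byte, no carry ever propagates from one byte into the next, so the 128-bit sum should match the byte-wise result.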
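To make that reasoning concrete, here is a minimal sketch (not one of the versions from the godbolt link). The names add16_bytewise, add16_two_u64 and sums_fit_in_byte are hypothetical, made up for illustration. It assumes the same precondition as above: every a[i] + b[i] fits in a byte. Under that assumption no carry leaves any byte, so the addition can also be split into two independent 64-bit adds, which avoids the non-standard unsigned __int128 type. Whether this is any faster is something I have not measured.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Byte-wise version, equivalent to add16_reference above.
void add16_bytewise(uint8_t *as, const uint8_t *bs) {
    for (int i = 0; i < 16; i++) {
        as[i] += bs[i];
    }
}

// Hypothetical helper: checks the precondition that no byte sum overflows.
bool sums_fit_in_byte(const uint8_t *as, const uint8_t *bs) {
    for (int i = 0; i < 16; i++) {
        if (as[i] + bs[i] > 0xFF) return false;
    }
    return true;
}

// Hypothetical variant: two independent 64-bit additions.
// Only correct if every as[i] + bs[i] fits in a byte, because then
// no carry crosses a byte boundary (or the boundary between the halves).
void add16_two_u64(uint8_t *as, const uint8_t *bs) {
    assert(sums_fit_in_byte(as, bs)); // debug-only precondition check
    uint64_t a[2], b[2];
    std::memcpy(a, as, 16);
    std::memcpy(b, bs, 16);
    a[0] += b[0];
    a[1] += b[1];
    std::memcpy(as, a, 16);
}
```

Since no carries are involved, the result does not depend on byte order either: each byte of the wide sum is just the sum of the two corresponding input bytes.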
I’ve used godbolt (bless them) to inspect the resulting code (https://godbolt.org/z/h4adMTnKn), and I can see that the compilers sometimes emit SIMD (?) instructions and sometimes do not. This matters because, in the benchmarks, the versions using plain old movs/adds are ~8 times faster (https://quick-bench.com/q/z0374AXew8_eL8eDoQXn9XlJm9g).
It also seems to me that there’s no sure way to steer the compilers toward one kind of code or the other (note that v1 compiles to movs/adds on godbolt, while on quick-bench it uses SIMD instead).
With all of the above in mind: is there a faster/nicer/more predictable way of adding these numbers together?