Why is an ARM NEON SIMD sum slower than a serial sum?
I have been trying to benchmark memory bandwidth on my M2 Mac. One thing I noticed is that when I use ARM NEON SIMD intrinsics for the sum, the processing time is longer and the measured memory bandwidth is lower than with a plain serial sum. This is my code: