I have been trying to benchmark memory bandwidth on my M2 Mac, and I noticed that when I use ARM NEON SIMD, processing takes longer and the measured memory bandwidth is lower. This is my code:
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>
#include <vector>

int arm_neon_simd_sum(const std::vector<int8_t>& arr) {
    // Check if the array is empty
    if (arr.empty()) return 0;
    int32x4_t v_sum = vdupq_n_s32(0); // Initialize NEON vector to hold partial sums
    size_t i = 0;
    // Process the array in chunks of 16 elements
    for (; i + 15 < arr.size(); i += 16) {
        int8x16_t v_data = vld1q_s8(&arr[i]); // Load 16 elements into a NEON register
        // Widen int8x16_t to two int16x8_t halves
        int16x8_t v_data_low = vmovl_s8(vget_low_s8(v_data)); // Lower 8 elements to int16
        int16x8_t v_data_high = vmovl_s8(vget_high_s8(v_data)); // Upper 8 elements to int16
        // Pairwise-add adjacent int16 lanes into int32x4_t and accumulate
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_low));
        v_sum = vaddq_s32(v_sum, vpaddlq_s16(v_data_high));
    }
    // Horizontal add the vector to get the sum of all elements
    int32_t sum_array[4];
    vst1q_s32(sum_array, v_sum);
    int sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
    // Handle the remaining elements
    for (; i < arr.size(); ++i) {
        sum += arr[i];
    }
    return sum;
}
And here are my test results:
2024-07-12 21:31:51.101 -------- running arm_neon_simd_sum --------
....................................................................................................
2024-07-12 21:32:03.808 Result: 536870900
2024-07-12 21:32:03.808 Total execution time: 12.705522168999996 seconds
2024-07-12 21:32:03.808 Total bytes processed: 107374182400
2024-07-12 21:32:03.808 Throughput: 7.870593484460515 GB/s
In comparison, a naive serial sum is faster:
2024-07-13 10:28:54.706 -------- running sequential sum --------
....................................................................................................
2024-07-12 21:31:02.750 Result: 536870900
2024-07-12 21:31:02.750 Total execution time: 3.303718871999998 seconds
2024-07-12 21:31:02.750 Total bytes processed: 107374182400
2024-07-12 21:31:02.750 Throughput: 30.268919322261286 GB/s
2024-07-12 21:31:02.750 -------- running random access sum -------
I am wondering why. Am I doing something wrong in my SIMD code that is slowing it down?
Note that this is compiled with -O3, so I think some auto-vectorization is probably happening for the serial code. My main concern is: how can I achieve that level of performance with my own code?
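For reference, a naive sequential sum of this kind is usually just a scalar loop. The sketch below shows the general shape being compared against; the name sequential_sum is an assumption for illustration, and the loop is not necessarily identical to the benchmarked code. At -O3 the compiler is allowed to auto-vectorize a loop like this.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical baseline: a plain scalar loop over the same int8_t data.
// The name sequential_sum is an assumption, not taken from the benchmark.
int sequential_sum(const std::vector<int8_t>& arr) {
    int sum = 0;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        sum += arr[i];  // each int8_t is promoted to int before the add
    }
    return sum;
}
```

Checking the generated assembly for this loop would show whether the compiler vectorized it and how many accumulators it used.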