I have a question about my attempt to use a different chunk size in OpenMP to avoid false sharing.
I created two large input vectors and measured the time of an element-wise sum with two different scheduling options:
- The default #pragma omp parallel for.
- The commented-out option: a static schedule with a chunk size chosen so that each thread works on its own cache line when writing. In my case std::hardware_destructive_interference_size is 64 bytes, so one chunk is 16 int32_t elements.
#include <omp.h>
#include <cstdint>
#include <vector>
#include <iostream>
#include <chrono>
#include <new>

int main(int argc, char const *argv[])
{
    std::size_t n = 10000000 * 64;
    std::vector<std::int32_t> a(n, 1);
    std::vector<std::int32_t> b(n, 1);
    std::vector<std::int32_t> c(n, 0);

    // steady_clock is monotonic, so it is the safer choice for measuring intervals.
    auto start = std::chrono::steady_clock::now();

    // Option 2: one cache line (64 bytes = 16 elements) per chunk.
    // #pragma omp parallel for schedule(static, std::hardware_destructive_interference_size / sizeof(std::int32_t))
    // Option 1: default schedule.
    #pragma omp parallel for
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        c[i] = a[i] + b[i];
    }

    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << std::endl;
    return 0;
}
However, the default "#pragma omp parallel for" is always about twice as fast as the second option. Why does this happen? Am I just breaking the caching of reads while trying to optimize the writes?
I saw the answer to "Use of OpenMP chunk to break cache", but it doesn't actually help me.
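To make the two options concrete, here is a minimal sketch of how I understand each schedule to map loop iterations to threads. The helper names owner_static_chunked and owner_static_block are hypothetical, not part of OpenMP. The round-robin rule for schedule(static, chunk) follows the OpenMP specification. The contiguous-block rule assumes the default schedule is a plain static one with a block of ceil(n / threads) iterations per thread, which is common but implementation-defined, so the exact split may differ on a real runtime.

```cpp
#include <cstddef>

// Thread that owns iteration i under schedule(static, chunk):
// chunks of `chunk` consecutive iterations are dealt out round-robin
// in order of thread number.
std::size_t owner_static_chunked(std::size_t i, std::size_t chunk, std::size_t threads)
{
    return (i / chunk) % threads;
}

// Thread that owns iteration i when every thread gets one contiguous block.
// Assumption: the block size is ceil(n / threads); a real runtime may
// distribute the remainder differently.
std::size_t owner_static_block(std::size_t i, std::size_t n, std::size_t threads)
{
    const std::size_t block = (n + threads - 1) / threads;
    return i / block;
}
```

For example, with 4 threads and a chunk of 16 elements, iterations 0 to 15 go to thread 0, iterations 16 to 31 go to thread 1, and iteration 64 comes back to thread 0. So in the chunked option each thread touches a, b and c in 64-byte pieces with a 256-byte stride, while in the block option each thread walks through one contiguous quarter of every array.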