I tried parallelizing a code snippet with OpenMP, but it turns out that enabling OpenMP makes the program take about 25x longer to finish. Is there anything wrong? How can I optimize it?
#include <iostream>
#include <cmath>
#include <random>
#include <chrono>
#include <cstdlib>
#include <omp.h>
using namespace std;

int main() {
    unsigned long long black_square = 1, digit_square = 13;
    //auto n = ((black_square)<<11) * static_cast<unsigned long long>(pow(digit_square,10));
    auto n = static_cast<unsigned long long>(1e9);
    srand(0);
    int tmp = 0;
    std::random_device rd;  // Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with rd()
    std::uniform_int_distribution<> distrib(1, 6);

    auto tStart = std::chrono::high_resolution_clock::now();
    //#pragma omp parallel for schedule(static) reduction(+:tmp)
    #pragma omp parallel for schedule(static) reduction(+:tmp) num_threads(8)
    for (unsigned long long i = 0; i < n; i++)
        tmp = (tmp + (5 == rand() % 6)) % static_cast<int>(1e9);
    //for (unsigned long long i = 0; i < n; i++) tmp = (tmp + (5 == distrib(gen))) % static_cast<int>(1e9);
    tmp %= static_cast<int>(1e9);
    auto tEnd = std::chrono::high_resolution_clock::now();

    cout << tmp << " obtained after " << n << " iterations in "
         << (tEnd - tStart).count() / 1e9 << "s." << endl;
    return 0;
}
The code is compiled with g++ -o a.out -O3 -std=c++11 -fopenmp tmp.cpp, where g++ is version 8.5.0 20210514. The OS is RHEL 8.9 and the machine has 20 Intel Xeon CPUs at 2.593 GHz.
The serial code runs in about 7.4 s on average, while the parallel code runs in about 180 s. The options -O3, -O2, and -O1 give similar results. Switching to the mt19937 generator reduces the performance gap significantly, but the parallel code is still much slower than the serial version. Increasing or decreasing n leads to similar results as well.
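
To be concrete, by "switching to mt19937" I mean giving each thread its own engine, roughly like the sketch below (the per-thread seeds are arbitrary and only for illustration; the commented-out distrib(gen) line above would share a single engine across all threads, which I realize is a data race):

#include <chrono>
#include <iostream>
#include <random>
#include <omp.h>

int main() {
    const unsigned long long n = 1000000000ULL;
    int tmp = 0;
    auto tStart = std::chrono::high_resolution_clock::now();
    #pragma omp parallel reduction(+:tmp) num_threads(8)
    {
        // Each thread owns its own engine and distribution, so no shared
        // state is touched on each call. Seeds here are arbitrary.
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_int_distribution<> distrib(1, 6);
        #pragma omp for schedule(static)
        for (unsigned long long i = 0; i < n; i++)
            tmp = (tmp + (5 == distrib(gen))) % 1000000000;
    }
    auto tEnd = std::chrono::high_resolution_clock::now();
    std::cout << tmp << " obtained after " << n << " iterations in "
              << std::chrono::duration<double>(tEnd - tStart).count() << "s.\n";
    return 0;
}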