I have been testing a simple hybrid MPI/OpenMP program:
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <iostream>

void mpi_openmp_run()
{
    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    double t00 = MPI_Wtime();
    for (std::size_t tit = 0; tit < 1000; ++tit)
    {
        int tid;
        double t0, t1;
        #pragma omp parallel private(tid, t0, t1)
        {
            tid = omp_get_thread_num();
            t0 = MPI_Wtime();
            #pragma omp for
            for (std::size_t zindex = 0; zindex < 10000000000ULL; zindex++)
            {
                tid = omp_get_thread_num();   // dummy work being timed
            }
            // the omp for ends with an implicit barrier, so every thread
            // has finished the loop before t1 is taken
            t1 = MPI_Wtime();
            #pragma omp barrier
            if (tid == 0)
            {
                std::cout << " Multithread wall clock: " << t1 - t0
                          << " in threads: " << omp_get_num_threads() << std::endl;
            }
        }
    }
    double t11 = MPI_Wtime();
    if (myrank == 0)
    {
        printf("Wall Clock = %15.6f\n", t11 - t00);
    }
}
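The two configurations I compare below would be launched with something like the following (a sketch assuming Open MPI's mpicxx wrapper and its mpirun; the file name hybrid.cpp and the -x flag for exporting environment variables are Open MPI conventions, other implementations use different options):

    mpicxx -fopenmp hybrid.cpp -o hybrid
    mpirun -np 2 -x OMP_NUM_THREADS=2 ./hybrid    # 2 processes * 2 threads each
    mpirun -np 1 -x OMP_NUM_THREADS=4 ./hybrid    # 1 process  * 4 threads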
I tested the code with 2 MPI processes * 2 threads each and got a running time of about 2.062 s per thread and a total time of 404.56 s.
If I use only 1 process * 4 threads instead, I get a running time of about 1.039 s per thread and a total time of 202.79 s.
I wonder why there is a difference of a factor of about 2, since in this simple example there is no communication between the processes. In this test I have allocated the same total amount of computational resources (2 * 2 = 1 * 4 = 4). Shouldn't that give the same performance? This is really confusing to me.
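Could this be a placement problem, i.e. both MPI processes ending up pinned to the same two cores in the 2 * 2 run? To see where each thread actually lands, I could run a small diagnostic like the following (a minimal sketch assuming Linux, where sched_getcpu() is available as a GNU extension):

    #include <mpi.h>
    #include <omp.h>
    #include <sched.h>   // sched_getcpu(), a GNU/Linux extension
    #include <cstdio>

    int main(int argc, char** argv)
    {
        // FUNNELED: OpenMP threads exist, but only the master thread calls MPI
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            // each OpenMP thread of each MPI rank reports the core it runs on
            std::printf("rank %d, thread %d, core %d\n",
                        rank, omp_get_thread_num(), sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }

If the two ranks in the 2 * 2 run report overlapping core numbers, the four threads would be time-sharing two cores, which would account for a slowdown of roughly a factor of 2.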