Using OpenMP, C++, and icpx on the Intel Dev Cloud's 4x GPU Max 1100 + Xeon.
Code:
#pragma omp for
for (int k = 0; k < nz; k++) {
  for (int j = 0; j < ny; j++) {
    for (int i = 0; i < nx; i++) {
      // initialize the residual from the forcing term
      for (int m = 0; m < 5; m++) {
        rsd[k][j][i][m] = -frct[k][j][i][m];
      }
      // cache the reciprocal of u[k][j][i][0] ...
      double tmp = 1.0 / u[k][j][i][0];
      rho_i[k][j][i] = tmp;
      // ... and qs = 0.5 * (u1^2 + u2^2 + u3^2) / u0
      qs[k][j][i] = 0.50 * (u[k][j][i][1] * u[k][j][i][1]
                          + u[k][j][i][2] * u[k][j][i][2]
                          + u[k][j][i][3] * u[k][j][i][3]) * tmp;
    }
  }
}
It offloads successfully to one device (verification of the whole computation still passes),
but it fails with 2 GPUs (even though other, similar parts of the code do work on multiple GPUs).
In the profiler I do see activity on both GPUs, but the results of the computation come out wrong for some reason.
For the code above**, I use correct device data mapping plus #pragma omp target teams distribute parallel for device(device_id).
For 2 GPUs, I use a domain decomposition approach: I split the k loop into 2 chunks, bounded by k_start and k_stop.
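Roughly, I compute the per-device bounds like this (a simplified sketch, not my exact code; the helper k_bounds and the name num_devices are illustrative):

    #include <algorithm>

    // Illustrative helper: split [0, nz) into one contiguous k chunk per device.
    void k_bounds(int nz, int num_devices, int device_id,
                  int& k_start, int& k_stop) {
        int chunk = (nz + num_devices - 1) / num_devices;  // ceiling division
        k_start = device_id * chunk;
        k_stop  = std::min(nz, k_start + chunk);           // clamp the last chunk
    }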
Note: other than the following reference, I haven't found any complete example of this use case:
https://pawseysc.github.io/sc20-gpu-offloading/07-multi-gpu/index.html
Exactly what happens is:
Above the code**, I have an omp parallel region that spawns 2 CPU threads.
I use the same number of CPU threads as GPUs (e.g. 2).
When those 2 CPU threads reach the code**, I assign a device_id to each CPU thread and offload half of the k iteration space from each thread (see the sketch below).
The data mapping of the chunks is done correctly; proof of that is that the other, similar code parts do work with 2-4 GPUs.
Also, just in case, I put a barrier before and after the code**.
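Put together, the pattern looks roughly like this (a simplified, self-contained sketch: the array sizes NZ/NY/NX are made up, and the map clauses are only illustrative, since in my real code the chunks are already mapped by an enclosing target data region):

    #include <omp.h>
    #include <algorithm>

    constexpr int NZ = 64, NY = 64, NX = 64;   // made-up sizes for illustration

    double u[NZ][NY][NX][5], rsd[NZ][NY][NX][5], frct[NZ][NY][NX][5];
    double rho_i[NZ][NY][NX], qs[NZ][NY][NX];

    int main() {
        int num_devices = omp_get_num_devices();      // 2 in my test

        #pragma omp parallel num_threads(num_devices)
        {
            int device_id = omp_get_thread_num();     // one host thread per GPU
            int chunk   = (NZ + num_devices - 1) / num_devices;
            int k_start = device_id * chunk;
            int k_stop  = std::min(NZ, k_start + chunk);
            int len     = k_stop - k_start;

            #pragma omp barrier                       // barrier before code**

            // each host thread offloads its own k chunk to its own device
            #pragma omp target teams distribute parallel for collapse(3) \
                device(device_id) \
                map(to: u[k_start:len], frct[k_start:len]) \
                map(from: rsd[k_start:len], rho_i[k_start:len], qs[k_start:len])
            for (int k = k_start; k < k_stop; k++) {
                for (int j = 0; j < NY; j++) {
                    for (int i = 0; i < NX; i++) {
                        for (int m = 0; m < 5; m++) {
                            rsd[k][j][i][m] = -frct[k][j][i][m];
                        }
                        double tmp = 1.0 / u[k][j][i][0];
                        rho_i[k][j][i] = tmp;
                        qs[k][j][i] = 0.50 * (u[k][j][i][1] * u[k][j][i][1]
                                            + u[k][j][i][2] * u[k][j][i][2]
                                            + u[k][j][i][3] * u[k][j][i][3]) * tmp;
                    }
                }
            }

            #pragma omp barrier                       // barrier after code**
        }
        return 0;
    }

This reproduces the structure I described above; the real code differs only in how the data regions are set up.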
Any suggestions as to what could go wrong here? Am I missing something?