I have written some basic MPI code in C for image processing that runs for N iterations and, in every iteration, performs a halo swap between processes. I initially implemented this halo swap with separate MPI_Ssend and MPI_Recv calls, one after the other, which gave a runtime of ~8 seconds on 64 processes over 100,000 iterations of the numerical routine. I then switched to a single combined MPI_Sendrecv call, which gave a runtime of ~0.5 seconds, all other factors being equal.
I then dropped to 32 processes and 50,000 iterations. The runtime of the separate scheme fell to ~2.3 seconds, whereas the combined scheme stayed at ~0.5 seconds. I kept halving both, and the results are in the table below:
Processes | Iterations | Separate runtime (s) | Combined runtime (s) |
---|---|---|---|
64 | 100,000 | 8 | 0.50 |
32 | 50,000 | 2.32 | 0.49 |
16 | 25,000 | 0.86 | 0.43 |
8 | 12,500 | 0.48 | 0.41 |
4 | 6,250 | 0.39 | 0.37 |
2 | 3,125 | 0.38 | 0.37 |
So the weak scaling of the combined approach is excellent, whereas the separate approach scales much worse for some reason. Does anyone know why this is?
The code is as follows:
// Combined send receive approach
MPI_Sendrecv(&old[Nx_mpi][1], Ny_mpi, MPI_DOUBLE, right, 1,
&old[0][1], Ny_mpi, MPI_DOUBLE, left, 1,
cart_comm, &statusArray[0]);
MPI_Sendrecv(&old[1][1], Ny_mpi, MPI_DOUBLE, left, 2,
&old[Nx_mpi+1][1], Ny_mpi, MPI_DOUBLE, right, 2,
cart_comm, &statusArray[1]);
// Separate sends and receives approach
MPI_Ssend(&old[Nx_mpi][1], Ny_mpi, MPI_DOUBLE, right, 1, cart_comm);
MPI_Recv(&old[0][1], Ny_mpi, MPI_DOUBLE, left, 1, cart_comm, &statusArray[0]);
MPI_Ssend(&old[1][1], Ny_mpi, MPI_DOUBLE, left, 2, cart_comm);
MPI_Recv(&old[Nx_mpi+1][1], Ny_mpi, MPI_DOUBLE, right, 2, cart_comm, &statusArray[0]);
where cart_comm is a 1D Cartesian topology I created. Only one of the two approaches is active in any given run; the other is commented out. The code runs on the ARCHER2 HPC system, which uses Cray MPI and the GCC compiler, and it is compiled with no optimisation flags.