Consider the following matrix multiply kernel:
__global__ void mmNaive(int numArows, int numBCols, int Width, const float* A, const float* B, float* C) {
    // A is numArows x Width, B is Width x numBCols, C is numArows x numBCols (all row-major).
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float sum = 0;
    if (Row < numArows && Col < numBCols) {
        for (int k = 0; k < Width; k++) {
            // The row stride of B (and of C) is numBCols, not Width.
            sum += A[Row * Width + k] * B[k * numBCols + Col];
        }
        C[Row * numBCols + Col] = sum;
    }
}
For simplicity let’s consider that A is 32×128 and B is 128×32. Hence, the resulting matrix C is 32×32. Now, let’s also assume that we will only use one 32×32 block. Hence, by = bx = 0 for the entire execution of the program.
Now, we know that a warp consists of 32 threads and that threads are assigned to warps by linearizing their coordinates with x varying fastest, then y, then z.
Therefore, Warp0 will contain all the threads with threadIdx.y = 0 and threadIdx.x = 0..31, giving Row = 0 and Col = threadIdx.x.
If my understanding is correct, when k = 0:
A[0] will be accessed by all threads in Warp0, which can be served by a single memory request.
B[Col] = B[threadIdx.x], which results in consecutive memory accesses that can also be coalesced into a single memory request.
Similarly for k = 1:
A[1] for all threads.
B: element Col of row 1, again resulting in consecutive memory accesses across the warp.
However, Programming Massively Parallel Processors: A Hands-On Approach by David B. Kirk and Wen-mei W. Hwu suggests the following:
In my case M is matrix A. While A is not coalesced, the book suggests that this access pattern is not preferable: “The kernel in Fig. 4.3 is written so that both thread0,0 and thread0,1 access row 0 elements of M from the global memory.” That is how the book leads into tiled multiplication: “If thread0,0 and thread0,1 can be made to collaborate so that these M elements are only loaded from the global memory once, the total number of accesses to the global memory can be reduced by half.”
But in my analysis, thread(0,0) and thread(0,1), which are both in Warp0, won’t each need a separate load of the row 0 elements from global memory.
There’s definitely something that I’m missing.
Thank you!