CUDA memory access analysis confusion for naive matrix multiplication

Consider the following matrix multiply kernel: