I'm writing GEMM code in C++, and the loop order turns out to have a large impact on performance. Code:
// The first edition: loop order (l, i, j) -- the shared dimension first,
// then rows of C, then columns of C.
void gemm_cpu_naive(
    int* __restrict__ C,        // n x m, accumulated into (not zeroed here)
    const int* __restrict__ A,  // n x k
    const int* __restrict__ B,  // k x m
    const int n,
    const int m,
    const int k) {
    for (int l = 0; l < k; ++l)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j) {
                C[i*m + j] += A[i*k + l] * B[l*m + j];
            }
}
// The second edition: loop order (i, j, l) -- each C[i*m + j] is finished
// before moving on to the next output element.
void gemm_cpu_naive(
    int* __restrict__ C,        // n x m, accumulated into (not zeroed here)
    const int* __restrict__ A,  // n x k
    const int* __restrict__ B,  // k x m
    const int n,
    const int m,
    const int k) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            for (int l = 0; l < k; ++l) {
                C[i*m + j] += A[i*k + l] * B[l*m + j];
            }
}
With n = 64, m = 64, k = 64, the first edition takes 118 us, while the second edition takes only 63 us.
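For context, the numbers come from a setup along these lines (a minimal sketch, not my exact harness; the matrix contents and the single timed call are just for illustration):

// Minimal timing sketch, assuming one of the gemm_cpu_naive editions above
// is compiled in. A real measurement would warm up and average many runs.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 64, m = 64, k = 64;
    std::vector<int> A(n * k, 1), B(k * m, 1), C(n * m, 0);

    auto t0 = std::chrono::high_resolution_clock::now();
    gemm_cpu_naive(C.data(), A.data(), B.data(), n, m, k);
    auto t1 = std::chrono::high_resolution_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    // Print C[0] as well so the computation cannot be optimized away.
    std::printf("gemm took %lld us (C[0] = %d)\n", static_cast<long long>(us), C[0]);
    return 0;
}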
The memory access pattern seems the same to me: A is read in row order and B in column order.
I searched a lot, and one answer I found says that the first edition frequently writes to the same memory locations, which can cause cache thrashing.
Is that really the main cause of the huge performance difference, or is there another explanation I'm missing?
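My current guess (not verified against the generated assembly; the function name below is just for illustration) is that the i-j-l order lets the compiler keep the running sum for C[i*m + j] in a register and store it once, while the l-i-j order has to re-read and re-write every element of C on each pass over l. A variant that makes the accumulator explicit:

// Sketch of the second edition with an explicit accumulator: the running
// sum lives in a local variable (likely a register) and C is loaded and
// stored only once per (i, j).
void gemm_cpu_accum(
    int* __restrict__ C,
    const int* __restrict__ A,
    const int* __restrict__ B,
    const int n,
    const int m,
    const int k) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            int acc = C[i*m + j];   // start from the existing value, like +=
            for (int l = 0; l < k; ++l)
                acc += A[i*k + l] * B[l*m + j];
            C[i*m + j] = acc;       // single store per output element
        }
}

This version behaves the same as the second edition (it also accumulates into whatever is already in C); it just spells out the single load and single store per output element.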