As the memory mountain shows, when fixing the stride, the read throughput decreases and the miss rate increases rapidly as the working set size increases. The figure below shows a slide of the memory mountain.
The reasons are given in CSApp(3rd Edition, Section 6.6.1):
For sizes up to 32 KB, the working set fits entirely in the L1 d-cache, and thus
reads are served from L1 at throughput of about 12 GB/s. For sizes up to 256 KB,
the working set fits entirely in the unified L2 cache, and for sizes up to 8 MB, the
working set fits entirely in the unified L3 cache. Larger working set sizes are served
primarily from main memory.
I am having trouble understanding why failing to fit entirely into L1 or L2 cache can make such a huge impact on read throughput, as each read request issued is finally served from L1-cache(but not directly from main memory), no matter how large the working set is.
Moreover, an average of min(1, (word_size * k) / B)
misses are expected for each loop iteration, where k
is the fixed stride and B
is the block size. The formula is consistent with that in the book, and obviously the formula itself has nothing to do with the working set size. However, miss rate does increase with the working set size. So I am wondering if there is anything that I have missed.
I have tried to simplify the experiment and reproduce it on my own machine, but the results are just the same as the figure shows.
#include <iostream>
const int MAXN = (1 << 26);
const int STRIDE = 31;
int n, a[MAXN];
int main() {
scanf("%d", &n);
long long sum = 0;
for (int i = 0; i < n; i += STRIDE)
sum += a[i];
return 0;
}
// (1 << 26) = 67108864
// (1 << 20) = 1048576
// (1 << 14) = 16384
// n = 67108864
==40770== D refs: 13,732,651 (13,535,381 rd + 197,270 wr)
==40770== D1 misses: 2,179,411 ( 2,177,019 rd + 2,392 wr)
==40770== LLd misses: 2,174,252 ( 2,172,627 rd + 1,625 wr)
==40770== D1 miss rate: 15.9% ( 16.1% + 1.2% )
==40770== LLd miss rate: 15.8% ( 16.1% + 0.8% )
// n = 1048576
==41013== D refs: 946,778 (749,511 rd + 197,267 wr)
==41013== D1 misses: 48,434 ( 46,042 rd + 2,392 wr)
==41013== LLd misses: 42,980 ( 41,392 rd + 1,588 wr)
==41013== D1 miss rate: 5.1% ( 6.1% + 1.2% )
==41013== LLd miss rate: 4.5% ( 5.5% + 0.8% )
// n = 16384
==41130== D refs: 746,968 (549,708 rd + 197,260 wr)
==41130== D1 misses: 15,064 ( 12,693 rd + 2,371 wr)
==41130== LLd misses: 9,683 ( 8,095 rd + 1,588 wr)
==41130== D1 miss rate: 2.0% ( 2.3% + 1.2% )
==41130== LLd miss rate: 1.3% ( 1.5% + 0.8% )