I’m trying to understand the throughput of iterating over an array of varying size N.
    int a[N];
    for (int t = 0; t < K; t++)
        for (int i = 0; i < N; i++)
            a[i]++;
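For anyone who wants to reproduce the measurement, here is a minimal sketch of how the loop above could be timed. It is not the benchmark from the linked article: the helper names `run_passes` and `measure_throughput` are made up for illustration, and I use `volatile` only to stop the compiler from collapsing the K passes into one. That most likely also prevents vectorization, so absolute numbers should come out lower than in the article, though the drop-offs at the cache boundaries should still be visible.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: run k passes over a[0..n), incrementing each
 * element once per pass. volatile keeps every load/store in the binary. */
void run_passes(volatile int *a, size_t n, int k) {
    for (int t = 0; t < k; t++)
        for (size_t i = 0; i < n; i++)
            a[i]++;
}

/* Hypothetical helper: return increments per second for an n-element
 * array, or -1.0 if allocation fails or the run is too short to time. */
double measure_throughput(size_t n, int k) {
    int *a = calloc(n, sizeof *a);
    if (a == NULL)
        return -1.0;

    clock_t start = clock();          /* CPU time, standard C */
    run_passes(a, n, k);
    clock_t end = clock();

    free(a);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;                  /* below the timer resolution */
    return (double)n * (double)k / seconds;
}
```

To get a curve like the one in the graph, one would call `measure_throughput` for N from roughly 2^10 to 2^24 elements and pick K so that N * K stays about constant, which keeps every data point long enough to time reliably.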
Throughput goes down as the size of the array goes up. I understand that throughput is high at first because the array fits in L1, the lowest level of cache, then in L2, and finally in L3 at 4 MB (shown in the image below).
Question: why doesn’t the hardware prefetcher bring upcoming elements into the L1 cache while evicting older ones, so that we keep the L1 throughput seen in the first section of the graph?
Source: the experiment and image are from https://en.algorithmica.org/hpc/cpu-cache/bandwidth