I was investigating the effectiveness of the adjacent cache line prefetcher and its impact on the number of cache lines prefetched from DRAM. Initially, I assumed it fetched only one more adjacent line.
My goal was to determine whether, and how many, extra cache lines the prefetcher was bringing in. The only relevant information I found was in the “Intel 64 and IA-32 Architectures Software Developer Manual” regarding the MSR at 0x1A4. On my Intel(R) Xeon(R) Gold 6442Y (4th Gen) this register reads 0, suggesting all prefetchers are enabled.
Experiment Design:
- Allocated a ~1GB array of uint64_t values (exceeding my 128MB L3 cache) and initialized it with random values in 0-19.
- Read 8 random indices from the initialized array, with each index aligned down to a 4-cache-line (256-byte) boundary.
- Used the accumulated sum, modulo 8, to pseudo-randomly pick one of the accessed cache lines and then read a line at a fixed offset from it.
- Introduced a data dependency (`oldSum`) to limit how much out-of-order execution can overlap the loads.
- Compiled with GCC using just the “-O2” flag.
Code Snippet:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 135000000 // ~1GB of uint64_t
#define CACHELINE_SIZE 64
#define ELEMENT_SIZE sizeof(uint64_t)
// Align an element index down to 32 elements = 256 B = 4 cache lines.
#define ALIGN32(x) ((x) & ~(uint64_t)31)

int main(int argc, char *argv[]) {
  if (argc != 2) {
    printf("Usage: ./a.out <offset in elements>\n");
    return 1;
  }
  uint64_t *array =
      (uint64_t *)aligned_alloc(256, ARRAY_SIZE * sizeof(uint64_t));
  if (!array) {
    printf("allocation failed\n");
    return 1;
  }
  int offset = atoi(argv[1]);
  int cacheline_offsets = (offset * ELEMENT_SIZE) / CACHELINE_SIZE;
  printf("Number of cachelines offset is %d\n", cacheline_offsets);
  // Initialize the array
  for (int i = 0; i < ARRAY_SIZE; i++) {
    array[i] = rand() % 20;
  }
  // Measure execution time
  unsigned long sum = 0;
  uint64_t indices[8] = {0};
  clock_t start = clock();
  for (uint64_t i = 0; i < ARRAY_SIZE; i += 8) {
    uint64_t oldSum = sum; // data dependency limits out-of-order overlap
    for (uint64_t j = 0; j < 8; j++) {
      indices[j] = (rand() ^ oldSum) % ARRAY_SIZE;
      indices[j] = ALIGN32(indices[j]);
      sum += array[indices[j]];
    }
    // Touch a line `offset` elements past one of the 8 accessed lines,
    // chosen pseudo-randomly via the accumulated sum. (Note: for an index
    // near the end of the array, a large offset can read past the array.)
    sum += array[indices[sum % 8] + offset];
  }
  clock_t end = clock();
  double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
  printf("Time taken: %f seconds\n", time_taken);
  printf("sum value: %lu\n", sum); // use `sum` so it isn't optimized out
  free(array);
  return 0;
}
The results:
Expected vs. Observed Behavior:
- Expected best performance when accessing the same cache line.
- Predicted slightly worse performance when accessing adjacent line(s) prefetched by the hardware (maybe 1-2 additional cache lines).
- Anticipated a further performance drop for lines farther away.
However, the observed results show a different pattern: performance is the same for the next 7 (!) adjacent cache lines, and the drop beyond that is smaller than I expected (I thought it would be ~10x, since it's a DRAM miss rather than an L1 hit).
Question:
Could the observed performance discrepancies be due to flaws in my experimental design or incorrect assumptions?
Are there alternative approaches for measuring the impact of the adjacent cache line prefetcher?