I was investigating the effectiveness of the adjacent cache line prefetcher and its impact on the number of cache lines prefetched from DRAM. Initially, I assumed it fetched only one more adjacent line.
My goal was to determine whether, and how many, extra cache lines the prefetcher was bringing in. The only relevant information I found was in the “Intel 64 and IA-32 Architectures Software Developer Manual” regarding the MSR at 0x1A4. On my Intel(R) Xeon(R) Gold 6442Y (4th Gen) this register reads 0, suggesting all prefetchers are enabled.
Experiment Design:
- Allocated a ~1GB array of uint64_t values (exceeding my 128MB L3 cache) and initialized it with random values in 0-19.
- Read 8 random indices from the initialized array, with each index aligned down to a 4-cache-line (256-byte) boundary.
- Used the accumulated sum, modulo 8, to pseudo-randomly pick one of the accessed cache lines and then read a line at a fixed offset from it.
- Introduced a data dependency (`oldSum`) to limit how much out-of-order execution can overlap the loads.
- Compiled with GCC using just the “-O2” flag.
Code Snippet:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE 135000000 // ~1GB of uint64_t
#define CACHELINE_SIZE 64
#define ELEMENT_SIZE sizeof(uint64_t)
// Align an element index down to 32 elements = 256 B = 4 cache lines.
#define ALIGN32(x) ((x) & ~(uint64_t)31)

int main(int argc, char *argv[]) {
  if (argc != 2) {
    printf("Usage: ./a.out <offset in elements>\n");
    return 1;
  }
  uint64_t *array =
      (uint64_t *)aligned_alloc(256, ARRAY_SIZE * sizeof(uint64_t));
  if (!array) {
    printf("allocation failed\n");
    return 1;
  }
  int offset = atoi(argv[1]);
  int cacheline_offsets = (offset * ELEMENT_SIZE) / CACHELINE_SIZE;
  printf("Number of cachelines offset is %d\n", cacheline_offsets);
  // Initialize the array
  for (int i = 0; i < ARRAY_SIZE; i++) {
    array[i] = rand() % 20;
  }
  // Measure execution time
  unsigned long sum = 0;
  uint64_t indices[8] = {0};
  clock_t start = clock();
  for (uint64_t i = 0; i < ARRAY_SIZE; i += 8) {
    uint64_t oldSum = sum; // data dependency limits out-of-order overlap
    for (uint64_t j = 0; j < 8; j++) {
      indices[j] = (rand() ^ oldSum) % ARRAY_SIZE;
      indices[j] = ALIGN32(indices[j]);
      sum += array[indices[j]];
    }
    // Touch a line `offset` elements past one of the 8 accessed lines,
    // chosen pseudo-randomly via the accumulated sum. (Note: for an index
    // near the end of the array, a large offset can read past the array.)
    sum += array[indices[sum % 8] + offset];
  }
  clock_t end = clock();
  double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
  printf("Time taken: %f seconds\n", time_taken);
  printf("sum value: %lu\n", sum); // use `sum` so it isn't optimized out
  free(array);
  return 0;
}
The results:
Expected vs. Observed Behavior:
- Expected best performance when accessing the same cache line.
- Predicted slightly worse performance when accessing adjacent line(s) prefetched by the hardware (maybe 1-2 additional cache lines).
- Anticipated a further performance drop for lines farther away.
However, the observed results show a different pattern: performance is the same for the next 7 (!) adjacent cache lines, and the drop beyond that is smaller than I expected (I thought it would be ~10x, since it's a DRAM miss rather than an L1 hit).
Question:
Could the observed performance discrepancies be due to flaws in my experimental design or incorrect assumptions?
Are there alternative approaches for measuring the impact of the adjacent cache line prefetcher?