I have been playing with various Intel performance counters by directly accessing the Model Specific Registers (MSRs) on a Xeon Skylake chip, and I find some of the readings about cache-miss-induced stalls, like CYCLE_ACTIVITY.STALLS_MEM_ANY and CYCLE_ACTIVITY.STALLS_L3_MISS, hard to understand.
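For context, this is roughly how I program and read one counter through the msr driver. It is only a minimal sketch: the event encoding (event 0xA3, umask 0x14, cmask 20 for STALLS_MEM_ANY) is my reading of the SDM for Skylake and should be double-checked for your exact model, and a real run would pin the thread to the measured core and make sure nothing else (perf, NMI watchdog) owns the same counter.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0      0x186
#define IA32_PMC0             0x0C1
#define IA32_PERF_GLOBAL_CTRL 0x38F

static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t val = 0;
    pread(fd, &val, sizeof(val), reg);   /* offset into /dev/cpu/N/msr is the MSR number */
    return val;
}

static void wrmsr(int fd, uint32_t reg, uint64_t val) {
    pwrite(fd, &val, sizeof(val), reg);
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);   /* needs root and the msr module loaded */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* evtsel = event | umask<<8 | USR | OS | EN | cmask<<24
     * 0xA3 / 0x14 / cmask=20 is my reading of the Skylake encoding for
     * CYCLE_ACTIVITY.STALLS_MEM_ANY -- verify against the SDM tables. */
    uint64_t evtsel = 0xA3 | (0x14ULL << 8) | (1ULL << 16) | (1ULL << 17)
                    | (1ULL << 22) | (20ULL << 24);
    wrmsr(fd, IA32_PERFEVTSEL0, evtsel);
    /* also enable PMC0 globally (assumes nothing else is using the counter) */
    wrmsr(fd, IA32_PERF_GLOBAL_CTRL, rdmsr(fd, IA32_PERF_GLOBAL_CTRL) | 1ULL);

    uint64_t before = rdmsr(fd, IA32_PMC0);
    /* ... run the scan loop pinned to CPU 0 ... */
    uint64_t after = rdmsr(fd, IA32_PMC0);
    printf("STALLS_MEM_ANY delta: %llu\n", (unsigned long long)(after - before));

    close(fd);
    return 0;
}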
Here is what I observed. When a simple program scans through a 4GB memory region of random numbers (keeping the running maximum so the memory accesses can't be optimized away), the program should be bounded almost entirely by the DRAM-to-L3 transfer speed, and the performance counters do show the L3 taking a huge number of misses. When I inspect CYCLE_ACTIVITY.STALLS_MEM_ANY, the total cycles stalled with an outstanding load anywhere in the memory subsystem, it is high as expected (roughly 1G stall cycles per second against the 2.5 GHz clock). However, I can't understand why CYCLE_ACTIVITY.STALLS_L3_MISS is extremely low, less than 5% of all the stalls, while CYCLE_ACTIVITY.STALLS_L1D_MISS is very high and close to the total stalls. This confuses me: with a 4GB working set, no level of cache should help, so L3 misses should account for most of the stalls spent waiting on DRAM, yet the counters attribute the stalls only to L1 misses.
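For reference, the scan itself is essentially the following (buffer size and types are just what I happened to use, nothing about them is load-bearing):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* ~4 GiB of 64-bit values, far larger than any cache level */
    size_t n = (size_t)4 * 1024 * 1024 * 1024 / sizeof(uint64_t);
    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (!buf) { fprintf(stderr, "malloc failed\n"); return 1; }

    for (size_t i = 0; i < n; i++)
        buf[i] = ((uint64_t)rand() << 32) | (uint64_t)rand();

    /* keep the running max so the loads can't be optimized away */
    uint64_t max = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] > max)
            max = buf[i];

    printf("max = %llu\n", (unsigned long long)max);
    free(buf);
    return 0;
}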
My speculation is that when the L1 prefetcher requests a line from L2, misses there, and the request then also misses in L3, the CPU stalls until the load is satisfied, but the L3-miss stall counter is not incremented in this case, so only the L1 miss gets the blame?
Problem with the Intel Optimization manual
From my own understanding, the percentage of cycles wasted on DRAM access should simply be the delta of CYCLE_ACTIVITY.STALLS_L3_MISS divided by the delta of the core's unhalted cycles. I've seen that this manual from Intel explains how to analyze the performance counters, but even the 2024 version contains obsolete content: Section 22.5.2.3 provides the following formulas for evaluating each cache level's effect on the stalls:
%Memory_Bound = CYCLE_ACTIVITY.STALLS_LDM_PENDING / CLOCKS
%L1 Bound = (CYCLE_ACTIVITY.STALLS_LDM_PENDING - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLOCKS
%L2 Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
%L3 Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Hit_fraction / CLOCKS
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS
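Plugging those formulas into code, the percentages would be computed like this (a sketch: the counter deltas are assumed to have been collected separately over the measurement interval, e.g. with the MSR code above, and I'm treating L3_Miss_fraction as simply 1 - L3_Hit_fraction):

#include <stdio.h>
#include <stdlib.h>

struct deltas {
    double clocks;           /* CPU_CLK_UNHALTED.THREAD                        */
    double stalls_ldm;       /* CYCLE_ACTIVITY.STALLS_LDM_PENDING              */
    double stalls_l1d;       /* CYCLE_ACTIVITY.STALLS_L1D_PENDING              */
    double stalls_l2;        /* CYCLE_ACTIVITY.STALLS_L2_PENDING               */
    double l3_hit_fraction;  /* from the MEM_LOAD_UOPS_RETIRED LLC hit/miss ratio */
};

static void print_bounds(const struct deltas *d) {
    printf("%%Memory_Bound = %5.1f%%\n", 100.0 * d->stalls_ldm / d->clocks);
    printf("%%L1 Bound     = %5.1f%%\n", 100.0 * (d->stalls_ldm - d->stalls_l1d) / d->clocks);
    printf("%%L2 Bound     = %5.1f%%\n", 100.0 * (d->stalls_l1d - d->stalls_l2) / d->clocks);
    printf("%%L3 Bound     = %5.1f%%\n", 100.0 * d->stalls_l2 * d->l3_hit_fraction / d->clocks);
    printf("%%MEM Bound    = %5.1f%%\n", 100.0 * d->stalls_l2 * (1.0 - d->l3_hit_fraction) / d->clocks);
}

int main(int argc, char **argv) {
    if (argc != 6) {
        fprintf(stderr, "usage: %s clocks stalls_ldm stalls_l1d stalls_l2 l3_hit_fraction\n", argv[0]);
        return 1;
    }
    struct deltas d = { atof(argv[1]), atof(argv[2]), atof(argv[3]),
                        atof(argv[4]), atof(argv[5]) };
    print_bounds(&d);
    return 0;
}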
But, firstly, when this section was written there were no L3-specific core performance counters. What confuses me more is that these CYCLE_ACTIVITY.STALLS_XX_PENDING events are defined slightly differently from the XX_MISS ones for Ivy Bridge on this site:
Since I don't really understand the difference, and the PENDING events are no longer present on newer architectures, I'm even more confused about how to evaluate DRAM bandwidth's effect on CPU stalls.