I have been playing with various Intel performance counters by directly accessing the Model Specific Registers (MSRs) on a Xeon Skylake chip, and I find some of the readings about cache-miss-induced stalls, like CYCLE_ACTIVITY.STALLS_MEM_ANY and CYCLE_ACTIVITY.STALLS_L3_MISS, hard to understand.
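For context, this is roughly how I program and read one counter through the msr driver. It is only a minimal sketch: the event encoding (event 0xA3, umask 0x14, cmask 20 for STALLS_MEM_ANY) is my reading of the SDM for Skylake and should be double-checked for your exact model, and a real run would pin the thread to the measured core and make sure nothing else (perf, NMI watchdog) owns the same counter.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0      0x186
#define IA32_PMC0             0x0C1
#define IA32_PERF_GLOBAL_CTRL 0x38F

static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t val = 0;
    pread(fd, &val, sizeof(val), reg);   /* offset into /dev/cpu/N/msr is the MSR number */
    return val;
}

static void wrmsr(int fd, uint32_t reg, uint64_t val) {
    pwrite(fd, &val, sizeof(val), reg);
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);   /* needs root and the msr module loaded */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* evtsel = event | umask<<8 | USR | OS | EN | cmask<<24
     * 0xA3 / 0x14 / cmask=20 is my reading of the Skylake encoding for
     * CYCLE_ACTIVITY.STALLS_MEM_ANY -- verify against the SDM tables. */
    uint64_t evtsel = 0xA3 | (0x14ULL << 8) | (1ULL << 16) | (1ULL << 17)
                    | (1ULL << 22) | (20ULL << 24);
    wrmsr(fd, IA32_PERFEVTSEL0, evtsel);
    /* also enable PMC0 globally (assumes nothing else is using the counter) */
    wrmsr(fd, IA32_PERF_GLOBAL_CTRL, rdmsr(fd, IA32_PERF_GLOBAL_CTRL) | 1ULL);

    uint64_t before = rdmsr(fd, IA32_PMC0);
    /* ... run the scan loop pinned to CPU 0 ... */
    uint64_t after = rdmsr(fd, IA32_PMC0);
    printf("STALLS_MEM_ANY delta: %llu\n", (unsigned long long)(after - before));

    close(fd);
    return 0;
}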
Here is what I observed. When a simple program scans through a 4GB memory region of random numbers (keeping the running maximum so the memory accesses can't be optimized away), the program should be bounded almost entirely by the DRAM-to-L3 transfer speed, and the performance counters do show the L3 taking a huge number of misses. When I inspect CYCLE_ACTIVITY.STALLS_MEM_ANY, the total cycles stalled with an outstanding load anywhere in the memory subsystem, it is high as expected (roughly 1G stall cycles per second against the 2.5 GHz clock). However, I can't understand why CYCLE_ACTIVITY.STALLS_L3_MISS is extremely low, less than 5% of all the stalls, while CYCLE_ACTIVITY.STALLS_L1D_MISS is very high and close to the total stalls. This confuses me: with a 4GB working set, no level of cache should help, so L3 misses should account for most of the stalls spent waiting on DRAM, yet the counters attribute the stalls only to L1 misses.
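For reference, the scan itself is essentially the following (buffer size and types are just what I happened to use, nothing about them is load-bearing):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* ~4 GiB of 64-bit values, far larger than any cache level */
    size_t n = (size_t)4 * 1024 * 1024 * 1024 / sizeof(uint64_t);
    uint64_t *buf = malloc(n * sizeof(uint64_t));
    if (!buf) { fprintf(stderr, "malloc failed\n"); return 1; }

    for (size_t i = 0; i < n; i++)
        buf[i] = ((uint64_t)rand() << 32) | (uint64_t)rand();

    /* keep the running max so the loads can't be optimized away */
    uint64_t max = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] > max)
            max = buf[i];

    printf("max = %llu\n", (unsigned long long)max);
    free(buf);
    return 0;
}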
My speculation is that when the L1 prefetcher requests a line from L2, misses there, and the request then also misses in L3, the CPU stalls until the load is satisfied, but the L3-miss stall counter is not incremented in this case, so only the L1 miss gets the blame?
Problem with the Intel Optimization manual
From my own understanding, the percentage of cycles wasted on DRAM access should simply be the delta of CYCLE_ACTIVITY.STALLS_L3_MISS divided by the delta of the core's unhalted cycles. I've seen that this manual from Intel explains how to analyze the performance counters, but even the 2024 version contains obsolete content: Section 22.5.2.3 provides the following formulas for evaluating each cache level's effect on the stalls:
%Memory_Bound = CYCLE_ACTIVITY.STALLS_LDM_PENDING / CLOCKS
%L1 Bound = (CYCLE_ACTIVITY.STALLS_LDM_PENDING - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLOCKS
%L2 Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
%L3 Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Hit_fraction / CLOCKS
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS
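Plugging those formulas into code, the percentages would be computed like this (a sketch: the counter deltas are assumed to have been collected separately over the measurement interval, e.g. with the MSR code above, and I'm treating L3_Miss_fraction as simply 1 - L3_Hit_fraction):

#include <stdio.h>
#include <stdlib.h>

struct deltas {
    double clocks;           /* CPU_CLK_UNHALTED.THREAD                        */
    double stalls_ldm;       /* CYCLE_ACTIVITY.STALLS_LDM_PENDING              */
    double stalls_l1d;       /* CYCLE_ACTIVITY.STALLS_L1D_PENDING              */
    double stalls_l2;        /* CYCLE_ACTIVITY.STALLS_L2_PENDING               */
    double l3_hit_fraction;  /* from the MEM_LOAD_UOPS_RETIRED LLC hit/miss ratio */
};

static void print_bounds(const struct deltas *d) {
    printf("%%Memory_Bound = %5.1f%%\n", 100.0 * d->stalls_ldm / d->clocks);
    printf("%%L1 Bound     = %5.1f%%\n", 100.0 * (d->stalls_ldm - d->stalls_l1d) / d->clocks);
    printf("%%L2 Bound     = %5.1f%%\n", 100.0 * (d->stalls_l1d - d->stalls_l2) / d->clocks);
    printf("%%L3 Bound     = %5.1f%%\n", 100.0 * d->stalls_l2 * d->l3_hit_fraction / d->clocks);
    printf("%%MEM Bound    = %5.1f%%\n", 100.0 * d->stalls_l2 * (1.0 - d->l3_hit_fraction) / d->clocks);
}

int main(int argc, char **argv) {
    if (argc != 6) {
        fprintf(stderr, "usage: %s clocks stalls_ldm stalls_l1d stalls_l2 l3_hit_fraction\n", argv[0]);
        return 1;
    }
    struct deltas d = { atof(argv[1]), atof(argv[2]), atof(argv[3]),
                        atof(argv[4]), atof(argv[5]) };
    print_bounds(&d);
    return 0;
}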
But, firstly, when this section was written there were no L3-specific core performance counters. What confuses me more is that these CYCLE_ACTIVITY.STALLS_XX_PENDING events are defined slightly differently from the XX_MISS ones for Ivy Bridge on this site:
Since I don't really understand the difference, and the PENDING events are no longer present on newer architectures, I'm even more confused about how to evaluate DRAM bandwidth's effect on CPU stalls.