My platform is a 2nd Generation Intel Xeon Scalable processor with a non-inclusive L3 cache. I ran a series of tests in which the L2 stream prefetcher was prefetching aggressively.
I use perf to monitor performance, with events such as offcore_response.all_pf_data_rd and offcore_response.pf_l2_data_rd.
I understand that both of these events measure cache prefetching activity, but what puzzles me is that their counts do not match up with the counts of their sub-events.
For example, for the three all_pf_data_rd events below, I would expect the first count (any_response) to equal the sum of the latter two (l3_hit and l3_miss), but in reality it does not. I observed a comparable discrepancy for pf_l2_data_rd.
offcore_response.all_pf_data_rd.any_response
offcore_response.all_pf_data_rd.l3_hit.any_snoop
offcore_response.all_pf_data_rd.l3_miss.any_snoop
offcore_response.pf_l2_data_rd.any_response
offcore_response.pf_l2_data_rd.l3_hit.any_snoop
offcore_response.pf_l2_data_rd.l3_miss.any_snoop
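To make the comparison concrete, here is a minimal sketch of how the gap could be computed. It assumes the counts were collected with `perf stat -x,` (CSV output with the count in the first field, the event name in the third, and the percentage of time the counter was running in the fifth). The helper names `parse_perf_csv` and `split_gap` are hypothetical, not part of perf.

```python
import csv
import io


def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: (count, pct_running)}."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comments, and rows too short to hold an event name.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0].strip(), row[2].strip()
        # perf prints "<not counted>" or "<not supported>" instead of a number.
        if not value.isdigit():
            continue
        # Assumption: field 5 is the percentage of time the counter was running.
        try:
            pct = float(row[4])
        except (IndexError, ValueError):
            pct = 100.0
        counts[event] = (int(value), pct)
    return counts


def split_gap(counts, prefix):
    """Return any_response - (l3_hit.any_snoop + l3_miss.any_snoop)."""
    total = counts[prefix + ".any_response"][0]
    hit = counts[prefix + ".l3_hit.any_snoop"][0]
    miss = counts[prefix + ".l3_miss.any_snoop"][0]
    return total - (hit + miss)
```

Calling `split_gap(counts, "offcore_response.all_pf_data_rd")` on the parsed output gives the discrepancy described above; a nonzero result is what I am asking about. The second tuple element is kept so that counters which ran for less than 100% of the time (and were therefore scaled by perf) can be told apart from ones that were counted continuously.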
So my questions are:
- l3_hit indicates that the prefetch target is already in the cache, so why was a prefetch issued at all? Does this mean that the hardware prefetcher does not first check whether the line is in the cache, or is this hit itself the result of that check? (l2_rqsts.pf_hit is similar.)
- Does this suggest that some prefetch requests are passed from L2 to L3, while others bypass L3 and go straight to memory?