I’m measuring the latency of a store instruction on an x86-64 processor and would like to understand the nuances of timing this instruction. Here’s my setup and the specific questions I have:
Setup:
I am using std::chrono::steady_clock::now() to measure the time before and after a store instruction.
The cache line containing the target address starts in the Shared (S) state, so the core must obtain ownership of it, moving the line to the Exclusive (E) or Modified (M) state, before the store can commit.
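For concreteness, the measurement looks roughly like the sketch below. The helper name time_plain_store_ns and its parameters are made up for this post and are not from any library. The volatile qualifier is only there so the compiler does not optimize the store away; it places no constraint on how the CPU orders or buffers the store.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (name and signature invented for illustration):
// times one store with steady_clock and returns the elapsed nanoseconds.
inline std::int64_t time_plain_store_ns(volatile std::uint64_t* target,
                                        std::uint64_t value) {
    const auto start = std::chrono::steady_clock::now();
    *target = value;  // the store under test
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Getting the line into the S state beforehand (for example by having a second thread on another core read it) is not shown here.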
What Time Is Captured?
When I measure the time using std::chrono::steady_clock::now(), what exactly am I capturing? Is it the time for the store instruction to enter the store buffer, or does it include the time for the instruction to be fully committed and visible in memory?
Instruction Retirement and Timing:
Will the timing code immediately following the store run only after the store has retired and its data has been committed to cache (i.e., become globally visible)? Or can the code after the store execute while the store is still in flight?
Context:
I’m using this measurement to analyze memory access performance and understand how different states (S, E, M) affect store latency. Any insights or recommendations for accurate latency measurement techniques on x86-64 architectures would be greatly appreciated.
How to Measure True Latency:
If std::chrono::steady_clock::now() does not wait for the store to be fully committed, what are the best practices or methods for measuring a latency that includes the commit time?
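One candidate I have in mind, offered as a sketch and not as a known-good method, is to fence before each timestamp. Intel's documentation for RDTSC describes executing MFENCE followed by LFENCE before RDTSC when the timestamp should be taken only after all earlier loads and stores are globally visible. The names fenced_timestamp, time_committed_store, and the STORE_TIMING_X86 macro below are invented for this post. The non-x86 branch exists only so the snippet compiles elsewhere; it returns steady_clock ticks instead of TSC ticks.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>  // __rdtsc, _mm_mfence, _mm_lfence
#define STORE_TIMING_X86 1  // hypothetical macro, local to this sketch
#else
#define STORE_TIMING_X86 0
#endif

// Hypothetical helper: waits for earlier stores to become globally visible,
// then reads a timestamp (TSC ticks on x86-64, steady_clock ticks otherwise).
inline std::uint64_t fenced_timestamp() {
#if STORE_TIMING_X86
    _mm_mfence();  // earlier stores must drain from the store buffer
    _mm_lfence();  // later instructions wait until the fences complete
    return __rdtsc();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: elapsed ticks for one store, including the time the
// store needs to leave the store buffer and commit to cache.
inline std::uint64_t time_committed_store(volatile std::uint64_t* target,
                                          std::uint64_t value) {
    const std::uint64_t start = fenced_timestamp();
    *target = value;  // the store under test
    const std::uint64_t stop = fenced_timestamp();  // fences precede the read
    return stop - start;
}
```

The result includes the cost of the fences and of the timestamp read itself, so I assume an empty run (same code without the store) would have to be measured and subtracted, and that many repetitions are needed to separate the signal from noise. Is this the right approach, or is there a better one?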