I was trying to benchmark instruction cache misses (by generating functions that consist largely of nops
and call each other) when I noticed something a bit weird, which I’ve tried to reproduce. The setup for the benchmark is as follows.
I write functions of differing sizes in x86-64 assembly; each function consists of a varying number of nop instructions followed by a ret. I benchmark these functions using Google Benchmark.
Google Benchmark reports the time taken by each function. The functions I tested differ only in the number of nop instructions, which goes from 2^10 to 2^29 in powers of 2.
Finally, I divide the time measured for each function by the number of instructions in that function and multiply the result by my CPU’s frequency, which gives a rough cycles/nop count for each function.
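As a minimal sketch of that conversion: the function name and its parameters below are hypothetical, and it assumes the core runs at one fixed frequency for the whole measurement, which turbo boost or frequency scaling may violate.

```cpp
#include <cstdint>

// Converts a measured time into an approximate cycles-per-instruction figure.
// time_ns:      time reported for one call of the function, in nanoseconds
// instructions: number of instructions in the function (nops plus the ret)
// cpu_hz:       assumed constant clock frequency in Hz (an assumption, not measured)
double cycles_per_nop(double time_ns, std::uint64_t instructions, double cpu_hz) {
    const double seconds = time_ns * 1e-9;      // nanoseconds -> seconds
    const double total_cycles = seconds * cpu_hz;
    return total_cycles / static_cast<double>(instructions);
}
```

For instance, a 1024-instruction function measured at 1000 ns on a core assumed to run at 2 GHz comes out at roughly 1.95 cycles per instruction.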
This is what the result was – I’ve repeated the process a few times and the graph is always similar to this.
I expected the graph to be a flat line but there is a sudden rise around 2^23. Obviously the processor will optimize the execution (super-scalar, prefetching, etc.) but I see no reason as to why there is a sudden rise when the number of instructions is around 2^23.
So that is what the question is – what is causing that sudden rise in the number of cycles/nop? I’ve not had a chance to run the program on other processors/operating systems. I’m using Ubuntu 22.04 on an 11th gen Intel i5-11400H.
I’ve put the code on a GitHub repo, so it should be easy enough for anyone to run the setup and get the graph. For the sake of completeness, the main code used to run the benchmark is as follows:
#include <benchmark/benchmark.h>
#include <x86intrin.h>
extern "C" void fun_1024();
extern "C" void fun_2048();
extern "C" void fun_4096();
extern "C" void fun_8192();
extern "C" void fun_16384();
extern "C" void fun_32768();
extern "C" void fun_65536();
extern "C" void fun_131072();
extern "C" void fun_262144();
extern "C" void fun_524288();
extern "C" void fun_1048576();
extern "C" void fun_2097152();
extern "C" void fun_4194304();
extern "C" void fun_8388608();
extern "C" void fun_16777216();
extern "C" void fun_33554432();
extern "C" void fun_67108864();
extern "C" void fun_134217728();
extern "C" void fun_268435456();
extern "C" void fun_536870912();
static void (*funs[])(void) = {
    fun_1024,
    fun_2048,
    fun_4096,
    fun_8192,
    fun_16384,
    fun_32768,
    fun_65536,
    fun_131072,
    fun_262144,
    fun_524288,
    fun_1048576,
    fun_2097152,
    fun_4194304,
    fun_8388608,
    fun_16777216,
    fun_33554432,
    fun_67108864,
    fun_134217728,
    fun_268435456,
    fun_536870912,
};
void BM(benchmark::State &state) {
    for (auto _ : state) {
        funs[state.range(0)]();
    }
}
BENCHMARK(BM)->DenseRange(0, 19, 1);
BENCHMARK_MAIN();
where fun_x is simply a function consisting of x-1 nops followed by a ret.
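For anyone who wants to regenerate the assembly without the repo, here is a rough sketch of a generator. It is not necessarily how my files were produced: the helper names make_nop_function and make_all_functions are made up for illustration, and it assumes the GNU assembler, whose .rept/.endr directives repeat the enclosed lines.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Builds GNU assembler source for one function named fun_<n>,
// made of n-1 nop instructions followed by a single ret.
// Using .rept is an assumption; writing out each nop would work too.
std::string make_nop_function(std::uint64_t n) {
    std::ostringstream out;
    out << "    .text\n"
        << "    .globl fun_" << n << "\n"
        << "fun_" << n << ":\n"
        << "    .rept " << (n - 1) << "\n"
        << "    nop\n"
        << "    .endr\n"
        << "    ret\n";
    return out.str();
}

// Concatenates the sources for every size from 2^10 to 2^29.
std::string make_all_functions() {
    std::string src;
    for (int shift = 10; shift <= 29; ++shift) {
        src += make_nop_function(std::uint64_t{1} << shift);
    }
    return src;
}
```

Writing the result of make_all_functions() to a .s file and assembling it alongside the benchmark should give the twenty fun_x symbols declared above. Since nop and ret each encode as a single byte, fun_x occupies x bytes of code.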