Consider a benchmark of a C++ application. It is single-threaded and compute-intensive, performing little or no I/O relative to its overall runtime. It reads (and pages in) all of its data at the outset, then runs many iterations of the core task at hand, so as to average out cache and other ephemeral variations.
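For concreteness, the harness is shaped roughly like the minimal sketch below; load_all_data and run_core_task are hypothetical stand-ins, not our actual code:

#include <chrono>
#include <cstdio>
#include <vector>

static std::vector<double> data;

// Hypothetical stand-in: reads/pages in everything up front, so no I/O
// happens inside the timed region.
static void load_all_data() {
    data.assign(1 << 24, 1.0);
}

// Hypothetical stand-in for the compute-intensive core task.
static double run_core_task() {
    double sum = 0.0;
    for (double v : data) sum += v;
    return sum;
}

int main() {
    load_all_data();
    const int iterations = 100;   // repeat to average out ephemeral effects
    double sink = 0.0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += run_core_task();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::printf("wall-clock: %.3f s (sink=%g)\n", elapsed.count(), sink);
    return 0;
}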
We run it on a large, multi-core Linux system with a NUMA memory hierarchy and far more memory than the program uses. The machine is otherwise ‘idle’: it has some number of the usual daemons floating around, but there should be plenty of cores and memory to spare to keep them happy.
We observe a surprisingly (to us) wide range of variation in the resulting wall-clock times.
Can anyone suggest where to look for an explanation, or what to do to reduce the variation?
uname:
Linux perf2.basistech.net 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux
numactl --show:
~/ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
cpubind: 0 1
nodebind: 0 1
membind: 0 1
~/ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 6144 MB
node 0 free: 2030 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 6134 MB
node 1 free: 144 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
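For what it is worth, one mitigation we intend to try is pinning the run to a single node so that both CPU and memory stay local (./benchmark here stands in for our actual binary):

~/ numactl --cpunodebind=0 --membind=0 ./benchmark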