If I write to RAM (specifically on recent x86, but I'm interested more generally) in batches the same size as a cache line (e.g. one AVX-512 vector), to memory that is guaranteed not to be read or written by any other thread (so no contention):
Should I expect the same performance when writing to random locations as when writing sequentially?
I originally expected so, since the writes have no dependencies on one another, are cache-line sized, and should be able to execute in parallel with the arithmetic operations on the CPU. However, my tests seem to indicate otherwise.
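As a quick sanity check on that premise: one Simd<f32, 16> value is 16 * 4 = 64 bytes, which matches the cache-line size on recent x86. Something like the snippet below (nightly only, since portable_simd is unstable) prints the size and alignment; if the alignment is also 64, then every element of the Vec used later occupies exactly one cache line and no store straddles a line boundary.

#![feature(portable_simd)]
use std::simd::Simd;

fn main() {
    // 16 f32 lanes = 64 bytes: the size of one cache line on e.g. Sapphire Rapids.
    println!("size  = {}", std::mem::size_of::<Simd<f32, 16>>());
    // If this also prints 64, each Vec<Simd<f32, 16>> element sits in a single
    // cache line, so a full-vector store never straddles two lines.
    println!("align = {}", std::mem::align_of::<Simd<f32, 16>>());
}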
Here is an example of code I run on a Sapphire Rapids machine to compare the write performance of sequential AVX-512 locations against "random" ones. It gives me about 500 ms per loop for the sequential writes but about 1500 ms for the random ones, which surprised me a lot.
Trying to profile this, it seems as if the random version is much more "core bound", which I fail to understand: I have the same dependencies in both code paths, and the only remaining difference should be how often the memory addresses are sequential or not.
#![feature(portable_simd)]
use std::time::Instant;
use ::std::simd::Simd;
use rayon::prelude::*;

// One AVX-512 vector of f32 lanes: 16 * 4 bytes = 64 bytes, one cache line.
type V32 = Simd<f32, 16>;

fn main() {
    let vec_size = 110_000_000;
    let no_writes: usize = 100_000_000;
    let no_trials = 10;

    // Destination buffer, zero-initialised.
    let mut d = vec![];
    d.resize_with(vec_size + 128, || V32::splat(0.0));

    let mut sum: usize = 0;
    println!("sequential random");
    for _ in 0..no_trials {
        // First the sequential writes.
        let mut u1: u32 = 33;
        let mut u2: u32 = 9824;
        let mut i = 0;
        let mut v = V32::splat(1.0);
        let inn = Instant::now();
        for _ in 0..no_writes {
            // Multiply-with-carry PRNG: two 16-bit generators combined into one u32.
            u1 = 36969 * (u1 & 65535) + (u1 >> 16);
            u2 = 18000 * (u2 & 65535) + (u2 >> 16);
            let j = (u1 << 16) + (u2 & 65535);
            let j = j as usize % vec_size;
            let ii = i as usize % vec_size;
            sum += j + ii;
            v += V32::splat(1.0);
            // The index is dominated by the sequential counter ii; the random
            // value j only contributes a small offset.
            let iii = j / 10000 + ii / 1;
            unsafe { *d.get_unchecked_mut(iii % vec_size) = v };
            // d[j] = v;
            i += 1;
        }
        let t = inn.elapsed();
        print!("{:?}\t", t);
        // Next the "random" writes. These should be memory bound, so the few
        // extra integer calculations should not matter.
        let mut u1: u32 = 33;
        let mut u2: u32 = 9824;
        let mut i = 0;
        let mut v = V32::splat(1.0);
        let inn = Instant::now();
        for _ in 0..no_writes {
            u1 = 36969 * (u1 & 65535) + (u1 >> 16);
            u2 = 18000 * (u2 & 65535) + (u2 >> 16);
            let j = (u1 << 16) + (u2 & 65535);
            let j = j as usize % vec_size;
            let ii = i as usize % vec_size;
            sum += j + ii;
            v += V32::splat(1.0);
            // Same arithmetic as above, but now the random value j dominates the
            // index and the sequential counter ii only contributes a small offset.
            let iii = j / 1 + ii / 10000;
            unsafe { *d.get_unchecked_mut(iii % vec_size) = v };
            // d[j] = v;
            i += 1;
        }
        let t = inn.elapsed();
        println!("{:?}", t);
    }
    // Ensure that everything is not optimized away.
    let mut v = 0.0;
    for i in 0..vec_size {
        v += d[i][i % 16];
    }
    println!();
    println!("{} {}", v, sum);
}
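To separate the address-pattern effect from the integer arithmetic (which is identical in both loops), one possible control is a variant that keeps the same RNG and data dependencies but always stores to a fixed element; the name fixed_target_loop is just illustrative, and this is a sketch rather than something I have measured.

// Illustrative control: identical arithmetic to the two loops above, but the
// store target is fixed, so the store address pattern drops out entirely.
fn fixed_target_loop(d: &mut [V32], vec_size: usize, no_writes: usize) -> usize {
    let mut u1: u32 = 33;
    let mut u2: u32 = 9824;
    let mut v = V32::splat(1.0);
    let mut sum = 0usize;
    for i in 0..no_writes {
        u1 = 36969 * (u1 & 65535) + (u1 >> 16);
        u2 = 18000 * (u2 & 65535) + (u2 >> 16);
        let j = ((u1 << 16) + (u2 & 65535)) as usize % vec_size;
        let ii = i % vec_size;
        sum += j + ii;      // keep the same data dependencies as the benchmark
        v += V32::splat(1.0);
        d[0] = v;           // always the same cache line
    }
    sum
}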