Should I expect more memory stores with more restrictive memory orders in C++?
For release-acquire ordering, cppreference says:
All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
Based on this, I expect all modifications thread A made to be flushed to shared memory when an atomic counter is modified. Conversely, when I use relaxed memory ordering, writes other than the one to the atomic counter need not be written to shared memory.
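For reference, the guarantee in that quote is usually illustrated with a flag-and-payload handoff. Below is a minimal sketch of that pattern; the names `payload`, `ready`, and `handoff_demo` are made up for illustration and are not part of the program in this question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only: payload, ready, handoff_demo.
int handoff_demo() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag used for synchronization
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publishes the write above
    });
    std::thread consumer([&] {
        // Spin until the acquire load returns the value the producer stored.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the producer's write, so no data race
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

Calling `handoff_demo()` should return 42, because the acquire load that observes `true` synchronizes with the release store. As far as I can tell, the standard words this as an ordering and visibility guarantee between the two threads.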
To test this expectation, I created the following program (godbolt):
#include <stddef.h>
#include <atomic>
#include <functional>  // std::ref
#include <thread>

auto ORDER = std::memory_order::relaxed;

// Write the maximum of in[0..eles) to *out.
void find_max(const float* in, size_t eles, float* out) {
    float max{0};
    for (size_t i = 0; i < eles; ++i) {
        if (in[i] > max) max = in[i];
    }
    *out = max;
}

void work(const float* inp, std::atomic<int>& counter, size_t cols, float* out) {
    int i{0};
    // Claim the next row by decrementing the shared counter; stop below zero.
    while ((i = counter.fetch_sub(1, ORDER) - 1) > -1) {
        find_max(inp + i * cols, cols, out + i);
    }
}

void dispatch_thread(const float* inp, size_t rows, size_t cols, float* out) {
    std::atomic<int> counter{(int)rows};
    std::thread t1(work, inp, std::ref(counter), cols, out);
    std::thread t2(work, inp, std::ref(counter), cols, out);
    t1.join();
    t2.join();
}
The two threads (t1 and t2) call the function work. They read and decrement an atomic counter, find the maximum of the row of floats selected by that counter, and write that maximum to the output storage.
I would expect the following:
- If I use std::memory_order::relaxed, each thread keeps a local copy of out and only writes counter to memory.
- If I use std::memory_order::seq_cst, every change in out is written to shared memory too.
However, I couldn’t observe this:
- The relevant assembly is the same, irrespective of memory order.
- Profiling with perf mem didn't show any substantial difference in the number of stores.
What's wrong with my expectations? Am I misunderstanding the memory ordering concepts? Cppreference also says: "On strongly-ordered systems — x86, SPARC TSO, IBM mainframe, etc. — release-acquire ordering is automatic for the majority of operations." So maybe I'm testing this on the wrong architecture? However, I also looked at the compiled output on ARM and couldn't see any differences.