I am trying to measure at which workload size pthreads become useful. So far, I have found that a workload needs to take roughly 3 ms before pthreads contribute positively to overall progress (on an Alder Lake test system).
Is this the correct order of magnitude?
The benchmarking output is below:
Benchmark                                                      Time        CPU  Iterations
-------------------------------------------------------------------------------------------
BM_dispatch<dispatch>/16/process_time/real_time             1.37 ms    1.37 ms         513
BM_dispatch<dispatch>/32/process_time/real_time             2.75 ms    2.75 ms         252
BM_dispatch<dispatch>/48/process_time/real_time             4.15 ms    4.15 ms         169
BM_dispatch<dispatch>/64/process_time/real_time             5.52 ms    5.52 ms         126
BM_dispatch<dispatch>/80/process_time/real_time             6.89 ms    6.89 ms         101
BM_dispatch<dispatch>/96/process_time/real_time             8.26 ms    8.26 ms          84
BM_dispatch<dispatch>/112/process_time/real_time            9.62 ms    9.62 ms          72
BM_dispatch<dispatch_pthread>/16/process_time/real_time     2.16 ms    4.18 ms         359
BM_dispatch<dispatch_pthread>/32/process_time/real_time     3.76 ms    7.38 ms         200
BM_dispatch<dispatch_pthread>/48/process_time/real_time     3.67 ms    7.18 ms         150
BM_dispatch<dispatch_pthread>/64/process_time/real_time     4.30 ms    8.44 ms         163
BM_dispatch<dispatch_pthread>/80/process_time/real_time     4.38 ms    8.60 ms         176
BM_dispatch<dispatch_pthread>/96/process_time/real_time     4.93 ms    9.69 ms         146
BM_dispatch<dispatch_pthread>/112/process_time/real_time    5.31 ms    10.5 ms         126
I benchmark two functions, dispatch and dispatch_pthread, at different workload sizes. Both functions do the same total work, but dispatch_pthread divides the work between two threads. At a runtime of about 1 ms, pthreads are not beneficial; at a workload that takes about 8 ms single-threaded, two pthreads are close to twice as fast as one thread.
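My working assumption is that the crossover point is dominated by the fixed cost of creating and joining the two threads on every call. A rough standalone sketch to measure just that cost (not part of the benchmark; the iteration count and timing method are arbitrary choices of mine) would be:

// Estimate the fixed per-dispatch overhead: create and join two no-op threads
// repeatedly and report the average cost per dispatch.
#include <pthread.h>
#include <chrono>
#include <cstdio>

static void* noop(void*) { return nullptr; }

int main() {
  constexpr int iters = 1000;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, noop, nullptr);
    pthread_create(&t2, nullptr, noop, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
  }
  const auto end = std::chrono::steady_clock::now();
  const double us_per_dispatch =
      std::chrono::duration<double, std::micro>(end - start).count() / iters;
  std::printf("create+join of two no-op threads: %.1f us per dispatch\n", us_per_dispatch);
}

This is the overhead the small workloads fail to amortize.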
The full program is below:
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>
#include <pthread.h>
#include <benchmark/benchmark.h>

// Writes the maximum of the eles values starting at in to *out.
void find_max(const float* in, size_t eles, float* out) {
  float max{0};
  for (size_t i = 0; i < eles; ++i) {
    if (in[i] > max) max = in[i];
  }
  *out = max;
}
// Computes the per-row maximum of a rows x cols matrix, single-threaded.
void dispatch(const float* inp, size_t rows, size_t cols, float* out) {
  for (size_t row = 0; row < rows; row++) {
    find_max(inp + row * cols, cols, out + row);
  }
}
struct pthreadpool_context {
  const float* inp;
  size_t rows;
  size_t cols;
  float* out;
};

void* work(void* ctx) {
  const pthreadpool_context* context = (const pthreadpool_context*)ctx;
  dispatch(context->inp, context->rows, context->cols, context->out);
  return NULL;
}
// Same total work as dispatch(), but the rows are split between two threads.
void dispatch_pthread(const float* inp, size_t rows, size_t cols, float* out) {
  pthread_t thread1, thread2;
  size_t rows_per_thread = rows / 2;
  const pthreadpool_context context1 = {inp, rows_per_thread, cols, out};
  const pthreadpool_context context2 = {inp + rows_per_thread * cols,
                                        rows - rows_per_thread, cols,
                                        out + rows_per_thread};
  pthread_create(&thread1, NULL, work, (void*)&context1);
  pthread_create(&thread2, NULL, work, (void*)&context2);
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
}
template <auto F>
void BM_dispatch(benchmark::State& state) {
  std::random_device rnd_device;
  std::mt19937 mersenne_engine{rnd_device()};
  std::normal_distribution<float> dist{0, 1};
  auto gen = [&]() { return dist(mersenne_engine); };
  const size_t cols = 1024 * state.range(0);
  constexpr size_t rows = 100;
  std::vector<float> inp(rows * cols);
  std::generate(inp.begin(), inp.end(), gen);
  std::vector<float> out(rows);
  for (auto _ : state) {
    F(inp.data(), rows, cols, out.data());
  }
}
BENCHMARK(BM_dispatch<dispatch>)
    ->DenseRange(16, 112, 16)
    ->MeasureProcessCPUTime()
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);

BENCHMARK(BM_dispatch<dispatch_pthread>)
    ->DenseRange(16, 112, 16)
    ->MeasureProcessCPUTime()
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();
The program is compiled with -O2 using GCC 13.2.0 on Ubuntu 22.04 (kernel 5.15).
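The build command is roughly the following (the source file name is a placeholder):

g++ -std=c++17 -O2 dispatch_bench.cc -o dispatch_bench -lbenchmark -lpthread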