I am transitioning from my own thread pool to parallel STL. One thing I need is efficient nested parallelism. Consider the following code:
std::for_each(std::execution::par_unseq, begin, end, f);
where f() may have the following definition:
void f(auto i)
{
    // g is the callable applied to each inner element.
    std::for_each(std::execution::par_unseq,
                  i->begin(), i->end(), g);
}
The above is probably inefficient in most cases due to nested parallelism.
To improve it, I intend to wrap for_each() and many other parallel STL algorithms (reduce(), transform_reduce(), any_of(), …) in a fashion similar to the following:
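#include <algorithm>
#include <execution>
#include <iterator>
#include <tbb/task_arena.h>

namespace tp {
bool isInParaEnv = false;  // true while an outer parallel loop is running
int maxThread = tbb::this_task_arena::max_concurrency();
/* some other control variables... */

struct activate
{
    activate(int maxCore = tbb::this_task_arena::max_concurrency())
    {
        // Clamp the requested core count to [1, max_concurrency()].
        maxThread = std::max(1, std::min(
            maxCore, tbb::this_task_arena::max_concurrency()));
        isInParaEnv = false;
        /* some code... */
    }
    ~activate() { /* some code... */ }
};

void for_each(auto begin, auto end, auto && f)
{
    // Run sequentially if only one thread is allowed, if we are already
    // inside a parallel loop, or if the range has fewer than two elements.
    // The begin == end check avoids calling std::next() on an empty range.
    if (maxThread == 1 or isInParaEnv or begin == end or std::next(begin) == end)
    {
        std::for_each(begin, end, f);
        return;
    }
    isInParaEnv = true;
    std::for_each(std::execution::par_unseq, begin, end, f);
    isInParaEnv = false;
}
}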
This allows me to be somewhat “oblivious” about nested parallelism using a unified loop interface. For example, f() can now be written as:
void f(auto i)
{
    /* some code... */
    tp::for_each(i->begin(), i->end(), [](auto j)->void {/* some code... */});
    /* some code... */
}
And the exported API can be as simple as:
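int api()
{
    // Named object, so the destructor runs when api() returns rather than
    // at the end of this statement.
    tp::activate guard;
    /* Some code... */
    tp::for_each(begin, end, f);
    /* Some code... */
}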
However, this design still has shortcomings. For instance, if the machine has 16 threads and the outermost loop has only 2 iterations (begin + 2 == end), every inner loop takes the sequential path, so the speedup is at most 2x.
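To make that arithmetic concrete, here is a minimal sketch of the bound I have in mind. The helper names outerOnlySpeedupBound and idleThreads are hypothetical (they are not part of tp or any library), and the sketch assumes equal work per outer iteration and ideal scaling:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: upper bound on the speedup when only the outer loop
// runs in parallel and every inner loop falls back to the sequential path.
constexpr std::size_t outerOnlySpeedupBound(std::size_t outerIterations,
                                            std::size_t hwThreads)
{
    assert(hwThreads >= 1);
    // At most one thread can work on each outer iteration.
    return std::max<std::size_t>(1, std::min(outerIterations, hwThreads));
}

// Hypothetical helper: number of threads that have nothing to do under
// the same outer-only scheme.
constexpr std::size_t idleThreads(std::size_t outerIterations,
                                  std::size_t hwThreads)
{
    return hwThreads - std::min(outerIterations, hwThreads);
}

// The case described above: 2 outer iterations on a 16-thread machine.
static_assert(outerOnlySpeedupBound(2, 16) == 2);
static_assert(idleThreads(2, 16) == 14);
```

So in the 2-iteration case, 14 of the 16 threads would sit idle for the whole run, no matter how much work the inner loops contain.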
How can I improve this design, preferably without abandoning parallel STL, so that I can stay “oblivious” about nested parallelism while never using more than std::thread::hardware_concurrency() threads during the run?
Thanks!