Question

I am migrating from my own thread pool to the parallel STL. One thing I need is efficient nested parallelism. Consider the following code:

std::for_each(std::execution::par_unseq, begin, end, [](auto i) { f(i); });

where f() may have the following definition:

void f(auto i)
{
  std::for_each(std::execution::par_unseq,
    i->begin(), i->end(), [](auto j) { g(j); });
}

The above is probably inefficient in most cases due to nested parallelism.

To improve it, I intend to wrap for_each() and many other parallel STL algorithms (reduce(), transform_reduce(), any_of()…) in some fashion similar to the following:

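#include <algorithm>
#include <execution>
#include <iterator>
#include <tbb/task_arena.h>

namespace tp {

bool isInParaEnv = false;  // true while a tp:: parallel loop is running
int maxThread = tbb::this_task_arena::max_concurrency();
/* some other control variables... */

// Sets the thread limit for subsequent tp:: calls.
struct activate
{
  activate(int maxCore = tbb::this_task_arena::max_concurrency())
  {
    maxThread = std::max(1, std::min(
      maxCore, tbb::this_task_arena::max_concurrency()));
    isInParaEnv = false;
    /* some code... */
  }
  ~activate() { /* some code... */ }
};

void for_each(auto begin, auto end, auto && f)
{
  // Run sequentially if limited to one thread, already nested inside a
  // parallel loop, or the range has at most one element.
  if (maxThread == 1 or isInParaEnv or
      begin == end or std::next(begin) == end)
  {
    std::for_each(begin, end, f);
    return;
  }
  isInParaEnv = true;
  std::for_each(std::execution::par_unseq, begin, end, f);
  isInParaEnv = false;
}

}  // namespace tp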
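
To make the dispatch rule in tp::for_each() explicit, here is a minimal sketch that isolates just the decision as a pure function. The names Path and choose_path are hypothetical and not part of the code above, and the sketch assumes an empty range should also take the sequential path.

```cpp
#include <cstddef>

// Hypothetical helper (not part of tp): mirrors the condition at the top
// of tp::for_each() so the three sequential cases can be checked directly.
enum class Path { Sequential, Parallel };

constexpr Path choose_path(int maxThread, bool isInParaEnv, std::size_t n)
{
  // Sequential when only one thread is allowed, when the call is already
  // nested inside a parallel loop, or when there is at most one element.
  if (maxThread == 1 || isInParaEnv || n <= 1) return Path::Sequential;
  return Path::Parallel;
}

// Outermost call on 16 threads with 100 elements: parallel.
static_assert(choose_path(16, false, 100) == Path::Parallel);
// The same call made from inside a parallel loop: sequential.
static_assert(choose_path(16, true, 100) == Path::Sequential);
```

Only the outermost call can ever return Path::Parallel, which is what keeps the thread count bounded in this design.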

This allows me to be somewhat “oblivious” about nested parallelism using a unified loop interface. For example, f() can now be written as:

void f(auto i)
{
  /* some code... */
  tp::for_each(i->begin(), i->end(), [](auto j)->void {/* some code... */});
  /* some code... */
}

And the exported API can be as simple as:

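int api()
{
  // Named object: a bare tp::activate() temporary would be destroyed
  // at the end of the statement.
  tp::activate guard;
  /* Some code... */
  tp::for_each(begin, end, [](auto i) { f(i); });
  /* Some code... */
}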

However, this design still has shortcomings. For instance, if the machine has 16 hardware threads and the outermost loop has only 2 iterations (begin + 2 == end), no inner loop will invoke the parallel algorithm, so the speedup is at most 2×.
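
The limitation can be put in numbers. The following is a small sketch, assuming all iterations cost the same and ignoring scheduling overhead; max_speedup is a hypothetical helper and not part of tp.

```cpp
#include <algorithm>

// Hypothetical helper: upper bound on the speedup when only the outermost
// loop runs in parallel and every inner loop takes the sequential path.
// Assumes equal-cost iterations and no scheduling overhead.
constexpr int max_speedup(int outerIterations, int threads)
{
  return std::max(1, std::min(outerIterations, threads));
}

// 2 outer iterations on a 16-thread machine: at most 2x, 14 threads idle.
static_assert(max_speedup(2, 16) == 2);
static_assert(16 - max_speedup(2, 16) == 14);
```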

How can I improve this design, preferably without giving up the parallel STL, so that I can stay “oblivious” to nested parallelism while using no more than std::thread::hardware_concurrency() threads during the run?
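
One possible direction, sketched here only as an illustration of the goal and not as a solution: replace the boolean flag with a thread budget that the outer loop divides among its iterations. Budget and split_budget are hypothetical names, and the even split assumes roughly balanced work per outer iteration.

```cpp
#include <algorithm>

// Hypothetical: how a fixed thread budget could be shared between levels.
struct Budget {
  int outer;  // threads used by the outer loop
  int inner;  // threads each outer iteration may give to its inner loop
};

constexpr Budget split_budget(int totalThreads, int outerIterations)
{
  const int total = std::max(1, totalThreads);
  // The outer loop cannot use more threads than it has iterations.
  const int outer = std::max(1, std::min(total, outerIterations));
  // Integer division keeps outer * inner within the total budget.
  return Budget{outer, std::max(1, total / outer)};
}

// 16 threads, 2 outer iterations: 2 outer threads, 8 per inner loop.
static_assert(split_budget(16, 2).outer == 2);
static_assert(split_budget(16, 2).inner == 8);
```

The sketch only computes the split. Enforcing a per-iteration limit through the parallel STL is the open part of the question; with a TBB backend, a tbb::task_arena constructed with a concurrency limit might be one way to do it, but whether that composes with std::execution policies depends on the standard library's backend.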

Thanks!


Efficient nested parallelism