Until a few years ago, the following illustrative CUDA code was perfectly workable:
__global__ void parent_kernel(int *output){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int Alice, Bob;
    float *Charly, *Dave;   // assume these point at per-thread scratch arrays in global memory
    Alice = tid;
    for(int j=0; j<100; ++j){
        // each parent thread launches a child grid to fill Charly...
        child_kernel1<<<1,512>>>(Charly);
        // ...and waits for it on the device before using the result
        cudaDeviceSynchronize();
        Alice = f(Alice, Charly);
        child_kernel2<<<1,512>>>(Dave);
        cudaDeviceSynchronize();
        Bob = g(Alice, Dave);
    }
    output[tid] = Bob;
}
//Later...
parent_kernel<<<32,32>>>(Ellie);
The for-loop cannot be parallelized, because iteration j+1 depends on the result of iteration j, with presumably some horribly nonlinear functions being called each time.
To process a pair of large arrays, child kernels are launched from the device. The calling thread needs each and every one of the threads executing the child kernels to finish before it can evaluate the functions f and g. This approach is taken from a still-available article on the NVIDIA website.
However, as stated in the CUDA programming guide, synchronizing on child kernels from device code in this way is deprecated. A developer said: “Yes, this will require refactoring your code. There is no zero-impact workaround that I am aware of.”
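If I read the current programming guide correctly, device-side synchronization has been replaced by named launch streams (fire-and-forget and tail launch), but a tail launch only begins after the parent grid has exited, so as far as I can tell it cannot express the wait-inside-the-loop pattern above. A minimal sketch of what I mean, assuming the CUDA 12 device runtime and a trivial placeholder child:

__global__ void child_kernel1(float *Charly){
    Charly[threadIdx.x] = 0.0f;   // trivial placeholder body
}

__global__ void parent_kernel_cdp2(float *Charly){
    // The child is queued into the tail-launch stream, so it only starts
    // once this entire parent grid has exited; its results therefore
    // cannot be read back inside the parent, and there is no device-side
    // cudaDeviceSynchronize() to wait on it any more.
    child_kernel1<<<1, 512, 0, cudaStreamTailLaunch>>>(Charly);
}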
What is the best practice with the most recent CUDA versions? Bear in mind that the example given is trivial; the real code could have any number of other functions, nested if-statements, and a dynamic number of child kernel threads.
One could imagine something like the following: call the parent kernel with
parent_kernel<<<1024,512>>>(Ellie);
Inside it, the child kernel launches would be done away with, and the 512 threads of each block would be used to do those calculations directly. I am wondering how best to approach this, however. Should every thread in the block evaluate the following line,
Alice=f(Alice,Charly);
or should the result be passed around using shared memory?
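For concreteness, here is a rough sketch of the shared-memory variant I have in mind (child1_work, child2_work, f and g are placeholder stand-ins for the real computations, and the per-block scratch arrays are an assumption):

// Placeholder stand-ins for the real (nonlinear) functions; bodies are not meaningful.
__device__ float child1_work(int i)                { return (float)i; }
__device__ float child2_work(int i)                { return (float)(2 * i); }
__device__ int   f(int Alice, const float *Charly) { return Alice + (int)Charly[0]; }
__device__ int   g(int Alice, const float *Dave)   { return Alice + (int)Dave[0]; }

__global__ void parent_kernel(int *output, float *Charly_all, float *Dave_all){
    int block = blockIdx.x;                    // one block now plays the role of one old parent thread
    float *Charly = Charly_all + block * 512;  // per-block scratch arrays in global memory
    float *Dave   = Dave_all   + block * 512;

    __shared__ int Alice, Bob;                 // scalar state broadcast to the whole block
    if (threadIdx.x == 0) Alice = block;
    __syncthreads();

    for(int j=0; j<100; ++j){
        // all 512 threads do what child_kernel1 used to do
        Charly[threadIdx.x] = child1_work(threadIdx.x);
        __syncthreads();                       // replaces the device-side synchronization

        if (threadIdx.x == 0) Alice = f(Alice, Charly);   // one thread updates, the rest read
        __syncthreads();

        Dave[threadIdx.x] = child2_work(threadIdx.x);
        __syncthreads();

        if (threadIdx.x == 0) Bob = g(Alice, Dave);
        __syncthreads();
    }
    if (threadIdx.x == 0) output[block] = Bob;
}

With this layout only thread 0 evaluates f and g, and the result is broadcast through shared memory; the alternative would be for every thread to keep Alice in a register and evaluate f redundantly, which saves one __syncthreads() per update but repeats the same work 512 times.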