I’m working on the implementation of a few related algorithms in CUDA, all of which require a primitive that we’ll call f(). The related algorithms can’t simply call f, though, as they require f to have slightly different behavior for each algorithm. f is a highly optimized function, so I don’t want to write a separate overload of f for each case, because if I end up making changes to f I would then need to change every overload.
f looks something like the following:
__device__ void f(const int *R, const int *C, const int n, int *d, int *Q, int *Q2)
{
    for (int i = blockIdx.x; i < n; i += gridDim.x)
    {
        // Cooperatively inspect R and C and place results into d
    }
}
It’s actually way more complicated, as it uses lots of __shfl instructions and whatnot, but that shouldn’t be relevant here. The problem is that the related algorithms all need slightly different variations of f, but since f is complicated I only want its code in one location.
Here are the requirements of the related algorithms:
A: An additional global device variable to keep track of the maximum of a set of data
B: An additional O(n) or O(n^2) array for recording integer data at various stages of f
C: Similar to B but recording binary data
D: Requires a stack and an array for recording data at various stages of f (the array is the same as needed in B). Has other requirements too, but those are better handled separately.
How can I avoid duplicating f and make it flexible for these use cases?
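To make the requirements concrete, here is a rough sketch of one possible shape, meant only as an illustration and not as the real f: the core loop lives in a single function template, and each algorithm passes in a small hook object. Everything named here is hypothetical (f_impl, NoHook, MaxHook, RecordHook, on_result). The Q and Q2 parameters are left out, R[i] + C[i] is just a placeholder for the real cooperative work, and first/stride stand in for blockIdx.x/gridDim.x so that the sketch compiles with nvcc and also as plain C++.

```cpp
// Sketch only: all names below are hypothetical, not part of the original f.
#ifndef __CUDACC__
// Let the sketch compile as plain C++ for host-side experiments.
#define __host__
#define __device__
#endif

// Default hook: does nothing, so the call can be optimized away.
struct NoHook {
    __host__ __device__ void on_result(int /*i*/, int /*value*/) {}
};

// Requirement A (assumed mapping): track the maximum of the produced values.
struct MaxHook {
    int *max_out;  // points at a global device variable when run on the GPU
    __host__ __device__ void on_result(int /*i*/, int value) {
#ifdef __CUDA_ARCH__
        atomicMax(max_out, value);  // many threads may update concurrently
#else
        if (value > *max_out) *max_out = value;
#endif
    }
};

// Requirement B (assumed mapping): record one integer per element.
struct RecordHook {
    int *log;  // caller-provided array with at least n entries
    __host__ __device__ void on_result(int i, int value) { log[i] = value; }
};

// The single copy of the core loop. The hook is the only part that varies.
template <typename Hook>
__host__ __device__ void f_impl(const int *R, const int *C, int n, int *d,
                                int first, int stride, Hook &hook)
{
    for (int i = first; i < n; i += stride) {
        int value = R[i] + C[i];   // placeholder for the real cooperative work
        d[i] = value;
        hook.on_result(i, value);  // algorithm-specific behavior plugs in here
    }
}

#ifdef __CUDACC__
// Hypothetical kernel wrapper: one instantiation per hook type.
template <typename Hook>
__global__ void f_kernel(const int *R, const int *C, int n, int *d, Hook hook)
{
    f_impl(R, C, n, d, blockIdx.x, gridDim.x, hook);
}
#endif
```

Under these assumptions, A would use MaxHook and B would use RecordHook; C and D would presumably need hooks of their own (a bit array, a stack plus the array from B), which are not shown here. Because the hook type is a template parameter, each variant is resolved at compile time, so an empty hook such as NoHook should add no runtime cost.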