I am optimizing a CUDA kernel that processes an image, and the amount of work done for each source pixel changes during the processing. For example, at some point I have a 2D loop, something like
float sum = 0;
for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
        sum += calc_value(x + i * alpha, y + j * beta, vec, something_else);
and less intensive fragments like
Vec3 vec = calc_vector(x, y, something_else_2);
The code is actually much more complicated and includes outer loops and gathering of results from such 2D loops, but in general it is something like
__global__ void process_pixel(Data data)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    Vec3 results[8][10];
    for (int delta = 0; delta < 8; delta++)
        for (int stage = 0; stage < 10; stage++)
        {
            Vec3 vec = calc_vector(x, y, stage);
            Vec3 vec2 = calc_vector2(x, y, delta);
            float sum = 0;
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    sum += calc_value(x + i * alpha, y + j * beta, vec, vec2, delta, stage);
            results[delta][stage] = func(sum);
        }
    find_best(results); // And so on
}
Indeed, a lot of memory is used by each thread, actually much more than I have shown, so there is a problem with the lack of registers. And there are calculations that could be hoisted because they do not depend on delta, or on stage, or on i, but doing so requires even more per-thread memory.
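To illustrate, hoisting those loop-invariant calls in the one-thread-per-pixel kernel would look roughly like this (same schematic names as above; stage_vecs and delta_vecs are just illustrative names, and they are exactly the extra per-thread memory I am talking about):

__global__ void process_pixel_hoisted(Data data)
{
    int x = threadIdx.x;
    int y = threadIdx.y;

    // Hoisted out of the loops: 10 + 8 extra Vec3 per thread on top of results[8][10].
    Vec3 stage_vecs[10];
    for (int stage = 0; stage < 10; stage++)
        stage_vecs[stage] = calc_vector(x, y, stage);

    Vec3 delta_vecs[8];
    for (int delta = 0; delta < 8; delta++)
        delta_vecs[delta] = calc_vector2(x, y, delta);

    Vec3 results[8][10];
    for (int delta = 0; delta < 8; delta++)
        for (int stage = 0; stage < 10; stage++)
        {
            float sum = 0;
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    sum += calc_value(x + i * alpha, y + j * beta,
                                      stage_vecs[stage], delta_vecs[delta], delta, stage);
            results[delta][stage] = func(sum);
        }
    find_best(results);
}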
So I decided to regroup the threads so that each group of 5 threads works in parallel on one pixel; that way the required memory can be spread among them. Instead of, e.g., 16 * 8 threads each processing one pixel, I started to use 16 * 10 thread blocks and added an additional outer loop over pixels, so that when we enter for (int delta = 0; delta < 8; delta++) we process only 16 * 2 pixels together. This gave some improvement in the occupancy of the memory and compute GPU subsystems, but they are still far from optimal, around 40%. And another problem has arisen: I now need to calculate some things with only part of the threads, or use a lot of shared memory:
int tid = threadIdx.x + threadIdx.y * blockDim.x;
int pix = threadIdx.x + blockDim.x * (threadIdx.y / 5); // which of the block's 16 * 2 pixels this thread works on
for (int delta = 0; delta < 8; delta++)
    for (int stage = 0; stage < 10; stage++)
    {
        // Only block_size / 5 threads compute the per-pixel vectors; the rest wait at the barrier.
        __shared__ Vec3 vecs[block_size / 5];
        __shared__ float sums[block_size / 5];
        if (tid < block_size / 5)
        {
            vecs[tid] = calc_vector(x, y, stage);
            sums[tid] = 0;
        }
        __syncthreads();

        Vec3 vec = vecs[pix];
        float sum = 0;
        int i = threadIdx.y % 5; // each of the 5 threads takes one row of the 5x5 window
        for (int j = 0; j < 5; j++)
            sum += calc_value(x + i * alpha, y + j * beta, vec, vec2, delta, stage);
        atomicAdd(&sums[pix], sum); // Another bad thing, but let us not discuss it now
        ...
Although the approach generally looks logical, it seems that doing this part of the work with only 32 threads out of 160, plus the __syncthreads(), produces a big performance bottleneck. In one code fragment I even got worse results with this approach than when computing the same thing redundantly in all threads. I thought it would not be a problem that 128 threads are idle, because many other thread blocks should be running in parallel (I was able to get a theoretical occupancy of about 80%). But it looks like it is a problem (although I could have made a mistake in the experiments and profiling), and the assembly also contains a lot of shared-memory stores and loads, whereas the "non-optimized" code is mostly FMUL/FFMA instructions and that's all.
And then there is the fact that shared memory is even smaller than the register file... If we leave aside the RTX 4090, we have, e.g., 96 KB per SM. So if we want 2048 threads per SM, we have to limit ourselves to 48 bytes, i.e. 12 float values, per thread. That is even less than the 32 registers (128 bytes) per thread that a 64 K-entry register file gives at the same occupancy...
So the question is: what other approaches could be applied here? Something simple and effective... I see the following ways. Since we already have variant 1 (simply compute everything redundantly in all threads) and variant 2 (compute with a small subset of threads), the following would be variants 3 and 4.
- Use a bigger shared memory buffer and precalculate things with all threads (a fuller sketch of what I mean is the first code fragment after this list), so

  __shared__ Vec3 vecs[5][block_size / 5];
  for (int stage = 0; stage < 10; stage++)
  {
      if (stage == 0 || stage == 5)
      {
          vecs[...][...] = calc_vector...
          __syncthreads();
      }
      ...
- Try hard to parallelize calc_vector as well, like we did above with the for (int i = 0; i < 5; i++) loop: spread its results across local variables and then exchange them among neighboring threads with __shfl_up_sync and so on (second code fragment below). Looks too monstrous.
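Spelling out variant 3 a bit more, this is roughly what I have in mind; the pix/lane indexing and the stage + lane trick for filling five stages at once are my assumptions about how it could work, with x and y being the coordinates of the pixel the group handles:

// Inside the loop over delta, as before.
__shared__ Vec3 vecs[5][block_size / 5];
int pix  = threadIdx.x + blockDim.x * (threadIdx.y / 5); // which pixel this group of 5 threads handles
int lane = threadIdx.y % 5;                              // this thread's slot within the group

for (int stage = 0; stage < 10; stage++)
{
    if (stage == 0 || stage == 5)
    {
        __syncthreads(); // make sure nobody is still reading the previous five vectors
        // All 160 threads cooperate: each one computes one (stage, pixel) entry.
        vecs[lane][pix] = calc_vector(x, y, stage + lane);
        __syncthreads(); // make the new vectors visible to the whole block
    }
    Vec3 vec = vecs[stage % 5][pix];
    // ... the 5x5 accumulation with vec, as before ...
}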
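And a very rough sketch of variant 4, mostly to show why it looks monstrous. I assume here that Vec3 is three floats, that the thread-to-pixel mapping is rearranged so the 5 threads of a group occupy consecutive lanes of one warp (wasting 2 lanes per warp), and I use __shfl_sync with an explicit source lane instead of __shfl_up_sync:

int lane  = (threadIdx.x + threadIdx.y * blockDim.x) % 32; // lane within the warp
int group = lane / 5;                                      // six groups of 5 lanes per warp
int slot  = lane % 5;                                      // which of the 5 vectors this thread owns

if (group < 6) // lanes 30 and 31 stay idle
{
    unsigned mask = 0x1Fu << (group * 5); // the 5 lanes of this group

    for (int stage_base = 0; stage_base < 10; stage_base += 5)
    {
        // Each thread computes one vector and keeps it purely in registers.
        Vec3 my_vec = calc_vector(x, y, stage_base + slot);

        for (int stage = stage_base; stage < stage_base + 5; stage++)
        {
            int src = group * 5 + (stage - stage_base); // lane that owns this stage's vector
            Vec3 vec;
            vec.x = __shfl_sync(mask, my_vec.x, src);
            vec.y = __shfl_sync(mask, my_vec.y, src);
            vec.z = __shfl_sync(mask, my_vec.z, src);
            // ... the 5x5 accumulation with vec, as before ...
        }
    }
}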
Any other ideas?