I am writing a CUDA kernel with an inner loop that looks roughly like this:
for (int i = 0; i < NUM_ITERATIONS; i++)
{
    // read global memory, write shared memory
    __syncthreads();
    // read shared memory, do math
    __syncthreads();
}
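For concreteness, here is a fuller sketch of the kind of loop I mean. The kernel name, the array arguments, the tile size, and the dummy reduction are just placeholders to make the two-barrier pattern explicit, not my actual code:

#define NUM_ITERATIONS 64            // placeholder value

__global__ void staged_kernel(const float* input, float* output)
{
    __shared__ float tile[256];      // assumes blockDim.x <= 256

    for (int i = 0; i < NUM_ITERATIONS; i++)
    {
        int idx = (blockIdx.x * NUM_ITERATIONS + i) * blockDim.x + threadIdx.x;

        // phase 1: each thread stages one element from global into shared memory
        tile[threadIdx.x] = input[idx];
        __syncthreads();             // barrier: tile must be fully written before use

        // phase 2: read shared memory (including other threads' elements) and do math
        float acc = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            acc += tile[j];
        output[idx] = acc;
        __syncthreads();             // barrier: no thread may overwrite tile for the
                                     // next iteration until all threads finish reading
    }
}

So every iteration has two __syncthreads() calls, and any thread that reaches a barrier early just sits there waiting for the slowest thread in the block.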
For performance, I want to minimize the total time threads spend waiting for other threads to arrive at the barrier synchronization. Will the number of threads per block affect the average time a thread spends waiting at the barrier? What about the total time all threads spend waiting? And if I have a low-occupancy kernel (e.g., many registers per thread and few threads per block), are there any strategies that can help reduce synchronization waiting in that case?