Approaches to optimize “tight” code fragments in CUDA code
I am optimizing a CUDA kernel that processes an image, and the amount of work done for each source pixel changes during processing. For example, I have a 2D loop at some point, something like