I’m currently working on a ray tracing algorithm for a non-image rendering application, using both CUDA and OpenCL for GPU acceleration. My algorithm processes more than 1 million rays, and I’m curious how these rays are partitioned among threads for efficient computation.
In CUDA, I’m launching the kernel function as follows:
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, GPU_trace_rays, 0, raySize);
gridSize = (raySize + blockSize - 1) / blockSize;
GPU_trace_rays<<<gridSize, blockSize>>>(...);
And calculating the thread ID as:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
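Put together, a stripped-down version of my CUDA path looks like the sketch below. The float buffers and the kernel body are placeholders standing in for my actual ray data and tracing code; the point is just the launch geometry and the bounds guard:

#include <cuda_runtime.h>

// Placeholder kernel: one thread per ray. Because the grid size is
// rounded up, the last block can contain threads with tid >= raySize,
// so the guard is needed before touching memory.
__global__ void GPU_trace_rays(const float* arrayA, const float* arrayB,
                               float* arrayC, int raySize)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < raySize) {
        arrayC[tid] = arrayA[tid] + arrayB[tid];
    }
}

void launch_trace(const float* dA, const float* dB, float* dC, int raySize)
{
    int minGridSize = 0, blockSize = 0;
    // Let the runtime suggest a block size that maximizes occupancy
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       GPU_trace_rays, 0, raySize);
    int gridSize = (raySize + blockSize - 1) / blockSize;
    GPU_trace_rays<<<gridSize, blockSize>>>(dA, dB, dC, raySize);
}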
Similarly, in OpenCL, I’m using:
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(chunk_ray_size), cl::NullRange);
And calculating the thread ID as:
int tid = get_global_id(0);
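For completeness, the matching OpenCL kernel looks roughly like this; again the buffers are placeholders, and trace_rays is just a stand-in name for my real kernel:

// Placeholder OpenCL kernel: one work-item per ray in the chunk.
// With cl::NullRange for the local size, the runtime picks a
// work-group size that fits the enqueued global size, so tid runs
// 0..chunk_ray_size-1; the guard only matters if I later round the
// global size up to a multiple of a fixed work-group size.
__kernel void trace_rays(__global const float* arrayA,
                         __global const float* arrayB,
                         __global float* arrayC,
                         const int chunk_ray_size)
{
    int tid = get_global_id(0);
    if (tid < chunk_ray_size) {
        arrayC[tid] = arrayA[tid] + arrayB[tid];
    }
}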
Despite not explicitly assigning multiple rays to a thread, I’ve observed correct results with operations like:
arrayC[tid] = arrayA[tid] + arrayB[tid];
Could someone help me understand how CUDA and OpenCL partition rays among threads? I’m particularly interested in how these frameworks distribute the workload efficiently across the available threads, especially since I sometimes have more rays than available threads.
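For reference, what I expected to need is an explicit grid-stride loop, where each thread walks the ray array in steps of the total grid size so that a fixed number of threads can cover any number of rays. A sketch of what I had in mind, using the same placeholder buffers as above:

__global__ void GPU_trace_rays_strided(const float* arrayA,
                                       const float* arrayB,
                                       float* arrayC, int raySize)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < raySize;
         i += stride) {
        // This thread handles rays i, i + stride, i + 2*stride, ...
        arrayC[i] = arrayA[i] + arrayB[i];
    }
}

Since my current launch rounds the grid up to one thread per ray, I’m unsure whether a loop like this is ever necessary, or whether the scheduler already time-slices the extra blocks onto the hardware for me.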