I have a kernel that is supposed to generate statistics for a simulation by drawing samples. I need a very, very large number of samples (the kernel may have to run for hours). I have allocated a working space (an array) for N samples, where N is smaller than the total number of samples, and I reuse that space by resetting each workspace slot back to its default seed.
__global__ void batch_kernel(float* arr, int N, long long total_operations, int* result) {
    long long idx = threadIdx.x + (long long)blockIdx.x * blockDim.x;
    // Workspace lives in global memory here; shared memory could be used if needed
    while (idx < total_operations) {
        int workspace_idx = idx % N; // Reuse workspace slots
        // Load data into / reset the workspace slot to its default seed
        // Synchronize threads within the block to ensure the workspace is loaded
        __syncthreads();
        // Perform some computation on the workspace
        // [SOME COMPUTATION AND MODIFICATION WITH A RESULT]
        int value = 1; // placeholder per-sample result
        // Accumulate the per-sample value into the global result
        atomicAdd(result, value);
        // Synchronize threads within the block so the workspace is not overwritten prematurely
        __syncthreads();
        // Grid-stride loop: advance by the total number of threads in the grid
        idx += (long long)blockDim.x * gridDim.x;
    }
}
Now, on to the main function:
#include <climits>
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;                         // Size of the workspace array
    const long long total_operations = INT_MAX; // Total number of operations
    // Host memory allocations
    std::vector<float> h_arr(N, 1.0f);          // Example initialization
    int h_result = 0;
    // Device memory allocations
    float* d_arr;
    int* d_result;
    cudaMalloc(&d_arr, N * sizeof(float));
    cudaMalloc(&d_result, sizeof(int));
    // Copy data to device
    cudaMemcpy(d_arr, h_arr.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);
    // Kernel launch parameters
    int threadsPerBlock = 256;
    int blocks = (int)((total_operations + threadsPerBlock - 1) / threadsPerBlock);
    // Launch the kernel
    batch_kernel<<<blocks, threadsPerBlock>>>(d_arr, N, total_operations, d_result);
    // Copy results back to host (this also synchronizes with the kernel)
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    // Print the result
    std::cout << "Result = " << h_result << std::endl;
    // Free device memory
    cudaFree(d_arr);
    cudaFree(d_result);
    return 0;
}
As you can see, I perform 2,147,483,647 operations while using only 1024 * sizeof(float) bytes of workspace.
Is there a better way to do this than the idx % N hack? And what if I do not know total_operations in advance, and instead need the kernel to keep running until a convergence criterion is reached; how can I do that?
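For completeness, the only alternative I have come up with on my own is to relaunch the kernel from the host in fixed-size batches and test for convergence between launches. A rough sketch of what I mean is below; has_converged(), the tolerance, and samples_per_batch are placeholders, not my real criterion:

#include <cmath> // for std::fabs in the placeholder convergence test

// Placeholder convergence test on the running mean of the accumulated result
bool has_converged(float current_mean, float previous_mean) {
    return std::fabs(current_mean - previous_mean) < 1e-6f; // placeholder tolerance
}

// ... inside main(), after the allocations and copies shown above ...
const long long samples_per_batch = 1 << 24; // placeholder batch size
long long samples_done = 0;
float previous_mean = 0.0f;
while (true) {
    batch_kernel<<<blocks, threadsPerBlock>>>(d_arr, N, samples_per_batch, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost); // also syncs with the kernel
    samples_done += samples_per_batch;
    float current_mean = static_cast<float>(h_result) / samples_done;
    if (has_converged(current_mean, previous_mean)) break;
    previous_mean = current_mean;
}

This works, but it forces a host-side synchronization after every batch.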
From what I have read, persistent kernels are the way to implement this kind of long-running pattern safely in CUDA, and since CUDA 9 it seems persistent kernels should be written with cooperative groups.
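In case it helps to show where I am, here is my rough, untested understanding of the cooperative-groups pattern. The converged() criterion, the placeholder per-sample work, and the commented-out launch code are all assumptions on my part; I am not at all sure the synchronization structure is correct:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int d_stop = 0; // set once the (placeholder) convergence criterion is met

__device__ bool converged(const int* result) {
    return *result >= 1000000; // placeholder criterion for this sketch
}

__global__ void persistent_kernel(float* arr, int N, int* result) {
    cg::grid_group grid = cg::this_grid();
    long long idx = grid.thread_rank();
    while (true) {
        int workspace_idx = idx % N;   // same workspace reuse as in batch_kernel
        arr[workspace_idx] = 1.0f;     // placeholder: reset the slot to its seed
        // [SOME COMPUTATION AND MODIFICATION WITH A RESULT]
        int value = 1;                 // placeholder per-sample result
        atomicAdd(result, value);
        grid.sync();                   // whole-grid barrier: every thread has finished this "batch"
        if (grid.thread_rank() == 0 && converged(result))
            d_stop = 1;
        grid.sync();                   // every thread now observes the same d_stop value
        if (d_stop) break;
        idx += (long long)gridDim.x * blockDim.x;
    }
}

// The kernel must be launched cooperatively (and fit entirely on the device)
// for grid.sync() to be legal:
// int dev = 0, sm_count = 0, blocks_per_sm = 0;
// cudaGetDevice(&dev);
// cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
// cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, persistent_kernel, threadsPerBlock, 0);
// int n = N;
// void* args[] = { &d_arr, &n, &d_result };
// cudaLaunchCooperativeKernel((void*)persistent_kernel, sm_count * blocks_per_sm, threadsPerBlock, args, 0, 0);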
To state the question concretely: it would be very kind if someone could provide an MWE (minimal working example) of this paradigm, i.e. a persistent kernel using cooperative groups with similar specifications to the kernel I currently have (correcting my sketch above wherever it is wrong).
PS: A very necessary reminder on Stack Overflow to be kind, to reflect on one's tone so it does not come across as condescending (even unintentionally), to assume good faith whenever possible, and not to confuse the need to be precise with being aggressive. I would rather my question be left unanswered than have a negative comment/interaction to engage with.