I have a loop where I launch multiple kernels with interdependencies using events and streams.
Here’s the original loop without CUDA graphs:
for (int i = 1; i <= 1024; i++) {
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
To decrease the overhead of multiple kernel launches, I chose to use CUDA graphs.
The kernels take parameters that change on every iteration, and I was unsure how to capture inter-dependent streams when the kernel parameters are dynamic, so I posted a separate question about it.
Based on suggestions from comments, I captured the entire loop in a CUDA graph.
Here is the graph code with start and stop events included:
CUDA_CHECK(cudaEventRecord(start, stream1));
CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
for (int i = 1; i <= 1024; i++)
{
    // origin stream
    kernel1<<<1, 512, 0, stream1>>>(i, /*params*/);
    // fork into stream 2
    cudaEventRecord(event1, stream1);
    cudaStreamWaitEvent(stream2, event1, 0);
    kernel2<<<1, 512, 0, stream1>>>(i, /*params*/);
    kernel3<<<gridDim, blockDim, 0, stream2>>>(i, /*params*/);
    // join stream 2 back to origin
    cudaEventRecord(event2, stream2);
    cudaStreamWaitEvent(stream1, event2, 0);
}
CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0));
CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
// Record the stop event before synchronizing
CUDA_CHECK(cudaEventRecord(stop, stream1));
CUDA_CHECK(cudaEventSynchronize(stop));
However, this approach takes much more time than the original loop. When I profiled the application with Nsight Systems, it showed that cudaEventSynchronize(stop) accounts for a significant share of the time.
What could be causing this increased time, and how can I optimize the graph execution to reduce the synchronization time?
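One thing worth ruling out before drawing conclusions, offered as a minimal sketch and not as a confirmed diagnosis: in the graph version above, the start event is recorded before capture begins, so the interval between start and stop also covers the host-side capture and instantiation, and cudaEventSynchronize(stop) blocks until the whole graph has finished on the GPU. The sketch below captures and instantiates once, does one untimed warm-up launch, and then times only a replay of the graph. The kernels stepA, stepB and stepC and the buffers a, b and c are hypothetical stand-ins for kernel1 to kernel3 and their parameters, and cudaGraphInstantiateWithFlags assumes CUDA 11.4 or newer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-ins for kernel1, kernel2 and kernel3.
__global__ void stepA(int i, float* a) { a[threadIdx.x] += 1.0f * i; }
__global__ void stepB(int i, const float* a, float* b) { b[threadIdx.x] = a[threadIdx.x] + i; }
__global__ void stepC(int i, const float* a, float* c) { c[threadIdx.x] = a[threadIdx.x] - i; }

int main() {
    const int threads = 512;
    const int iters = 1024;

    float *a, *b, *c;
    CUDA_CHECK(cudaMalloc(&a, threads * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&b, threads * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&c, threads * sizeof(float)));
    CUDA_CHECK(cudaMemset(a, 0, threads * sizeof(float)));

    cudaStream_t stream1, stream2;
    CUDA_CHECK(cudaStreamCreate(&stream1));
    CUDA_CHECK(cudaStreamCreate(&stream2));

    cudaEvent_t event1, event2, start, stop;
    CUDA_CHECK(cudaEventCreate(&event1));
    CUDA_CHECK(cudaEventCreate(&event2));
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // One-time cost: capture the fork/join loop and instantiate the graph.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    CUDA_CHECK(cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal));
    for (int i = 1; i <= iters; i++) {
        stepA<<<1, threads, 0, stream1>>>(i, a);
        CUDA_CHECK(cudaEventRecord(event1, stream1));        // fork into stream2
        CUDA_CHECK(cudaStreamWaitEvent(stream2, event1, 0));
        stepB<<<1, threads, 0, stream1>>>(i, a, b);
        stepC<<<1, threads, 0, stream2>>>(i, a, c);
        CUDA_CHECK(cudaEventRecord(event2, stream2));        // join back to stream1
        CUDA_CHECK(cudaStreamWaitEvent(stream1, event2, 0));
    }
    CUDA_CHECK(cudaStreamEndCapture(stream1, &graph));
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&graphExec, graph, 0));

    // Untimed warm-up launch so first-launch setup is not part of the measurement.
    CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
    CUDA_CHECK(cudaStreamSynchronize(stream1));

    // Timed region: only the replay of the already instantiated graph.
    CUDA_CHECK(cudaEventRecord(start, stream1));
    CUDA_CHECK(cudaGraphLaunch(graphExec, stream1));
    CUDA_CHECK(cudaEventRecord(stop, stream1));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("graph replay: %.3f ms\n", ms);

    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaEventDestroy(event1));
    CUDA_CHECK(cudaEventDestroy(event2));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaStreamDestroy(stream1));
    CUDA_CHECK(cudaStreamDestroy(stream2));
    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(c));
    return 0;
}
```

If the replay time measured this way is still higher than the stream-based loop timed with the same start and stop events, then capture and instantiation are not the explanation and the graph execution itself needs a closer look in the profiler.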
Image from Nsight Systems