CUDA Graph Execution Taking Longer Than Original Kernel Launch Loop
I have a loop where I launch multiple kernels with interdependencies using events and streams.
How to Use CUDA Graphs with Interdependent Streams and Dynamic Parameters?
I have a CUDA program with multiple interdependent streams, and I want to convert it to use CUDA graphs to reduce launch overhead and improve performance. My program launches three kernels (`kernel1`, `kernel2`, and `kernel3`) across two streams (`stream1` and `stream2`), with dependencies managed using CUDA events (`event1` and `event2`). The parameters for these kernels are dynamic and need to be updated at each iteration.