I am making a single call to the cublasSasum function from cuBLAS. However, profiling with nsys shows that the kernel it launches (asum_kernel) is executed twice. I am computing the sum over a total of 4096^2 elements.
Time (%)  Total Time (ns)  Instances  Avg (ns)     Med (ns)     Min (ns)   Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------------------------------------------------------------
    17.9          415,655          2    207,827.5    207,827.5      3,264    412,391    289,296.5  void asum_kernel<int, float, float>(cublasAsumParams<T2, T3>)
Is the kernel launched twice because the first launch performs block-level reductions and stores the partial results, and the second launch performs the final reduction over those temporary results?
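To make the question concrete, here is a minimal host-side sketch of the two-pass pattern I have in mind. It is only my guess at the structure, not cuBLAS's actual implementation: the names asum_pass, two_pass_asum, AsumResult and the block_size parameter are all hypothetical, and each call to asum_pass stands in for one kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ONE kernel launch (not the real asum_kernel):
// each "block" reduces its own contiguous chunk of |x[i]| into one partial sum.
std::vector<float> asum_pass(const std::vector<float>& x, std::size_t block_size) {
    assert(block_size > 0);
    std::size_t num_blocks = (x.size() + block_size - 1) / block_size;
    std::vector<float> partial(num_blocks, 0.0f);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::size_t begin = b * block_size;
        std::size_t end = std::min(begin + block_size, x.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[b] += std::fabs(x[i]);
        }
    }
    return partial;
}

// Result of the assumed two-pass scheme, plus how many "launches" it used.
struct AsumResult {
    float sum;
    int launches;
};

// Assumed scheme: pass 1 writes one partial per block, pass 2 reduces the
// partials with a single block. The second pass is unconditional here,
// so the launch count does not depend on the input size.
AsumResult two_pass_asum(const std::vector<float>& x, std::size_t block_size) {
    if (x.empty()) {
        return {0.0f, 0};
    }
    std::vector<float> partial = asum_pass(x, block_size);        // launch 1
    std::vector<float> total = asum_pass(partial, partial.size()); // launch 2
    return {total[0], 2};
}
```

In this sketch, calling two_pass_asum on four elements with block_size = 4 still goes through both passes, because the second pass is never skipped. Whether cuBLAS launches asum_kernel twice for the same reason is exactly what I am unsure about.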
PS: I tried the same with just 4 elements, 4 threads, and a single grid block. The asum_kernel is still called twice, even though one launch should be sufficient in this case.