CuPy comes with a built-in benchmarking function:
import cupy as cp
from cupyx.profiler import benchmark

def my_func(a):
    # Euclidean norm of each row
    return cp.sqrt(cp.sum(a**2, axis=-1))

a = cp.random.random((256, 1024))
print(benchmark(my_func, (a,), n_repeat=20))
which prints output like this:
my_func : CPU: 44.407 us +/- 2.428 (min: 42.516 / max: 53.098) us GPU-0: 181.565 us +/- 1.853 (min: 180.288 / max: 188.608) us
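For reference, the object returned by benchmark can also be inspected directly instead of printed. As far as I can tell from the CuPy docs, it exposes cpu_times (one value per repeat) and gpu_times (one row per device), both in seconds. A minimal sketch of pulling out the means in microseconds; summarize and run_example are my own hypothetical helper names, not part of CuPy:

```python
import statistics


def summarize(cpu_times, gpu_times):
    """Return (mean CPU time, list of per-device mean GPU times) in microseconds.

    Assumes cpu_times holds one value per repeat and gpu_times holds one
    row per device, all in seconds.
    """
    cpu_us = statistics.mean(float(t) for t in cpu_times) * 1e6
    gpu_us = [statistics.mean(float(t) for t in row) * 1e6 for row in gpu_times]
    return cpu_us, gpu_us


def run_example():
    # Requires CuPy and a CUDA device; imported here so that summarize()
    # stays usable without them.
    import cupy as cp
    from cupyx.profiler import benchmark

    def my_func(a):
        return cp.sqrt(cp.sum(a**2, axis=-1))

    a = cp.random.random((256, 1024))
    result = benchmark(my_func, (a,), n_repeat=20)
    cpu_us, gpu_us = summarize(result.cpu_times, result.gpu_times)
    print(f"CPU mean: {cpu_us:.3f} us")
    for i, t in enumerate(gpu_us):
        # i is the row index in gpu_times, in the order devices were benchmarked
        print(f"GPU row {i} mean: {t:.3f} us")
```

Calling run_example() on a machine with a GPU should print means comparable to the ones in the formatted output, and the raw arrays make it possible to compare CPU and GPU time per repeat.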
Both timings are very small in this example, but a separate RawKernel showed much higher values (CPU: ~50000 us, GPU-0: ~50000 us) for a very similar operation.
What do long CPU and long GPU times indicate?