I have a Python program designed to work with large data, but I would also like to get reasonable performance on relatively small data. The program uses numba.cuda for GPU computing. As I found out, with small data most of the time is spent on the call to the @cuda.jit or @cuda.reduce function itself. I can't use asynchronous calls and I can't reduce the number of calls, so I'm wondering whether it's possible to somehow reduce the time a single call takes. All the data is already on the GPU and the functions are compiled, and, to be honest, I'm surprised at how long a call still takes.
Here’s some code that illustrates what I’m talking about.
import numpy as np
import time
from numba import cuda

@cuda.reduce
def CudaMax(a, b):
    return max(a, b)

N = 1000
A = np.random.uniform(low=-10000, high=10000, size=N)
A_device = cuda.to_device(A)
CudaMax(A_device, init=np.finfo(np.float32).min)  # warm up

ts = time.time()
for i in range(1000):
    m = CudaMax(A_device, init=np.finfo(np.float32).min)
print("size:", N, "time:", time.time() - ts)

N = 10000000
A = np.random.uniform(low=-10000, high=10000, size=N)
A_device = cuda.to_device(A)

ts = time.time()
for i in range(1000):
    m = CudaMax(A_device, init=np.finfo(np.float32).min)
print("size:", N, "time:", time.time() - ts)
Result:
size: 1000 time: 0.7654790878295898
size: 10000000 time: 0.9265458583831787
As you can see, the time in my example is almost independent of the array size, which means the function call itself takes quite a long time: about 0.7 ms per call. So, is it possible to reduce this time? Thank you in advance for your answers and help!
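For what it's worth, the only variation I have come up with myself is the sketch below. As far as I understand, numba's Reduce object also accepts an optional res device array; in that case the result is written into the first element of that array and the per-call device-to-host copy of the scalar is skipped (the res_device buffer and the final copy_to_host are just my illustration). I don't know whether this removes enough of the overhead, or whether it conflicts with my "no asynchronous calls" constraint:

import numpy as np
from numba import cuda

@cuda.reduce
def CudaMax(a, b):
    return max(a, b)

A = np.random.uniform(low=-10000, high=10000, size=1000)
A_device = cuda.to_device(A)

# Preallocated 1-element device array to receive the reduction result;
# with res= given, CudaMax writes into it instead of returning a host scalar.
res_device = cuda.device_array(1, dtype=A_device.dtype)

CudaMax(A_device, res=res_device, init=np.finfo(np.float32).min)

# Copy back only when the value is actually needed on the host.
m = res_device.copy_to_host()[0]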