I’m using OpenCL in C++. My GPU is an NVIDIA GeForce RTX 3070.
I have a very simple kernel:
__kernel void op_exp_f(global float* vOut)
{
    const uint i = get_global_id(0);
    vOut[i] = exp(vOut[i]);
}
I build the kernel, allocate GPU memory (with clCreateBuffer and CL_MEM_READ_WRITE), and run the following function:
inline void op_exp(cl_mem buffer, size_t n)
{
    auto& instance = gpuInstance();
    auto& kernel = instance._kernel_op_exp;
    // Bind the buffer as the kernel's only argument.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
    auto queue = instance.GetCommandQueue();
    // Launch n work-items in one dimension, then block until the kernel has finished.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);
}
I measured the time to run op_exp.
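For reference, a host-side measurement of this kind could be taken with a loop like the sketch below. The helper name min_time_us and the repetition count are hypothetical, not necessarily the exact harness behind the numbers that follow. It takes the minimum over several runs, which filters out one-off delays such as the first launch.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of OpenCL): calls `fn` `reps` times and
// returns the fastest single call in microseconds, measured on the host.
// With reps == 0 nothing is run and the result is +infinity.
template <class F>
double min_time_us(F&& fn, std::size_t reps = 100)
{
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t r = 0; r < reps; ++r) {
        const auto t0 = clock::now();
        fn();  // e.g. op_exp(buffer, n), which ends with clFinish
        const auto t1 = clock::now();
        const std::chrono::duration<double, std::micro> dt = t1 - t0;
        best = std::min(best, dt.count());
    }
    return best;
}
```

It would be called as min_time_us([&] { op_exp(buffer, n); }). Because op_exp ends with clFinish, each timed call covers the enqueue, the kernel execution, and the synchronization with the device.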
When the size is between 1 and 130,000 elements, the execution time is roughly constant at 20 µs.
That seems high to me, given that this time doesn’t include building the kernel, allocating memory, or transferring data.
From 130,000 to 16,000,000 elements, the time grows linearly with the size.
What can I do to reduce this 20 µs fixed cost?
(Question edited: I initially wrote that the fixed cost was 20 ms instead of 20 µs.)