In my application I perform the same computation on batches of problems. Each computation requires some intermediate data to be allocated, so I've resorted to function objects that allocate memory based on metadata of my batches (rather than static memory inside functions or the memory-handle idiom employed by other APIs).
In the interest of not allocating excessive amounts of intermediate memory, I'm wondering whether it is possible to query the maximum number of kernels that can execute concurrently on a specific device. That would let me allocate intermediate memory for only that many problems and cycle through it, `memory[i % maxConcurrentKernels]`, roughly as in the sketch below.
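Here is a minimal sketch of the cycling scheme I have in mind; `poolSize` is a placeholder for the number I'd like to query, and the kernel body just stands in for my real computation:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;
    constexpr std::size_t poolSize = 16;        // placeholder for maxConcurrentKernels
    constexpr std::size_t elemsPerProblem = 1024;
    constexpr std::size_t numProblems = 100;

    // One intermediate buffer per potentially concurrent kernel.
    std::vector<float*> memory(poolSize);
    for (auto& slot : memory)
        slot = sycl::malloc_device<float>(elemsPerProblem, q);

    std::vector<sycl::event> lastUse(poolSize);
    for (std::size_t i = 0; i < numProblems; ++i) {
        const std::size_t s = i % poolSize;     // cycle through the pool
        float* scratch = memory[s];
        lastUse[s] = q.submit([&](sycl::handler& h) {
            h.depends_on(lastUse[s]);           // don't reuse a slot still in flight
            h.parallel_for(sycl::range<1>{elemsPerProblem},
                           [=](sycl::id<1> idx) { scratch[idx] = float(i); });
        });
    }
    q.wait();

    for (auto* slot : memory) sycl::free(slot, q);
}
```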
For NVIDIA GPUs I know this number can be found in the CUDA C++ Programming Guide (the "maximum number of resident grids per device" in the compute-capability tables); for modern GPUs it is 128. However, I do not see how, even in CUDA, this number can be queried at runtime, e.g. from the cudaDeviceProp struct.
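As far as I can tell, the closest the CUDA runtime gets is a 0/1 capability flag, not the count itself (a small host-side probe to illustrate):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    // concurrentKernels only says whether concurrent kernels are supported at all.
    std::printf("concurrent kernels supported: %d\n", prop.concurrentKernels);
    // The "maximum resident grids per device" figure (e.g. 128 on recent
    // architectures) only appears in the programming guide's tables; one
    // could map major/minor to it by hand.
    std::printf("compute capability: %d.%d\n", prop.major, prop.minor);
}
```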
Ideally I would like to be able to query this information for any GPU, and probably just default to the number of cores on CPU devices. I know I can query the number of streaming multiprocessors in SYCL (see the query below), but I would prefer to use the maximum-concurrent-kernels number.
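For completeness, this is the SYCL query I am referring to; as I understand it, `max_compute_units` maps to SMs on NVIDIA hardware, but its exact meaning varies per backend:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // List every available device and its compute-unit count.
    for (const auto& dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>() << ": "
                  << dev.get_info<sycl::info::device::max_compute_units>()
                  << " compute units\n";
    }
}
```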
If the number can't be queried from SYCL, I would appreciate it if someone could point me in the direction of the equivalent spec for AMD and Intel GPUs; then I'll write some custom heuristics.