CUDA kernel function parameters are passed to the device through constant memory and have been limited to 4,096 bytes. CUDA 12.1 increases this parameter limit from 4,096 bytes to 32,764 bytes on all device architectures including NVIDIA Volta and above. Before CUDA 12.1, passing kernel arguments exceeding 4,096 bytes required working around the kernel parameter limit by copying excess arguments into constant memory with cudaMemcpyToSymbol.
Original article
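For reference, this is roughly the pre-12.1 workaround the article describes: a minimal sketch where the struct, sizes, and kernel names are illustrative, not taken from the article.

```cpp
#include <cuda_runtime.h>

// Illustrative parameter block (~8 KB), larger than the old 4,096-byte limit.
struct LargeParams {
    float data[2048];
};

// Excess arguments are staged in constant memory instead of the kernel
// parameter buffer.
__constant__ LargeParams d_params;

__global__ void myKernel(float* out) {
    out[threadIdx.x] = d_params.data[threadIdx.x];
}

void launchWithWorkaround(const LargeParams& h_params, float* d_out) {
    // Copy the oversized arguments into the __constant__ symbol first ...
    cudaMemcpyToSymbol(d_params, &h_params, sizeof(LargeParams));
    // ... then launch; the kernel reads its "parameters" from d_params.
    myKernel<<<1, 256>>>(d_out);
}
```

With CUDA 12.1 the same struct could instead be passed by value (`__global__ void myKernel(LargeParams p, float* out)`), since the per-launch parameter limit is now 32,764 bytes.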
The suggested scenario works well when we work only with the default stream, because access to constant memory is serialized. But imagine we launch two kernels concurrently in two different streams and need a separate set of Large Kernel Parameters for each kernel. Then we have to partition constant memory somehow so that both sets of parameters reside in constant memory without interleaving, because constant memory is shared between these two kernels. Synchronizing access to constant memory adds another layer of complexity, correct? This is different from default kernel parameters, which are also allocated in constant memory, but, in my understanding, the runtime automatically allocates separate, non-interleaved regions for each pack of parameters.
Are there practical ways to use this approach to pass Large Kernel Parameters to concurrent kernels launched in different streams?
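To make the question concrete, here is a sketch of the kind of manual partitioning I mean: one `__constant__` symbol per stream, with the copy ordered before the launch within each stream (symbol and kernel names are mine, not from the article).

```cpp
// One __constant__ region per stream so the two parameter sets never
// overwrite each other while the kernels run concurrently.
__constant__ LargeParams d_params0;
__constant__ LargeParams d_params1;

__global__ void kernelA(float* out) { out[threadIdx.x] = d_params0.data[threadIdx.x]; }
__global__ void kernelB(float* out) { out[threadIdx.x] = d_params1.data[threadIdx.x]; }

void launchConcurrent(cudaStream_t s0, cudaStream_t s1,
                      const LargeParams& h0, const LargeParams& h1,
                      float* out0, float* out1) {
    // Within each stream the copy is ordered before the kernel launch,
    // so no extra cross-stream synchronization should be needed.
    cudaMemcpyToSymbolAsync(d_params0, &h0, sizeof(LargeParams), 0,
                            cudaMemcpyHostToDevice, s0);
    kernelA<<<1, 256, 0, s0>>>(out0);

    cudaMemcpyToSymbolAsync(d_params1, &h1, sizeof(LargeParams), 0,
                            cudaMemcpyHostToDevice, s1);
    kernelB<<<1, 256, 0, s1>>>(out1);
}
```

Is this the sort of pattern that is considered practical here, or is there a better-supported way?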