I currently have a C/CUDA program that uses multiple CPU threads to generate values in parallel, which are then passed to kernel functions running on a GPU.
I see two general ways of handling the workflow:
- Divide the work among CPU threads, and have each thread launch the GPU kernel on its own portion of the data.
- Divide the work among CPU threads, wait until all threads have finished, consolidate all values into a single array, and pass it to the GPU in one launch.
A priori, is one of these options clearly more efficient than the other? Or is the answer case-specific? If case-specific, which parameters most determine efficiency?
Thank you for your time.