I am working on a project in CUDA that needs to make careful use of global memory. In my project, I have wrapped all calls to cudaMalloc in an allocator / vector so as to implement RAII. This means that, when debugging, I can carefully track the exact amount of memory that I am using.
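For context, the wrapper is conceptually something like the following (a simplified sketch; device_buffer and g_allocated_bytes are illustrative names, and the real code is more involved):

#include <cuda_runtime.h>
#include <atomic>
#include <cstddef>
#include <stdexcept>

// Running total of the bytes I have explicitly requested via cudaMalloc.
std::atomic<std::size_t> g_allocated_bytes{0};

// Minimal RAII wrapper: cudaMalloc on construction, cudaFree on destruction,
// updating the counter both ways so I can report usage while debugging.
struct device_buffer
{
    void*       ptr   = nullptr;
    std::size_t bytes = 0;

    explicit device_buffer(std::size_t n) : bytes(n)
    {
        if (cudaMalloc(&ptr, bytes) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
        g_allocated_bytes += bytes;
    }

    ~device_buffer()
    {
        cudaFree(ptr);
        g_allocated_bytes -= bytes;
    }

    device_buffer(const device_buffer&)            = delete;
    device_buffer& operator=(const device_buffer&) = delete;
};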
However, I recently noticed that my program ends up using a lot more memory than I am calculating. I have been careful to ensure that I am not allocating more global memory than I think I am, and I am confident that over-allocation on my part is not the cause.
I have included the above image to show the total amount of global memory used by the program over time. The blue curve is the report I get from instrumenting all global allocations and deallocations, and the red curve is what I get when I check using nvidia-smi
during runtime. The horizontal axis is the execution time.
You’ll notice that around the 150 mark on the horizontal axis there is a jump in the amount of global memory used, but otherwise the trends are virtually identical (note that the red curve was not sampled as frequently as the blue one).
I traced this jump in memory usage to a function call in a kernel. The code is far too complicated to post here, but the following example should capture what’s happening.
I have a “recursive” function that is called deep inside a kernel. The function looks something like this:
#include <cstddef>

// Forward declaration so the entry point can call the templated implementation.
template <const int buf_size, const int lev>
__device__ bool do_something_impl(int i);

__device__ bool do_something(int i)
{
    return do_something_impl<64, 3>(i);
}

template <const int buf_size, const int lev>
__device__ bool do_something_impl(int i)
{
    std::size_t buf[buf_size];
    // do some calculations on buf
    bool condition = buf[1] > buf[0]; // example
    if (!condition)
    {
        if constexpr (lev >= 0)
        {
            // try again with a larger buffer at the next level down
            return do_something_impl<2 * buf_size, lev - 1>(i);
        }
        else
        {
            return false; // (error here)
        }
    }
    else
    {
        return false;
    }
}
Of course, this is not the actual function, but it illustrates the idea. I put “recursive” in quotes because, while the implementation is recursive, templating is used to ensure that the stack size is known at compile time. The values <64, 3>
are chosen in the hope that they provide sufficient stack space. I have to store these values on the stack for reasons I don’t care to elaborate on.
I suspect the CUDA runtime is quietly allocating some global memory behind the scenes to make room for all the stack space this kernel requires.
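For reference, the per-thread stack reservation can at least be queried from the host with cudaDeviceGetLimit; a minimal sketch (error handling omitted):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main()
{
    // Query how much stack the runtime reserves per thread for device code.
    std::size_t stack_bytes = 0;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    std::printf("per-thread stack size: %zu bytes\n", stack_bytes);
    return 0;
}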
My questions are:
- Is the jump in global memory usage possibly coming from this situation?
- If so, is there a way to release the global memory used for this purpose?