I am working on training a multimodal model in PyTorch. I have a training loop which runs just fine (albeit slowly) on my CPU (I tested up to batch size = 32). However, when I try to run it on a GPU (Tesla P40), it only works up to batch size = 2; with larger batch sizes it throws a `torch.cuda.OutOfMemoryError`. I am working with pre-embedded video and audio, and pre-tokenized text. Is it possible that the GPU really cannot handle batch sizes larger than 2, or could there be something wrong in my code? Do you have any advice on how I might go about troubleshooting? I apologize if this is a simple question; it is my first time working with a GPU cluster. I am running this code on my university's GPU cluster and have double-checked that the GPU I am using is not being used by anyone else.
I tried to examine memory usage on both the CPU and the GPU. By running `nvidia-smi`, I found that each GPU has a memory limit of 23040 MiB. I used

```python
print(f'Allocated: {torch.cuda.memory_allocated() / 1024**2} MB')
print(f'Cached: {torch.cuda.memory_reserved() / 1024**2} MB')
```

to track the GPU's memory usage and found that after all data is loaded with a batch size of 8, only around 500 MB are allocated and around 2000 MB are cached. The error tends to occur when calling BERT to embed the text tokens, but may occur later with smaller batch sizes. I also double-checked that all tensors are being properly loaded onto the GPU.
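For reference, the step where the error is raised looks roughly like this (a simplified sketch rather than my exact code; the checkpoint name, data loader, and batch field names are placeholders):

```python
import torch
from transformers import BertModel

device = torch.device('cuda')
bert = BertModel.from_pretrained('bert-base-uncased').to(device)  # placeholder checkpoint

for batch in train_loader:  # video/audio are pre-embedded, text is pre-tokenized
    input_ids = batch['input_ids'].to(device)            # (batch_size, seq_len)
    attention_mask = batch['attention_mask'].to(device)

    print(f'Allocated: {torch.cuda.memory_allocated() / 1024**2} MB')
    print(f'Cached: {torch.cuda.memory_reserved() / 1024**2} MB')

    # This is the call that raises torch.cuda.OutOfMemoryError once batch size > 2
    text_emb = bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    # ... text_emb is then fused with the pre-embedded video/audio and passed
    # through the rest of the model, loss.backward(), optimizer.step(), etc.
```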