I am experiencing a steady memory increase, up to saturation, while training a deep learning model with PyTorch under WSL2. The same code does not show this behavior on a native Linux OS; the only differences between the environments are the PyTorch version, the CUDA version, and the OS. I have tested this on two different setups:
Setup 1:
- OS: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python: 3.12.2
- PyTorch: 2.2.2
- CUDA (pip packages): nvidia-cublas-cu12==12.1.3.1, nvidia-cuda-cupti-cu12==12.1.105, nvidia-cuda-nvrtc-cu12==12.1.105, nvidia-cuda-runtime-cu12==12.1.105, nvidia-cudnn-cu12==8.9.2.26, nvidia-cufft-cu12==11.0.2.54, nvidia-curand-cu12==10.3.2.106, nvidia-cusolver-cu12==11.4.5.107, nvidia-cusparse-cu12==12.1.0.106, nvidia-nccl-cu12==2.19.3, nvidia-nvjitlink-cu12==12.4.127, nvidia-nvtx-cu12==12.1.105
- Memory Utilization Profile: Memory increases during model training.
[screenshot: memory utilization profile for setup1]
Setup 2:
- OS: Linux-5.15.0-1058-aws-x86_64-with-glibc2.31
- Python: 3.10.14
- PyTorch: 2.2.0
- CUDA: unknown
- Memory Utilization Profile: Memory does not increase during model training.
[screenshot: memory utilization profile for setup2]
Issue:
With the exact same code, the memory utilization profile is different between the two setups. In Setup 1, memory saturates quickly during the training process, while in Setup 2, the memory usage remains stable and does not reach saturation.
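To make the profiles concrete, here is a minimal, hypothetical sketch of the kind of per-step monitoring involved; the toy model, the random data, and the use of `psutil` for the host-side measurement are stand-ins for my actual training code, not the real thing:

```python
import psutil
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data, just to illustrate the measurement.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
process = psutil.Process()  # this Python process

for step in range(1000):
    inputs = torch.randn(64, 512, device=device)
    targets = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        rss_gb = process.memory_info().rss / 1e9  # host (CPU) RAM held by the process
        cuda_gb = torch.cuda.memory_allocated() / 1e9 if device == "cuda" else 0.0
        print(f"step {step}: host RSS {rss_gb:.2f} GB, CUDA allocated {cuda_gb:.2f} GB")
```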
Questions:
- Why might the memory utilization differ so significantly between these two setups despite using the same code?
- Are there any known issues with specific versions of PyTorch or particular Linux distributions that could cause such behavior?
- What can I do to ensure consistent memory utilization across different environments?
Steps Taken:
- Verified that the same code is used on both setups.
- Monitored memory usage throughout training (see the monitoring sketch above).
- Compared Python, PyTorch, and CUDA library versions used in both environments (roughly as in the sketch below).
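For reference, the version information listed under each setup was gathered roughly like this (a small sketch; the exact commands may have differed):

```python
import platform
import sys

import torch

print("OS:", platform.platform())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```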
Any insights or suggestions would be greatly appreciated. Thank you!