I just set up an Alienware R16 for CNN model training. The gaming PC has 16 GB RAM, a 1 TB SSD, and an Nvidia RTX 3070 Super with 12 GB of VRAM. CUDA 12.6 and TensorFlow 2.16.1 (no TensorRT installed) are installed on the PC.
There was no error when training a simple model on the Ubuntu 24 PC. However, when training a model with some complexity, the training process was killed without any message right after the first epoch started. Here is the output of nvidia-smi:
Here is the nvcc output:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
But I was able to train the same model on a MacBook Air M1 with only 8 GB of RAM. The total memory consumption on the Mac is about 27 GB. Sometimes I have to reboot the Mac to clear everything out before starting the training.
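The ~27 GB figure on an 8 GB Mac suggests macOS is swapping/compressing heavily, which Linux won't do by default before the OOM killer steps in. Here is my rough back-of-envelope estimate of how a dataset can blow past 16 GB of host RAM; the sample count and image size below are hypothetical placeholders, not my actual data:

```python
# Back-of-envelope estimate of host RAM needed to hold an image dataset
# as a single float32 array (hypothetical numbers, not my actual dataset).

def dataset_bytes(n_samples: int, height: int, width: int, channels: int,
                  bytes_per_value: int = 4) -> int:
    """Bytes needed for one in-memory copy of the dataset."""
    return n_samples * height * width * channels * bytes_per_value

# Example: 100k RGB images at 224x224, fully loaded as float32:
one_copy = dataset_bytes(100_000, 224, 224, 3)
print(f"one copy: {one_copy / 2**30:.1f} GiB")  # ~56 GiB

# If preprocessing also keeps the raw uint8 data alongside the float32
# copy, peak usage is higher still -- easily past 16 GB, at which point
# the Linux OOM killer silently kills the process (macOS swaps instead,
# which would explain the ~27 GB total I see there).
```

So even a modestly sized dataset, if loaded eagerly rather than streamed, can explain the silent kill.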
My guess is that the system runs out of memory. The PC with the RTX 3070 is much more powerful than the MacBook Air, so I'm not sure how it manages to run out of memory. Is there a way to tell the training process on Ubuntu 24 to use swap when it runs out of memory? Or is there another way to solve the problem?
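If the answer is simply "add swap", is something like the following the right approach? These are the standard Ubuntu swapfile steps as I understand them (the 32G size is just an example), plus a check for whether the OOM killer was responsible:

```shell
# Check whether the kernel's OOM killer terminated the training process:
sudo dmesg | grep -iE "out of memory|oom-killer"

# Create and enable a swapfile (example size: 32 GiB):
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make it persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify swap is active:
swapon --show
free -h
```

Or is relying on swap for training a bad idea, and should I be streaming the data instead of loading it all into memory?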