I am running a fine-tuning experiment on 8 GPUs, and the nvidia-smi command gives me the following output:
Mon Aug 19 12:16:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1C.0 Off |                    0 |
| N/A   61C    P0             118W / 400W |   6223MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1D.0 Off |                    0 |
| N/A   52C    P0              88W / 400W |   9153MiB / 40960MiB |     14%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1C.0 Off |                    0 |
| N/A   64C    P0             112W / 400W |   9153MiB / 40960MiB |     14%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   53C    P0              93W / 400W |   9153MiB / 40960MiB |     21%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1C.0 Off |                    0 |
| N/A   62C    P0             103W / 400W |   9153MiB / 40960MiB |     21%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1D.0 Off |                    0 |
| N/A   53C    P0             116W / 400W |   9153MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1C.0 Off |                    0 |
| N/A   65C    P0             388W / 400W |   9191MiB / 40960MiB |      6%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1D.0 Off |                    0 |
| N/A   44C    P0              83W / 400W |    423MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    1   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    2   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    3   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    4   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    5   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    6   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
|    7   N/A  N/A       198      C   /usr/bin/python3                              0MiB |
+---------------------------------------------------------------------------------------+
Although GPU utilization is above 0% on most of the GPUs, the Processes table reports a GPU memory usage of 0MiB for the python3 process on all 8 GPUs. Why would the memory usage be 0 while the GPUs are being utilized?
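For cross-checking, the per-GPU memory and utilization figures from the upper table can also be read programmatically. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH and using its CSV query mode; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical and not part of my training script.

```python
import subprocess

# nvidia-smi CSV query: one row per GPU, numbers only (MiB and percent).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, memory.used, utilization.gpu' CSV rows into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, mem_mib, util_pct = (field.strip() for field in line.split(","))
        rows.append(
            {"gpu": int(index), "mem_mib": int(mem_mib), "util_pct": int(util_pct)}
        )
    return rows


def query_gpus():
    """Run nvidia-smi (assumed to be installed) and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` on the machine above should return the same per-GPU numbers as the upper table (for example 6223 MiB on GPU 0), which are the device-level totals rather than the per-process figures in the Processes table.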