I am trying to run a training job on my Slurm cluster.
When I submit train.sh with sbatch, I get the following error:
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 81000
How can I fix this?
The three scripts involved are: 1. train.sh, 2. distributed_pretrain.sh, and 3. train_pretrain.py.
- train.sh
- distributed_pretrain.sh
- train_pretrain.py (logic dealing with GPU allocation)
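
For context, the error says that rank 0 and rank 1 both opened the same CUDA device, so the GPU-selection logic is the place to look. Below is a minimal sketch of per-process device selection, not the actual code from train_pretrain.py. The helper name `pick_local_device` is hypothetical, and the sketch assumes the launcher exports either `LOCAL_RANK` (set by torchrun) or `SLURM_LOCALID` (set by srun) for each process.

```python
import os


def pick_local_device(env=None):
    """Return the CUDA device index this process should use on its node.

    Hypothetical helper for illustration only. It assumes one process per
    GPU and that the launcher exports LOCAL_RANK or SLURM_LOCALID.
    """
    env = os.environ if env is None else env

    # Prefer LOCAL_RANK (torchrun); fall back to SLURM_LOCALID (srun).
    for key in ("LOCAL_RANK", "SLURM_LOCALID"):
        if key in env:
            local_rank = int(env[key])
            break
    else:
        raise RuntimeError("neither LOCAL_RANK nor SLURM_LOCALID is set")

    # If the scheduler restricts the visible GPUs for this task, wrap the
    # index so it stays within the number of visible devices.
    visible = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]
    if visible:
        return local_rank % len(visible)
    return local_rank


if __name__ == "__main__":
    # Simulate two tasks on one node that each see two GPUs.
    for local_id in ("0", "1"):
        fake_env = {"SLURM_LOCALID": local_id, "CUDA_VISIBLE_DEVICES": "0,1"}
        print("task", local_id, "-> device", pick_local_device(fake_env))
```

In a PyTorch script, the returned index would typically be passed to `torch.cuda.set_device(...)` before `torch.distributed.init_process_group(...)` is called, so that each rank ends up on a different GPU. Whether this matches the setup here depends on how train.sh requests GPUs and tasks.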