I am trying to refine a bash job script that requests 5 nodes with 1 GPU each (and thus 1 task per node), so that I can run a 5-fold cross-validation with all folds in parallel.
My script currently looks like this:
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p <partition>
#SBATCH --time 5:00:00
#SBATCH -N 5
#SBATCH --gres=gpu:1
#SBATCH --mem=50000
#SBATCH --job-name=<jobname>
#SBATCH --error=file.err
#SBATCH --output=file.out
#SBATCH --ntasks-per-node=1
for fold in {0..4}; do
    srun -N1 -n1 --gres=gpu:1 --exclusive bash file.sh --fold ${fold} --device cuda:0 &
done
wait
I have two doubts about this job:
Does each srun command actually start its command on a different one of the allocated nodes?
The CUDA device index available on a node is not always 0, so sometimes one of the folds fails to start in parallel. How is this possible?
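Both points could be checked with a small diagnostic printed by each job step. The following is only a sketch: `report_step` is a hypothetical helper that is not part of the script above, and it assumes that Slurm exports `SLURM_STEP_ID` and `CUDA_VISIBLE_DEVICES` to the step, which depends on how GPUs are configured on the cluster.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): reports which node a job
# step landed on and which GPU indices are visible to it.
report_step() {
    fold="$1"
    # SLURM_STEP_ID and CUDA_VISIBLE_DEVICES are assumed to be set by Slurm
    # inside the step; fallbacks are printed when they are missing.
    echo "fold=${fold} host=$(hostname) step=${SLURM_STEP_ID:-none} gpus=${CUDA_VISIBLE_DEVICES:-unset}"
}

# Possible use at the top of file.sh, assuming it is called as
#   bash file.sh --fold <n> --device cuda:0
# so that the fold number is the second argument:
#   report_step "$2"
```

If the five lines in file.out show five different host names, the steps were spread over the allocated nodes. The `gpus=` field shows the device indices each step was actually given, which can be compared with the hard-coded `cuda:0`.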
I hope the question is clear. Any corrections are welcome.