I am running a set of parallel processes under Slurm using Python and TensorFlow. I’ve generated a command file to be sourced, with N lines (typically 20-100), each one launching a TensorFlow training run. I already have code to allocate GPUs, so I don’t need Slurm to do that. I’m using sbatch to schedule the job as a job array so I can request chunks of a few hours at a time; each new array task restarts all N trainings, typically 50 chunks of 3 hours each to train for about a week.
So in the command file I prepend srun and whatever options I need to each line. The desired result is to have the tasks distributed evenly across the allocated nodes; I have exclusive access to each node once it’s allocated to my job, so I don’t think I need Slurm to manage the other resources (CPU, GPU, memory), I just want the processes assigned to a node to all run in parallel on it.
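For reference, the submission side looks roughly like this (a simplified sketch; the array size, time limit, and node count are illustrative, not my exact values):

#!/bin/bash
#SBATCH --array=1-50%1       # ~50 restarts, run one array task at a time
#SBATCH --time=03:00:00      # one ~3 hour chunk per array task
#SBATCH --nodes=4            # however many nodes the trainings need

# <command_file> has one "srun ... <command> &" line per training
source <command_file>
wait                         # keep the array task alive until every training exits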
So I tried this for each line in the file:
srun --mem 0 --exclusive --nodes 1 --ntasks 1 <command> &
and all the commands ran on the same node, even though man srun says “Cyclic distribution is the default behavior if the number of tasks is no larger than the number of allocated nodes.”
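(In case it matters, the step-to-node mapping is easy to inspect after the fact with something like

sacct -j <jobid> --format=JobID,NodeList,State

which lists each job step and the node it ran on; that’s where I see every step landing on the same node.)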
So then I tried this:
srun --mem 0 --overlap --nodes 1 --ntasks 1 --distribution=cyclic <command> &
but the first step went to node 0, the second to node 1, and all the rest went to node 0.
So my workaround is to use
--mem 0 --overlap --nodes 1 --ntasks 1 --relative=_NODE_
on each line and then run
awk -i inplace -v nodes="$SLURM_JOB_NUM_NODES" '{gsub(/_NODE_/, (NR-1) % nodes); print}' <command_file>
on the command file to do the distribution explicitly.
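To make the substitution concrete: if the job gets (say) a 4-node allocation, then a line such as

srun --mem 0 --overlap --nodes 1 --ntasks 1 --relative=_NODE_ <command> &

becomes, for the third line of the file (NR=3, so (NR-1) % 4 = 2),

srun --mem 0 --overlap --nodes 1 --ntasks 1 --relative=2 <command> &

and the trainings round-robin over the relative node indices 0 through 3. At least that works, but it seems like Slurm should have a way to do this.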