I have a program that doesn't support multi-threading, and I want to run it many times in parallel with different arguments. I have about 1000 CPUs available, spread across nodes with 16 to 64 cores each, so the runs need to be distributed over several nodes. Each run takes its own set of arguments.
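To illustrate, a single serial run looks roughly like this (the values 5 and 7 and the output path simply mirror what my script below generates; only the -c value and the output file change between runs):
./gentourng5 -b 7 -c 0 results/5_7/0.txt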
I have the following bash script that creates and submits a SLURM script:
#!/bin/bash
arg1=5
arg2=7
total_runs=1000
directory="results/${arg1}_${arg2}"
mkdir -p $directory
tmp_slurm_script=$(mktemp /tmp/slurm_script.XXXXXX)
cat <<EOT > $tmp_slurm_script
#!/bin/bash
#SBATCH --output=${directory}/program_%A_%a.out
#SBATCH --error=${directory}/program_%A_%a.err
#SBATCH --mem-per-cpu=100M
#SBATCH --ntasks=${total_runs}
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --job-name=${arg1}_${arg2}
#SBATCH --array=0-999
ARGUMENTS="-b ${arg2} -c $SLURM_ARRAY_TASK_ID ${directory}/$SLURM_ARRAY_TASK_ID.txt"
srun --exact -N1 -c1 -n1 ./gentourng${arg1} $ARGUMENTS &
wait
EOT
sbatch "$tmp_slurm_script"
rm "$tmp_slurm_script"
When I run the shell script, it creates the temporary SLURM script and submits it. SLURM then runs the program invocations one after another instead of side by side on the 1000 available CPU cores. What I want is all 1000 runs executing in parallel, each using a single CPU core.
I also tried --exclusive instead of --exact, but then I got the following message:
srun: warning: can't run 1 processes on 30 nodes, setting nnodes to 1
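For clarity, the only change for that attempt was this line inside the heredoc (everything else stayed the same):
srun --exclusive -N1 -c1 -n1 ./gentourng${arg1} \$ARGUMENTS &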