I’m running a single Slurm array job with many tasks. Each task may fail and be restarted, and each task needs to know how many times it has been restarted. I was hoping the environment variable SLURM_RESTART_COUNT
would solve this, but it seems to increment every time any task is restarted (i.e., it counts restarts for the whole job, not just that one task). Does Slurm save the per-task restart count somewhere I’m not seeing, or do I need to parse the sacct
logs to get that info?
Unfortunately, Slurm does not directly provide a task-specific restart count, and as you observed, SLURM_RESTART_COUNT
reflects the global count.
As you mentioned, parsing the sacct output is one solution. A second solution is naive but effective, and you may already have thought of it.
Write the task ID to a shared file when a task is first launched, and whenever that task is restarted, update the counter associated with its task ID. You can use the SLURM_ARRAY_TASK_ID
environment variable for this.
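
The shared-file idea could look roughly like the sketch below. It is a minimal illustration, not a tested production setup. The function name `record_launch`, the JSON layout, the default file name `restart_counts.json`, and the `RESTART_COUNTER_FILE` variable are all made up for this example. Only SLURM_ARRAY_TASK_ID comes from Slurm. It also assumes the counter file sits on a filesystem shared by all nodes where `flock` locking works, which is not guaranteed on every NFS configuration.

```python
import fcntl
import json
import os


def record_launch(task_id, counter_path):
    """Increment and return the launch count stored for task_id.

    The first launch returns 1, so the restart count is the result minus 1.
    (Hypothetical helper for illustration; not part of Slurm.)
    """
    # Open for reading and writing, creating the file if it is missing.
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        # Exclusive lock so concurrently starting tasks do not overwrite
        # each other's updates. The lock is released when the file closes.
        fcntl.flock(f, fcntl.LOCK_EX)
        raw = f.read()
        counts = json.loads(raw) if raw.strip() else {}
        counts[task_id] = counts.get(task_id, 0) + 1
        # Rewrite the whole file with the updated counters.
        f.seek(0)
        f.truncate()
        json.dump(counts, f)
        f.flush()
        os.fsync(f.fileno())
    return counts[task_id]


if __name__ == "__main__":
    # SLURM_ARRAY_TASK_ID is set by Slurm for array tasks; "0" is only a
    # fallback so the script also runs outside of Slurm.
    task_id = os.environ.get("SLURM_ARRAY_TASK_ID", "0")
    # RESTART_COUNTER_FILE is an assumed, user-defined variable.
    path = os.environ.get("RESTART_COUNTER_FILE", "restart_counts.json")
    launches = record_launch(task_id, path)
    print(f"task {task_id}: restart count = {launches - 1}")
```

Calling this once at the start of each task gives the task its own count. If you submit more than one array job, you would probably want a separate counter file per job, for example by including SLURM_ARRAY_JOB_ID in the file name, so counts from different submissions do not mix.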