Advanced topics: Slurm
For some workflows and applications, functionality that goes beyond the basic usage of the Slurm batch system is required. Information on the following topics is given below:
- Array jobs
- Chain jobs
- Chain jobs with dependencies
- Job priorities
- Exclusive jobs for benchmarking
- Specific clock frequency
Array jobs
Array jobs can be used to submit multiple jobs that share the same parameters, like executable and resource requirements. They can be controlled and monitored as a single unit. The Slurm option is -a, --array=<indexes>, where the parameter indexes specifies the array index values that should be used. The following specifications are possible:
- comma separated list, e.g., --array=0,1,2,17
- range based, e.g., --array=0-15
- mix of comma separated and range based, e.g., --array=0,1,10-12
- step based, e.g., --array=0-15:4
A maximum number of simultaneously running tasks from the job array may be specified using the % separator. The specification --array=0-20%6 limits the number of simultaneously running tasks from this job array to 6.
Within the job, two specific environment variables are available: SLURM_ARRAY_JOB_ID is set to the first job ID of the array, and SLURM_ARRAY_TASK_ID is set individually for each array element. You can use these variables inside the job script to distinguish between array elements.
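As a minimal sketch (job name, walltime, executable, and file names are placeholders), an array job script could use SLURM_ARRAY_TASK_ID to select a separate input file for each array element:

#!/bin/bash -l
#SBATCH --job-name=array-example
#SBATCH --time=01:00:00
#SBATCH --array=0-15%4      # 16 array tasks, at most 4 running simultaneously

# Each array element processes its own input and output file,
# distinguished by the array task ID.
./a.out input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log

In the output of squeue, the individual array tasks show up as <SLURM_ARRAY_JOB_ID>_<index>.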
Chain jobs
For some calculations, it is beneficial to automatically submit a subsequent job after the current run has finished. This can be achieved by including the submit command in your job script. However, keep in mind that the job will always resubmit itself, even if something goes wrong at runtime, e.g. because of missing input files. This can lead to jobs running wild until they are manually aborted. To prevent this from happening, the job should only be resubmitted if it has run for a sufficiently long time. The following approach can be used:
#!/bin/bash -l
if [ "$SECONDS" -gt "7000" ]; then
    cd ${SLURM_SUBMIT_DIR}
    sbatch job_script
fi
The bash environment variable $SECONDS contains the run time of the shell in seconds. Please note that it is not defined for csh.
On the TinyX clusters, sbatch also has to be used within the submit scripts, instead of the machine-specific sbatch.tinyx.
Chain jobs with dependencies
In contrast to the previously mentioned chain jobs, this functionality can be used if your job relies on the results of more than one preceding job. Slurm has an option -d, --dependency=<dependency_list> to specify that a job is only allowed to start if specific conditions are satisfied. --dependency=afterany:job_id[:job_id] will start the job when all of the listed jobs have terminated.
There are a number of other possible specifications for <dependency_list>. For full details, please consult the official Slurm documentation.
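For illustration (the script names are placeholders), a post-processing job that depends on two preceding jobs could be submitted as follows; the --parsable option of sbatch prints only the job ID, which makes it easy to capture in a shell variable:

$ JOBID1=$(sbatch --parsable part1.sh)
$ JOBID2=$(sbatch --parsable part2.sh)
$ sbatch --dependency=afterany:${JOBID1}:${JOBID2} postprocessing.sh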
Job priorities
The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters, like waiting time, partition, user group, and recently used CPU time (a.k.a. fairshare). The ordering of waiting jobs listed by squeue does not reflect the priority of jobs.
If your job is not starting straight away, you can see the reason for the delay in the column NODELIST(REASON).
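Your currently pending jobs, including the reason column, can be listed for example with:

$ squeue -u $USER -t PENDING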
In the following table, some of the most common reasons are listed. <Resource> can be a limit for any generic resource (number of GRES (GPUs), nodes, CPUs, or concurrently running/queued jobs); e.g., AssocGrpGRES specifies that all GPUs assigned to your association or group are currently in use.
Reason | Description
Priority | One or more higher priority jobs are queued. Your job will eventually run.
Dependency | This job is waiting for a dependent job to complete and will run afterwards.
Resources | The job is waiting for resources to become available and will eventually run.
AssociationGroup<Resource>Limit | All resources assigned to your association/group are in use; the job will run eventually.
QOSGrp<Resource>Limit | All resources assigned to the specified QoS are in use; the job will run eventually.
Partition<Resource>Limit | All resources assigned to the specified partition are in use; the job will run eventually.
ReqNodeNotAvail | A node specifically required by the job is not currently available. The node may be in use, reserved, or currently unavailable (e.g. drained, down, or not responding). In the latter case, it may be necessary to cancel the job and rerun it with other node specifications.
Exclusive jobs for benchmarking
On some HPC clusters (e.g. Alex and TinyX), compute nodes are shared among multiple users and jobs. Resources like GPUs and compute cores are never shared. In some cases, e.g. for benchmarking, exclusive access to the compute node can be desired. This can be achieved by using the Slurm parameter --exclusive.
Setting --exclusive only makes sure that there will be no other jobs running on your nodes. It will not automatically give you access to all resources of the node without explicitly requesting them. This means you still have to specify your desired number of GPUs via the gres parameter, and you still have to request all cores of a node, if you need them, via --ntasks or --cpus-per-task.
Independent of your resource allocation and usage, exclusive jobs will be billed with all available resources of the node.
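As an illustrative sketch only (the GPU type, GPU count, and core count are placeholders that have to be adapted to the respective node type), a job script header for an exclusive benchmarking job could look like this:

#!/bin/bash -l
#SBATCH --exclusive                 # no other jobs on the allocated node
#SBATCH --gres=gpu:a40:8            # GPUs still have to be requested explicitly (placeholder type/count)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64          # cores still have to be requested explicitly (placeholder count)
#SBATCH --time=01:00:00

srun ./a.out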
Specific clock frequency
By default, the compute nodes at RRZE run with turbo mode enabled and the “ondemand” governor. A specific CPU clock frequency can also be requested. This is not something you will usually want to do, but it can be used for certain kinds of benchmarking. Note that you cannot make the CPUs go any faster, only slower, as the default already is turbo mode, which makes the CPU clock as fast as it can without exceeding its thermal or power budget. So please do not use any of the following options unless you know what you are doing. likwid-setFrequencies is not supported on the clusters.
With Slurm, the frequency (in kHz) can be specified using the --cpu-freq option of srun. For a pure MPI code, using srun to start the processes with a fixed clock speed of 1.8 GHz works as follows:
$ srun --cpu-freq=1800000-1800000:performance <more-srun-options> ./a.out <arguments>
You can also use the --cpu-freq option with salloc or sbatch; however, the frequency will only be set once srun is called inside an allocation, i.e., within the job script.
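A minimal job script sketch (task count and walltime are placeholders) that applies the fixed frequency via srun inside the script:

#!/bin/bash -l
#SBATCH --ntasks=72                 # placeholder; adapt to the cluster
#SBATCH --time=01:00:00

# The requested frequency only takes effect when srun launches the processes.
srun --cpu-freq=1800000-1800000:performance ./a.out <arguments>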
If you choose to employ likwid-mpirun, you have to tell it that the CPU frequency option must be handed down to Slurm:
$ likwid-mpirun <further-likwid-options> -mpi slurm --mpiopts "--cpu-freq=1800000-1800000:performance" ./a.out <arguments>