Batch Processing
All of the HPC clusters (with the exception of a few special machines) run under the control of a batch system. All user jobs except short serial test runs must be submitted to the cluster through this batch system. The submitted jobs are then routed into a number of queues (depending on the needed resources, e.g. runtime) and sorted according to some priority scheme.
A job will run when the required resources become available. On most clusters, a number of nodes is reserved during working hours for short test runs with less than one hour of runtime. These nodes are dedicated to the `devel` queue. We do not allow MPI-parallel applications on the frontends; short parallel test runs must be performed as batch jobs.
It is also possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.
The older clusters use software called Torque as the batch system; newer clusters, starting with "meggie", use Slurm instead. Sadly, there are many differences between the two systems. We will describe both below.
Torque
Commands for Torque
The command to submit jobs is called `qsub`. To submit a batch job use
qsub <further options> [<job script>]
The job script may be omitted for interactive jobs (see below). After submission, `qsub` will output the job ID of your job. It can later be used for identification purposes and is also available as the environment variable `$PBS_JOBID` in job scripts (see below). These are the most important options of the `qsub` command:
Option | Meaning
---|---
`-N <job name>` | Specifies the name that is shown by `qstat`. If the option is omitted, the name of the batch script file is used.
`-l nodes=<# of nodes>:ppn=<nn>` | Specifies the number of nodes requested. All current clusters (except the SandyBridge partition within Woody) require you to always request full nodes. Thus, for Emmy you always need to specify `:ppn=40`, and for Woody (usually) `:ppn=4`. For other clusters, see the documentation of the respective cluster for the correct `ppn` values.
`-l walltime=HH:MM:SS` | Specifies the required wall clock time (runtime). When the job reaches this walltime it is sent a `TERM` signal; after a few seconds, if it has not ended yet, it is sent `KILL`. If you omit the walltime option, a very short default time is used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
`-M x@y -m abe` | Sends e-mail to `x@y` when the job is aborted (`a`), begins (`b`), or ends (`e`). You can choose any subset of `abe` for the `-m` option. If you omit the `-M` option, the default mail address assigned to your RRZE account is used.
`-o <standard output file>` | File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see `-N`) and the job ID.
`-e <error output file>` | File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see `-N`) and the job ID.
`-I` | Interactive job. A job script may still be specified, but it is ignored except for the PBS options it might contain; no code from it is executed. Instead, the user gets an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with `mpirun`.
`-X` | Enables X11 forwarding. If the `$DISPLAY` environment variable is set when submitting the job, an X program running on the compute node(s) is displayed on the user's screen. This makes sense only for interactive jobs (see the `-I` option).
`-W depend=<dependency list>` | Makes the job depend on certain conditions. E.g., with `-W depend=afterok:12345` the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the `qsub` man page for more information.
`-q <queue>` | Specifies the Torque queue (see above); the default queue is `route`. This parameter is usually not required, as the `route` queue automatically forwards the job to an appropriate execution queue.
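As an illustration, a submission combining several of these options might look like this (the job name, script name, and mail address are placeholders):

```bash
# request 2 full Emmy nodes for 1 hour, name the job, mail on begin and end
qsub -N testrun -M user@example.org -m be -l nodes=2:ppn=40,walltime=01:00:00 job.sh
```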
There are several Torque commands for job inspection and control. The following table gives a short summary:
Command | Purpose | Options
---|---|---
`qstat [<options>] [<JobID>\|<queue>]` | Displays information on jobs. Only the user's own jobs are displayed. For information on the overall queue status see the section on job priorities. | `-a` display "all" jobs in a user-friendly format; `-f` extended job info; `-r` display only running jobs
`qdel <JobID> ...` | Removes a job from the queue. | –
`qalter <qsub-options> <JobID>` | Changes job parameters previously set by `qsub`. Only certain parameters may be changed after the job has started. | see `qsub` and the `qalter` manual page
`qcat [<options>] <JobID>` | Displays stdout/stderr of a running job. | `-o` display stdout (default); `-e` display stderr; `-f` output appended data as the job is running (like `tail -f`)
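A few typical invocations, using a placeholder job ID:

```bash
qstat -a                           # list my jobs in a user-friendly format
qcat -f -o 12345                   # follow the stdout of running job 12345
qalter -l walltime=04:00:00 12345  # change the requested walltime (if still permitted)
qdel 12345                         # remove the job
```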
The scheduler typically sets environment variables to tell the job about what resources were allocated to it. These can also be used in batch scripts. The most useful are given below:
Information | Environment variable
---|---
Job ID | `$PBS_JOBID`
Directory from which the job was submitted | `$PBS_O_WORKDIR`
List of nodes on which the job runs (file name) | `cat $PBS_NODEFILE`
Number of nodes allocated to the job | `$PBS_NUM_NODES`
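Inside a job script these variables can be used, for example, to derive the number of allocated nodes; a short sketch:

```bash
echo "Job $PBS_JOBID was submitted from $PBS_O_WORKDIR"
NODES=$(sort -u $PBS_NODEFILE | wc -l)   # count unique node names in the node file
echo "Running on $NODES node(s) (\$PBS_NUM_NODES = $PBS_NUM_NODES)"
```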
Batch scripts for Torque
To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like the estimated runtime and the required number of nodes/CPUs can also be specified there (instead of on the command line). The first example below is an MPI job, the second an OpenMP job:
```bash
#!/bin/bash -l
#
# allocate 4 nodes (80 cores / 160 SMT threads) for 6 hours
#PBS -l nodes=4:ppn=40,walltime=06:00:00
#
# job name
#PBS -N Sparsejob_33
#
# first non-empty non-comment line ends PBS options

# load required modules (compiler, MPI, ...)
module load example1

# jobs always start in $HOME -
# change to work directory
cd ${PBS_O_WORKDIR}

# uncomment the following lines to use $FASTTMP
# mkdir ${FASTTMP}/$PBS_JOBID
# cd ${FASTTMP}/$PBS_JOBID

# copy input file from location where job was submitted
# cp ${PBS_O_WORKDIR}/inputfile .

# run, using only physical cores
mpirun -n 80 a.out -i inputfile -o outputfile
```
```bash
#!/bin/bash -l
#
# allocate 1 node (4 cores) for 6 hours
#PBS -l nodes=1:ppn=4,walltime=06:00:00
#
# job name
#PBS -N Sparsejob_33
#
# first non-empty non-comment line ends PBS options

# load required modules (compiler, ...)
module load intel64

# jobs always start in $HOME -
# change to work directory
cd ${PBS_O_WORKDIR}

export OMP_NUM_THREADS=4

# run
./a.out
```
The comment lines starting with `#PBS` are ignored by the shell but interpreted by Torque as options for job submission (see the options summary above). All of these options can also be given on the `qsub` command line. The example also shows the use of the `$FASTTMP` and `$HOME` variables. `$PBS_O_WORKDIR` contains the directory from which the job was submitted. All batch scripts start executing in the user's `$HOME`, so some sort of directory change is always in order.
If you have to load modules from inside a batch script, you can do so. The only requirement is that you use either a `csh`-based shell or `bash` with the `-l` switch, as in the examples above.
Interactive Jobs with Torque
For testing purposes, or when running applications that require some manual intervention (like GUIs), Torque offers interactive access to the compute nodes that have been assigned to a job. To do this, specify the `-I` option to the `qsub` command and omit the batch script. When the job is scheduled, you get a shell on the master node (the first in the assigned node list). Any command can be used there, including `mpirun`. If you need X forwarding, use the `-X` option in addition to `-I`.
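For example, an interactive session on one full Emmy node with X11 forwarding could be requested like this (the 30-minute walltime is just an example):

```bash
qsub -I -X -l nodes=1:ppn=40,walltime=00:30:00
```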
Note that the starting time of an interactive batch job cannot reliably be predicted; you have to wait for it to get scheduled. We therefore recommend always running such jobs with wall clock limits of less than one hour, so that the job is routed to the `devel` queue, for which a number of nodes is reserved during working hours.
Interactive batch jobs do not produce `stdout` and `stderr` files. If you want a record of what happened, use e.g. the UNIX `script` command.
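For example, started right after the interactive shell appears, `script` logs the whole session to a file (the file name is illustrative):

```bash
script session_$PBS_JOBID.log   # start recording; type "exit" to stop
```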
Slurm
Commands for Slurm
The command to submit jobs is called `sbatch`. To submit a batch job use
sbatch [options] <job script>
After submission, `sbatch` will output the job ID of your job. It can later be used for identification purposes and is also available as the environment variable `$SLURM_JOBID` in job scripts (see below). The following parameters can be specified as options for `sbatch` or included in the job script via the script directive `#SBATCH`:
Option | Meaning
---|---
`--job-name=<name>` | Specifies the name that is shown by `squeue`. If the option is omitted, the name of the batch script file is used.
`--nodes=<number>` | Specifies the number of nodes requested. Default value is 1.
`--ntasks=<number>` | Overall number of tasks (MPI processes). Can be omitted if `--nodes` and `--ntasks-per-node` are given. Default value is 1.
`--ntasks-per-node=<number>` | Number of tasks (MPI processes) per node.
`--cpus-per-task=<number>` | Number of threads (logical cores) per task. Used for OpenMP or hybrid jobs.
`--time=HH:MM:SS` | Specifies the required wall clock time (runtime). When the job reaches this walltime it is sent a `TERM` signal; after a few seconds, if it has not ended yet, it is sent `KILL`. If you omit the walltime option, a very short default time is used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
`--mail-user=<address>` | Sends e-mail to `<address>` depending on the type you have specified with `--mail-type`. As type, you can choose `BEGIN`, `END`, `FAIL`, `TIME_LIMIT`, or `ALL`; specifying more than one type is also possible.
`--output=<file_name>` | File name for the standard output stream. This should normally not be used, since a suitable name is automatically compiled from the job name and the job ID.
`--error=<file_name>` | File name for the standard error stream. By default, stderr is merged with stdout.
`--partition=<partition>` | Specifies the partition/queue to which the job is submitted. If no partition is given, `work` is used. The `devel` partition has to be requested explicitly if the job qualifies for it; jobs in this queue run with higher priority.
`--constraint=hwperf` | Access to hardware performance counters (e.g. using `likwid-perfctr`). Only request this feature if you really want to access the hardware performance counters!
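As an illustration, a hybrid MPI/OpenMP submission combining several of these options might look like this (all values and the script name are placeholders):

```bash
sbatch --job-name=testrun --nodes=2 --ntasks-per-node=2 --cpus-per-task=10 \
       --time=01:00:00 --mail-user=user@example.org --mail-type=END job.sh
```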
There are several Slurm commands for job inspection and control. The following table gives a short summary:
Command | Purpose | Options
---|---|---
`squeue [<options>]` | Displays information on jobs. Only the user's own jobs are displayed. | `-t running` display only currently running jobs; `-j <JobID>` display info on job `<JobID>`
`scancel <JobID>` | Removes a job from the queue, or terminates it if it is already running. | –
`scontrol show job <JobID>` | Displays very detailed information on a job. | –
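Typical invocations, with a placeholder job ID:

```bash
squeue                     # list my jobs
squeue -t running          # only the currently running ones
scontrol show job 12345    # detailed information on job 12345
scancel 12345              # remove or terminate job 12345
```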
The scheduler typically sets environment variables to tell the job about what resources were allocated to it. These can also be used in batch scripts. The most useful are given below:
Information | Environment variable
---|---
Job ID | `$SLURM_JOB_ID`
Directory from which the job was submitted | `$SLURM_SUBMIT_DIR`
List of nodes on which the job runs | `$SLURM_JOB_NODELIST`
Number of nodes allocated to the job | `$SLURM_JOB_NUM_NODES`
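A short sketch of how these variables can be used inside a job script:

```bash
echo "Job $SLURM_JOB_ID runs on $SLURM_JOB_NUM_NODES node(s): $SLURM_JOB_NODELIST"
cd "$SLURM_SUBMIT_DIR"   # usually redundant, since Slurm jobs start there anyway
```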
By default Slurm jobs will automatically start in the directory where the job was submitted.
One difference from Torque is that environment variables set at the time of submission, including currently loaded module files, are propagated into the Slurm job. To have a clean environment in job scripts, it is recommended to add `#SBATCH --export=NONE` and `unset SLURM_EXPORT_ENV` to the job script; otherwise the job will inherit settings from the submitting shell.
Batch scripts for Slurm
```bash
#!/bin/bash -l
#
# allocate 4 nodes with 20 cores per node = 4*20 = 80 MPI tasks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name
#SBATCH --job-name=Sparsejob_33
# do not export environment variables
#SBATCH --export=NONE
#
# first non-empty non-comment line ends SBATCH options

# do not export environment variables
unset SLURM_EXPORT_ENV

# jobs always start in the submit directory

# load required modules (compiler, MPI, ...)
module load example1

# uncomment the following lines to use $FASTTMP
# mkdir ${FASTTMP}/$SLURM_JOB_ID
# cd ${FASTTMP}/$SLURM_JOB_ID

# copy input file from location where job was submitted
# cp ${SLURM_SUBMIT_DIR}/inputfile .

# run
srun a.out
```
```bash
#!/bin/bash -l
#
# allocate 1 node with 20 physical cores, without hyperthreading
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name
#SBATCH --job-name=Sparsejob_33
# do not export environment variables
#SBATCH --export=NONE
#
# first non-empty non-comment line ends SBATCH options

# do not export environment variables
unset SLURM_EXPORT_ENV

# jobs always start in the submit directory

# load required modules (compiler, ...)
module load intel64

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# run
./a.out
```
Interactive Jobs with Slurm
To run an interactive job with Slurm, use:
srun [Usual srun arguments for number of nodes, walltime, etc.] --pty /bin/bash -l
This will queue a job and give you a shell on the first allocated node as soon as the job starts. The parameters for `srun` are the same as for `sbatch`, as described above.
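For example, to get an interactive shell on one node for 30 minutes (the values are illustrative):

```bash
srun --nodes=1 --time=00:30:00 --pty /bin/bash -l
```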
There is currently no way to request X11 forwarding to an interactive Slurm job.
Advanced topics
Staging Out Results
Warning! This does not work with the current version of the batch system due to a software bug!
When a job reaches its walltime limit, it is killed by the batch system. The job's node-local data will either be deleted (if you use `$TMPDIR`) or be inaccessible, because login to a node is disallowed if you don't have a job running there. In order to prevent data loss, Torque waits 60 seconds after the `TERM` signal before sending the final `KILL`. If the batch script catches `TERM` with a signal handler, those 60 seconds can be used to copy node-local data to a global file system:
```bash
#!/bin/bash
# signal handler: catch SIGTERM, save scratch data
trap "sleep 5 ; cd $TMPDIR ; tar cf - * | tar xf - -C ${WOODYHOME}/$PBS_JOBID ; exit" 15

# make directory for saving job data
mkdir ${WOODYHOME}/$PBS_JOBID

cd $PBS_O_WORKDIR

# assuming a.out stores temp data in $TMPDIR
mpirun ./a.out
```
The `sleep` command at the start of the signal handler gives your application some time to shut down before the data is saved. Please note that a Bourne or Korn shell variant is required for catching the `TERM` signal, since `csh` has only limited facilities for signal handling.
Trapping signals
Signals like SIGUSR1 are not processed while a foreground command is running, as detailed in the bash man page: "If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes. When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
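In practice this means that a job script which needs to react to a signal promptly should start the application asynchronously and use `wait`; a minimal sketch (the handler body is illustrative):

```bash
#!/bin/bash
# the trap fires immediately on SIGTERM, because the shell waits via the builtin
trap 'echo "caught SIGTERM, cleaning up" ; exit' TERM

./a.out &    # start the application asynchronously
wait $!      # returns (status >128) as soon as the signal arrives; then the trap runs
```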
Chain Jobs
For some calculations, it is beneficial to automatically submit a subsequent job after the current run has finished. This can be achieved by including the submit command in your job script. However, keep in mind that the job will then always resubmit itself, even if something goes wrong at run time, e.g. missing input files. This can lead to jobs running wild until they are manually aborted. To prevent this from happening, the job should only resubmit itself if it has already run for a sufficiently long time. The following approach can be used:
```bash
#!/bin/bash
if [ "$SECONDS" -gt "7000" ]; then
    cd ${PBS_O_WORKDIR}
    qsub job_script
fi
```
The bash environment variable `$SECONDS` contains the run time of the shell in seconds. Please note that it is not defined for `csh`.
On the TinyX clusters, the plain `qsub` command also has to be used within submit scripts, instead of the machine-specific `qsub.tinyx`.
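On the Slurm clusters, a corresponding guard might look like this (a sketch; `$SECONDS` behaves the same way under bash):

```bash
#!/bin/bash
if [ "$SECONDS" -gt "7000" ]; then
    cd ${SLURM_SUBMIT_DIR}
    sbatch job_script
fi
```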
Job Priorities and Reservations
The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters, like waiting time, queue, user group, and recently used CPU time (a.k.a. fairshare). The ordering of waiting jobs listed by `qstat` does not reflect the priority of jobs. All waiting jobs, with their assigned priority, are listed anonymously on the HPC user web pages (some of those pages are password protected; execute the `docpw` command to get the username and password). There you also get a list of all running jobs, any node reservations, and all jobs which cannot be scheduled for some reason. Some of this information is also available in text form: the file /home/woody/STATUS/joblist contains a list of all waiting jobs; the file /home/woody/STATUS/nodelist contains information about node and queue activities.
Job Monitoring
On meggie and emmy, it is possible to access performance data of your finished jobs, including e.g. memory usage, floating-point rate, and usage of the (parallel) file system. To review this information, you need a job-specific AccessKey, which can be found in the job's output file.
Specific Clock Frequency
By default, the compute nodes at RRZE run with turbo mode enabled and the ondemand governor. Node properties can also be used to request a certain CPU clock frequency. This is not something you will usually want to do, but it can be useful for certain kinds of benchmarking. Note that you cannot make the CPUs go any faster, only slower: the default already is turbo mode, which clocks the CPU as fast as possible (up to 2.6 GHz) without exceeding its thermal or power budget. So please do not use any of the following options unless you know what you are doing. `likwid-setFreq` is not supported on the clusters.
With Torque as the batch system, the available options are: `:noturbo` to disable turbo mode, `:f2.2` to request 2.2 GHz (this is equivalent to `:noturbo`), `:f2.1` to request 2.1 GHz, and so on in 0.1 GHz steps down to `:f1.2` for 1.2 GHz.
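For example, one node fixed at 2.0 GHz might be requested like this (script name and walltime are placeholders):

```bash
qsub -l nodes=1:ppn=40:f2.0,walltime=01:00:00 job.sh
```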
With Slurm as the batch system, the frequency (in kHz) can be specified using the `--cpu-freq` option of `sbatch` or `srun`; however, the frequency is only set once `srun` is called (directly or indirectly). This can be a problem for single-node OpenMP applications, or on the first node of an MPI application started with `mpirun`, which then still run in turbo/ondemand mode.
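For example, a benchmark pinned to 2.2 GHz (i.e. 2200000 kHz; the value is illustrative):

```bash
srun --cpu-freq=2200000 ./a.out
```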