Transition of RTX2080Ti and V100 nodes (tg06x, tg07x) in TinyGPU from Ubuntu 18.04 with Torque to Ubuntu 20.04 with Slurm


Valued HPC users of RRZE,

As already announced in the HPC Cafe in January for the summer, we will now reinstall the RTX2080Ti and V100 nodes in TinyGPU with Ubuntu 20.04 (instead of Ubuntu 18.04) and integrate them into the Slurm batch system of the RTX3080/A100 GPU nodes. The first RTX2080Ti and V100 nodes have already been reinstalled and moved to Slurm in the past few days. The remaining nodes will follow gradually until the end of October to allow a smooth transition.

All RTX2080Ti and V100 nodes have successfully been reinstalled with Ubuntu 20.04 and moved to Slurm as batch system.

Besides the new batch system, the Ubuntu 20.04 nodes also use a new /apps directory, so the available modules and software versions change. As a consequence, batch scripts will probably have to be updated not only in the batch pragmas but also in the command section. Recompilation of software may be required if dependencies have changed too much. If modules are missing, let us know; however, we won't install outdated versions of software. A snapshot of the currently available modules is listed below; the current list can be obtained on the Woody login node using ubuntu2004-env module avail.

There is no login node with Ubuntu 20.04 yet. Thus, to (re)compile software, either (a) request an interactive job and (re)compile on a compute node, or (b) work inside a Singularity image with Ubuntu 20.04 on the login node by calling ubuntu2004-env. The software within this Singularity container should largely match what is available on the Ubuntu 20.04 compute nodes (including the installed CUDA libraries).
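Since the login node itself still runs Ubuntu 18.04, a build script can guard against being started in the wrong environment. The following is a minimal sketch, not an RRZE-provided tool: the function names `os_id_version` and `is_ubuntu2004` are hypothetical, and it assumes the standard `/etc/os-release` file is present both inside the container and on the compute nodes.

```bash
# Print the ID and VERSION_ID fields of an os-release file, e.g. "ubuntu 20.04".
# $1: path to the file (defaults to /etc/os-release).
os_id_version() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "${1:-/etc/os-release}" && printf '%s %s\n' "$ID" "$VERSION_ID" )
}

# Succeed only when the environment reports Ubuntu 20.04 (hypothetical helper).
is_ubuntu2004() {
    [ "$(os_id_version "$@")" = "ubuntu 20.04" ]
}
```

A build script could call `is_ubuntu2004 || exit 1` before invoking the compiler, so that an accidental build directly on the Ubuntu 18.04 login node stops early.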

The GTX1080/GTX1080Ti nodes in TinyGPU will remain unchanged in their legacy state with Ubuntu 18.04 and Torque at least until the end of the year for those who cannot quickly follow the transition.
The Intel Broadwell based TinyFat nodes (tf04x/tf05x) will be reinstalled with Ubuntu 20.04 and Slurm in November.
Woody is expected to keep Ubuntu 18.04 and Torque at least until early next year.

We are also eagerly awaiting the delivery of the GPU nodes of our large NHR/Tier3 procurement with Nvidia A40 and A100 GPGPUs; they will hopefully arrive in the next few weeks. These nodes will have AMD Milan processors and also use Slurm as batch system; however, they will run a RedHat8 clone (probably AlmaLinux) as operating system instead of Ubuntu. If you have software which heavily depends on external libraries, putting everything into a Singularity container, and not compiling it explicitly for Intel processors only, might be a good idea: it eases moving between clusters, e.g. depending on the waiting time in the queues. Building a Singularity container from scratch requires root permissions and thus cannot be done on the HPC systems. However, importing a Singularity image from Dockerhub works fine even directly on the HPC systems. Alternatively, you can build the Singularity image on any of your local systems where you have root permissions (including a simple virtual Linux machine on your PC) and copy the resulting image file to the HPC systems for use there.
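Importing an image from Dockerhub can be sketched as below. `singularity pull` with a `docker://` URI and `singularity exec` (with `--nv` to make the host's NVIDIA driver libraries visible in the container) are standard Singularity commands; the image `ubuntu:20.04`, the file name `ubuntu2004.sif`, the function names and the `DRYRUN` switch are illustrative assumptions.

```bash
# Set DRYRUN=1 to print the commands instead of running them (hypothetical switch).

# Import a Dockerhub image into a local Singularity image file.
# $1: Dockerhub reference, e.g. ubuntu:20.04   $2: output file, e.g. ubuntu2004.sif
import_image() {
    ${DRYRUN:+echo} singularity pull "$2" "docker://$1"
}

# Run a command inside the image; --nv exposes the host's NVIDIA GPUs and driver.
# $1: image file, remaining arguments: command to run
run_in_image() {
    image=$1
    shift
    ${DRYRUN:+echo} singularity exec --nv "$image" "$@"
}
```

For example, `import_image ubuntu:20.04 ubuntu2004.sif` followed by `run_in_image ubuntu2004.sif nvidia-smi` inside a GPU job would pull the image and list the GPUs visible from within the container.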

Kind regards

A quick primer for the transition from Torque to Slurm

See also for a more detailed description.

Batch commands (Torque vs. Slurm):

qsub.tinygpu  jobscript.pbs      => sbatch.tinygpu  jobscript.slurm
qstat.tinygpu                    => squeue.tinygpu
qstat.tinygpu -f JOBID           => scontrol.tinygpu show job=JOBID
qdel.tinygpu  JOBID              => scancel.tinygpu  JOBID
qsub.tinygpu -I                  => salloc.tinygpu --gres=gpu:1

Within jobscripts (e.g. to resubmit jobs), the suffix .tinygpu is not required for any of these commands.

Batch scripts (Torque vs. Slurm):

#!/bin/bash -l                   => #!/bin/bash -l      ## no change
#PBS -lnodes=1:ppn=4[:gputype]   => no need to specify number of nodes/cores;
                                    that's all automatically selected based
                                    on the GPU request
                                 => #SBATCH -p work    ## RTX2080Ti/RTX3080 nodes
                                    #SBATCH --gres=gpu[:gputype]:1    # max. gpu:8 for RTX3080, max. gpu:4 for RTX2080Ti
                                    set gputype as rtx2080ti or rtx3080 to allocate a specific GPU model
                                 => #SBATCH -p v100    ## V100 nodes
                                    #SBATCH --gres=gpu:v100:1   # max. v100:4
                                 => #SBATCH -p a100    ## A100 nodes
                                    #SBATCH --gres=gpu:a100:1   # max. a100:4
#PBS -lnodes=...:smt             => n/a                         # SMT threads are added by default
#PBS -lwalltime=12:0:0           => #SBATCH --time=12:0:0
#PBS -N myjob                    => #SBATCH --job-name=myjob
#PBS -lnodes=...:likwid          => #SBATCH --constraint=hwperf # measuring counters with LIKWID

--n/a--                          => to get a clean environment add the following 2 lines
                                    #SBATCH --export=NONE
                                    unset SLURM_EXPORT_ENV
cd $PBS_O_WORKDIR                => usually not required
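
Putting the pragmas from the table together, a complete Slurm job script for a single RTX2080Ti GPU might look like the one written by the snippet below. This is a sketch, not an official template: the file name `jobscript.slurm` follows the command table above, `cuda/11.2.2` is taken from the module listing in this article, and `./my_gpu_app` is a hypothetical application.

```bash
# Create jobscript.slurm in the current directory; the quoted 'EOF' keeps
# the shell from expanding anything inside the here-document.
cat > jobscript.slurm <<'EOF'
#!/bin/bash -l
#SBATCH -p work                      # RTX2080Ti/RTX3080 partition
#SBATCH --gres=gpu:rtx2080ti:1       # one RTX2080Ti GPU; cores follow from the GPU request
#SBATCH --time=12:0:0
#SBATCH --job-name=myjob
#SBATCH --export=NONE                # start with a clean environment
unset SLURM_EXPORT_ENV

module load cuda/11.2.2              # version taken from the module listing
echo "Job $SLURM_JOB_ID started on $(hostname)"
./my_gpu_app                         # hypothetical application
EOF
```

The resulting file would then be submitted from the login node with `sbatch.tinygpu jobscript.slurm`.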

Environment variables (Torque vs. Slurm):

$PBS_JOBID              => $SLURM_JOB_ID
cat $PBS_NODEFILE       => scontrol show hostnames $SLURM_JOB_NODELIST
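
For scripts that have to run on both the legacy Torque nodes and the Slurm nodes during the transition, the variables above can be wrapped in small helpers. A minimal sketch; the function names `job_id` and `node_list` are hypothetical.

```bash
# Print the job ID of whichever batch system started the job.
job_id() {
    printf '%s\n' "${SLURM_JOB_ID:-${PBS_JOBID:-unknown}}"
}

# Print the hosts assigned to the job.
node_list() {
    if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
        # Slurm: expand the compact node list into one host name per line.
        scontrol show hostnames "$SLURM_JOB_NODELIST"
    elif [ -n "${PBS_NODEFILE:-}" ]; then
        # Torque: the node file already lists the host names line by line.
        cat "$PBS_NODEFILE"
    else
        # Outside a batch job: fall back to the local host.
        hostname
    fi
}
```

Checking the Slurm variables first means the same script keeps working unchanged once the remaining Torque nodes have been migrated.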

Major module changes

Special behavior which was only available at RRZE has been given up:

  1. The intel64 module has been renamed to intel and no longer automatically loads intel-mpi and mkl.
  2. intel-mpi/VERSION-intel and intel-mpi/VERSION-gcc have been unified into intel-mpi/VERSION. The compiler is selected by the wrapper name, e.g. mpicc = GCC, mpiicc = Intel, mpif90 = GFortran, mpiifort = Intel.
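
Because the compiler is now chosen by the wrapper name rather than by the module name, build scripts that switch between GCC and Intel have to pick the wrapper themselves. The helper below is a hypothetical sketch of that mapping, using only the wrapper names given above; the function name `mpi_wrapper` and its arguments are assumptions.

```bash
# Print the Intel MPI compiler wrapper for a compiler family and language.
# $1: gcc | intel    $2: c | fortran
mpi_wrapper() {
    case "$1:$2" in
        gcc:c)         echo mpicc ;;
        intel:c)       echo mpiicc ;;
        gcc:fortran)   echo mpif90 ;;
        intel:fortran) echo mpiifort ;;
        *) echo "unknown combination: $1 $2" >&2; return 1 ;;
    esac
}

# The old intel64 module pulled in intel-mpi and mkl automatically; now each
# module is loaded explicitly (versions omitted here, check `module avail`):
#   module load intel intel-mpi mkl
#   "$(mpi_wrapper intel c)" -o hello hello.c     # hello.c is hypothetical
```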

Available modules on the Ubuntu 20.04 nodes as of 2021-10-12 (the listing may include access-restricted modules)

---------------------- /apps/modules/data/applications ----------------------
amber-gpu/20p08-at20p12-gnu-cuda11.2                 orca/4.2.1          
amber-gpu/20p08-at20p12-gnu-cuda11.2.0-ompi          r/4.0.2-mro         
amber/20p08-at20p12-intel-impi                       star-ccm+/2020.3.1  
ansys/2020R2                                         vmd/1.9.3           

------------------------ /apps/modules/data/compiler ------------------------
aocc/2.2.0  intel/2019.5  llvm/11.0.0  nvhpc/21.1  nvhpc/21.3  nvhpc/21.7  
gcc/10.2.0  intel/2020.2  nvhpc/20.11  nvhpc/21.2  nvhpc/21.5  nvhpc/21.9  

---------------------- /apps/modules/data/development -----------------------
cuda/11.1.0      openmpi/3.1.6-gcc9.3-legacy         
cuda/11.2.0      openmpi/3.1.6-intel19.1             
cuda/11.2.2      openmpi/4.0.5-gcc9.3.0-cuda-legacy  

----------------------- /apps/modules/data/libraries ------------------------
boost/1.74.0-gcc9.3        fftw/3.3.8-intel19.1-openmpi          tbb/2020.3  
boost/1.74.0-intel19.1     hdf5/1.10.7-gcc9.3-impi               
cudnn/    hdf5/1.10.7-gcc9.3-openmpi            
eigen/3.3.8                mkl/2019.5                            
fftw/3.3.8-gcc9.3          mkl/2020.3                            
fftw/3.3.8-gcc9.3-impi     mkl/2020.4                            
fftw/3.3.8-gcc9.3-openmpi  quantumtools/qflex-head               
fftw/3.3.8-intel19.1       quantumtools/qsimcirq-0.7.1           
fftw/3.3.8-intel19.1-impi  quantumtools/qsimcirq-qulacs-2021-01  

------------------------- /apps/modules/data/tools --------------------------
cmake/3.18.4  julia/1.6.1        likwid/5.1.1-msr     
ddt/20.2      likwid/5.1.0a-msr  python/3.8-anaconda  

----------------------- /apps/modules/data/via-spack ------------------------
000-all-spack-pkgs/0.16.0  000-all-spack-pkgs/0.16.1  

------------------------ /apps/modules/data/testing -------------------------
lammps/20201029-gcc9.3.0-openmpi4.0.5-cuda11.2.2-mkl  user-spack/0.16.2