Transition of RTX2080Ti and V100 nodes (tg06x, tg07x) in TinyGPU from Ubuntu 18.04 with Torque to Ubuntu 20.04 with Slurm
Valued HPC users of RRZE,
As already announced for the summer at the HPC Cafe in January (https://hpc.fau.de/files/2021/01/2021-01-12-hpc-cafe-tinygpu-intro.pdf), we are now reinstalling the RTX2080Ti and V100 nodes in TinyGPU with Ubuntu 20.04 (instead of Ubuntu 18.04) and integrating them into the Slurm batch system of the RTX3080/A100 GPU nodes. The first RTX2080Ti and V100 nodes have already been reinstalled and moved to Slurm over the past days. The remaining nodes will follow gradually until the end of October to allow a smooth transition.
Update: in the meantime, all RTX2080Ti and V100 nodes have successfully been reinstalled with Ubuntu 20.04 and moved to the Slurm batch system.
Besides the new batch system, the Ubuntu 20.04 nodes also use a new /apps directory; thus, the available modules and software versions change. As a consequence, batch scripts will probably have to be updated not only with regard to the batch pragmas but also in the command section. Recompilation of software may be required if dependencies have changed too much. If modules are missing, let us know; however, we will not install outdated software versions. A snapshot of the currently available modules is listed below; the current list can be obtained on the Woody login node using ubuntu2004-env module avail.
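For example, to check whether a specific package is available on the Ubuntu 20.04 nodes, the listing can be filtered; a short sketch (only ubuntu2004-env module avail is documented above, passing a package name as filter is the standard module avail behavior):
  ubuntu2004-env module avail            # full listing of the new /apps tree
  ubuntu2004-env module avail gromacs    # restrict the listing to one package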
There is no login node with Ubuntu 20.04 yet. Thus, to (re)compile software, either (a) request an interactive job and (re)compile on a compute node, or (b) try from within a Singularity image with Ubuntu 20.04 on the login node by calling ubuntu2004-env. The software within this Singularity container should be pretty much the same as on the Ubuntu 20.04 compute nodes (including the installed CUDA libraries).
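A short sketch of both options (the walltime is an arbitrary example; ubuntu2004-env without arguments is assumed to drop you into a shell inside the container):
  # (a) interactive job on a TinyGPU compute node, then compile there as usual
  salloc.tinygpu --gres=gpu:1 --time=01:00:00
  # (b) Ubuntu 20.04 container environment on the Woody login node
  ubuntu2004-env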
The GTX1080/GTX1080Ti nodes in TinyGPU will remain in their legacy state with Ubuntu 18.04 and Torque at least until the end of the year for those who cannot follow the transition quickly.
The Intel Broadwell-based TinyFat nodes (tf04x/tf05x) will be reinstalled with Ubuntu 20.04 and Slurm in November.
Woody is expected to keep Ubuntu 18.04 and Torque at least until early next year.
We are also eagerly awaiting the delivery of the GPU nodes of our large NHR/Tier3 procurement with Nvidia A40 and A100 GPGPUs; they will hopefully arrive in the next few weeks. These nodes will have AMD Milan processors and also use Slurm as batch system; however, they will run a RedHat 8 clone (probably AlmaLinux) as operating system instead of Ubuntu. If your software heavily depends on external libraries, putting everything into a Singularity container, and not compiling it explicitly for Intel processors only, might be a good idea to ease moving between clusters depending on the waiting times in the queues. Building a Singularity container from scratch requires root permissions and, thus, cannot be done on the HPC systems. However, importing a Singularity image from Docker Hub works fine even directly on the HPC systems. Alternatively, you can build the Singularity image on any local system where you have root permissions (including a simple virtual Linux machine on your PC) and copy the resulting image file to the HPC systems to use it there.
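A minimal sketch of the container workflows mentioned above (image and definition file names are arbitrary examples):
  # pull an existing image from Docker Hub -- works directly on the HPC systems
  singularity pull ubuntu2004.sif docker://ubuntu:20.04
  # build an image from a definition file -- needs root, i.e. a local machine or VM
  sudo singularity build myapp.sif myapp.def
  # run a program inside the container
  singularity exec myapp.sif ./my_program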
Kind regards
hpc@fau
A quick primer for the transition from Torque to Slurm
See also https://hpc.fau.de/systems-services/documentation-instructions/batch-processing/#slurm for a more detailed description.
Batch commands (Torque vs. Slurm):
qsub.tinygpu jobscript.pbs => sbatch.tinygpu jobscript.slurm
qstat.tinygpu => squeue.tinygpu
qstat.tinygpu -f JOBID => scontrol.tinygpu show job=JOBID
qdel.tinygpu JOBID => scancel.tinygpu JOBID
qsub.tinygpu -I => salloc.tinygpu --gres=gpu:1
Within job scripts (e.g. to resubmit jobs), the .tinygpu suffix is, as before, not required for any of these commands.
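A typical submit/monitor/cancel cycle thus looks as follows (the job ID 123456 is an arbitrary example):
  sbatch.tinygpu jobscript.slurm        # prints the assigned job ID
  squeue.tinygpu                        # overview of your queued/running jobs
  scontrol.tinygpu show job=123456      # detailed information on one job
  scancel.tinygpu 123456                # cancel the job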
Batch scripts (Torque vs. Slurm):
#!/bin/bash -l => #!/bin/bash -l ## no change
#PBS -lnodes=1:ppn=4[:gputype] => no need to specify number of nodes/cores;
that's all automatically selected based
on the GPU request
=> #SBATCH -p work ## RTX2080Ti/RTX3080 nodes
#SBATCH --gres=gpu[:gputype]:1 # max. gpu:8 for RTX3080, max. gpu:4 for RTX2080Ti
set gputype to rtx2080ti or rtx3080 to request a specific GPU model
=> #SBATCH -p v100 ## V100 nodes
#SBATCH --gres=gpu:v100:1 # max. v100:4
=> #SBATCH -p a100 ## A100 nodes
#SBATCH --gres=gpu:a100:1 # max. a100:4
#PBS -lnodes=...:smt => n/a # SMT threads are added by default
#PBS -lwalltime=12:0:0 => #SBATCH --time=12:0:0
#PBS -N myjob => #SBATCH --job-name=myjob
#PBS -lnodes=...:likwid => #SBATCH --constraint=hwperf # measuring counters with LIKWID
--n/a-- => to get a clean environment, add the following two lines
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
cd $PBS_O_WORKDIR => usually not required
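Putting the Slurm column together, a minimal job script for one RTX2080Ti GPU could look like the following sketch (module, walltime, and program name are placeholder examples):
  #!/bin/bash -l
  #SBATCH --job-name=myjob
  #SBATCH -p work                    # RTX2080Ti/RTX3080 partition
  #SBATCH --gres=gpu:rtx2080ti:1     # one RTX2080Ti GPU
  #SBATCH --time=12:0:0
  #SBATCH --export=NONE              # clean environment
  unset SLURM_EXPORT_ENV

  module load cuda/11.2.2            # placeholder; load the modules your code needs
  ./my_gpu_program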
Environment variables (Torque vs. Slurm):
$PBS_O_WORKDIR => $SLURM_SUBMIT_DIR
$PBS_JOBID => $SLURM_JOB_ID
cat $PBS_NODEFILE => scontrol show hostnames $SLURM_JOB_NODELIST
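For example, to generate a node list similar to $PBS_NODEFILE inside a job script (the file name is an arbitrary example):
  scontrol show hostnames $SLURM_JOB_NODELIST > hosts.$SLURM_JOB_ID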
Major module changes
Special behavior that was only available at RRZE has been dropped:
- The intel64 module has been renamed to intel and no longer automatically loads intel-mpi and mkl.
- intel-mpi/VERSION-intel and intel-mpi/VERSION-gcc have been unified into intel-mpi/VERSION. The compiler is now selected via the wrapper name, e.g. mpicc = GCC, mpiicc = Intel; mpif90 = GFortran, mpiifort = Intel.
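For example, MPI and MKL now have to be loaded explicitly, and the compiler is picked via the wrapper (a sketch; the module versions are taken from the snapshot below, where the Intel MPI module appears as intelmpi, and hello.c/hello.f90 are placeholder sources):
  module load intel/2020.2 intelmpi/2019.8 mkl/2020.4
  mpicc    hello.c   -o hello_gcc       # GCC + Intel MPI
  mpiicc   hello.c   -o hello_icc       # Intel C compiler + Intel MPI
  mpif90   hello.f90 -o hello_gfortran  # GFortran + Intel MPI
  mpiifort hello.f90 -o hello_ifort     # Intel Fortran + Intel MPI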
Available modules on the Ubuntu 20.04 nodes as of 2021-10-12 (the listing may include access-restricted modules)
---------------------- /apps/modules/data/applications ----------------------
amber-gpu/20p08-at20p12-gnu-cuda11.2 orca/4.2.1
amber-gpu/20p08-at20p12-gnu-cuda11.2.0-ompi r/4.0.2-mro
amber/20p08-at20p12-intel-impi star-ccm+/2020.3.1
ansys/2020R2 vmd/1.9.3
ansys/2021R1
gromacs/2020.4-gcc-impi-mkl
gromacs/2020.4-gcc-impi-mkl-cuda11.2
gromacs/2020.4-gcc-mkl
gromacs/2020.4-gcc-mkl-cuda11.2
gromacs/2020.4-gcc-openmpi-mkl-cuda11.2
gromacs/2020.4-intel19.1-impi-mkl
gromacs/2020.4-intel19.1-mkl
gromacs/2020.6-gcc-mkl-cuda11.2
gromacs/2020.6-gcc-openmpi-mkl-cuda11.2-plumed2.7.2
gromacs/2021.1-gcc-mkl-cuda11.2
gromacs/2021.1-gcc-openmpi-mkl-cuda11.2
mathematica/12.2
matlab/R2020b
namd/2.14-multicore-cuda
------------------------ /apps/modules/data/compiler ------------------------
aocc/2.2.0 intel/2019.5 llvm/11.0.0 nvhpc/21.1 nvhpc/21.3 nvhpc/21.7
gcc/10.2.0 intel/2020.2 nvhpc/20.11 nvhpc/21.2 nvhpc/21.5 nvhpc/21.9
---------------------- /apps/modules/data/development -----------------------
cuda/11.1.0 openmpi/3.1.6-gcc9.3-legacy
cuda/11.2.0 openmpi/3.1.6-intel19.1
cuda/11.2.2 openmpi/4.0.5-gcc9.3.0-cuda-legacy
intelmpi/2019.8
----------------------- /apps/modules/data/libraries ------------------------
boost/1.74.0-gcc9.3 fftw/3.3.8-intel19.1-openmpi tbb/2020.3
boost/1.74.0-intel19.1 hdf5/1.10.7-gcc9.3-impi
cudnn/8.0.5.39-cuda11.1 hdf5/1.10.7-gcc9.3-openmpi
eigen/3.3.8 mkl/2019.5
fftw/3.3.8-gcc9.3 mkl/2020.3
fftw/3.3.8-gcc9.3-impi mkl/2020.4
fftw/3.3.8-gcc9.3-openmpi quantumtools/qflex-head
fftw/3.3.8-intel19.1 quantumtools/qsimcirq-0.7.1
fftw/3.3.8-intel19.1-impi quantumtools/qsimcirq-qulacs-2021-01
------------------------- /apps/modules/data/tools --------------------------
cmake/3.18.4 julia/1.6.1 likwid/5.1.1-msr
ddt/20.2 likwid/5.1.0a-msr python/3.8-anaconda
----------------------- /apps/modules/data/via-spack ------------------------
000-all-spack-pkgs/0.16.0 000-all-spack-pkgs/0.16.1
------------------------ /apps/modules/data/testing -------------------------
lammps/20201029-gcc9.3.0-openmpi4.0.5-cuda11.2.2-mkl user-spack/0.16.2