Alex GPGPU cluster (NHR+Tier3)
FAU’s Alex cluster (system integrator: Megware) is a high-performance compute resource with Nvidia GPGPU accelerators and a partially high-speed interconnect. It is intended for single-GPGPU and multi-GPGPU workloads, e.g. from molecular dynamics or machine learning. Alex serves both as one of FAU’s basic Tier3 resources and as an NHR project resource.
- 2 front end nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB Shared L3 cache per chip, 512 GB of RAM, and 100 GbE connection to RRZE’s network backbone but no GPGPUs.
- 20 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB Shared L3 cache per chip, 1,024 GB of DDR4-RAM, eight Nvidia A100 (each 40 GB HBM2 @ 1,555 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), two HDR200 Infiniband HCAs, 25 GbE, and 14 TB on local NVMe SSDs.
- 18 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB Shared L3 cache per chip, 2,048 GB of DDR4-RAM, eight Nvidia A100 (each 80 GB HBM2 @ 2,039 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), two HDR200 Infiniband HCAs, 25 GbE, and 14 TB on local NVMe SSDs.
- 44 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 512 GB of DDR4-RAM, eight Nvidia A40 (each with 48 GB GDDR6 @ 696 GB/s; 37.42 TFlop/s in FP32), 25 GbE, and 7 TB on local NVMe SSDs.
As of May 2023, there is a total of 352 Nvidia A40, 160 Nvidia A100/40GB, and 144 Nvidia A100/80GB GPGPUs. The Nvidia A40 GPGPUs have a very high single-precision floating-point performance (even higher than an A100!) and are much less expensive than Nvidia A100 GPGPUs. All workloads which only require single-precision floating-point operations, like many molecular dynamics applications, should therefore target the Nvidia A40 GPGPUs.
Alex complements RRZE’s TinyGPU cluster. Alex addresses high-end GPGPU workloads, while TinyGPU mainly comes with consumer GPUs of different generations which, nevertheless, provide an unbeatable price-performance ratio for single-precision floating-point applications that require only little GPU memory. TinyGPU currently also includes 8 nodes with a total of 32 Nvidia A100/40GB GPGPUs; these nodes may at a later point in time be moved to Alex.
On 160 Nvidia A100/40GB GPGPUs (i.e. 20 nodes), a LINPACK performance of 1.73 PFlop/s has been measured in January 2022.
On 160 Nvidia A100/40GB plus 96 Nvidia A100/80GB GPGPUs (i.e. 32 nodes = 256 A100 GPGPUs in total), a LINPACK performance of 2.938 PFlop/s has been measured in May 2022 resulting in place 184 of the June 2022 Top500 list and place 17 in the Green500 of June 2022.
On 160 Nvidia A100/40GB plus 120 Nvidia A100/80GB GPGPUs (i.e. 35 nodes = 280 A100 GPGPUs in total), a LINPACK performance of 3.24 PFlop/s has been measured in Oct. 2022 resulting in place 174 of the Nov. 2022 Top500 list and place 33 in the Green500 of Nov. 2022.
On 160 Nvidia A100/40GB plus 144 Nvidia A100/80GB GPGPUs (i.e. 38 nodes = 304 A100 GPGPUs in total), a LINPACK performance of 4.030 PFlop/s has been measured in April 2023, resulting in place 157 of the June 2023 Top500 list and place 36 in the Green500 of June 2023.
The name “Alex” is a play on the name of FAU’s early benefactor Alexander, Margrave of Brandenburg-Ansbach (1736-1806).
Financing
Alex has been financed by:
- NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively),
- German Research Foundation (DFG) as part of INST 90/1171-1 (440719683),
- seven A100 nodes are dedicated to HS Coburg as part of the BMBF proposal “HPC4AAI” within the call “KI-Nachwuchs@FH”,
- one A100 node is financed by and dedicated to an external group from Erlangen,
- and financial support of FAU to strengthen HPC activities.
This website shows information regarding the following topics:
Access, User Environment, and File Systems
Access to the machine
Note that FAU HPC accounts are not automatically enabled for Tier3 access to Alex. To request Tier3 access to Alex, your project must have extended demands that are not feasible on TinyGPU but are still below the NHR thresholds. You have to provide proof of this extended demand and of the efficiency of your jobs, together with a short description of what you want to do: https://hpc.fau.de/tier3-access-to-alex/.
The rules for NHR access are described on our page on NHR application rules.
Users can connect to alex.nhr.fau.de (keep the “nhr” instead of “rrze” in mind!) and will be randomly routed to one of the two front ends. All systems in the cluster, including the front ends, have private IPv4 addresses in the 10.28.52.0/23 range and IPv6 addresses in the 2001:638:a000:3952::/64 range. They can normally only be accessed directly from within the FAU networks. There is one exception: if your internet connection supports IPv6, you can directly ssh to the front ends (but not to the compute nodes). Otherwise, if you need access from outside of FAU, you usually have to connect, for example, to the dialog server cshpc.rrze.fau.de first and then ssh to alex.nhr.fau.de from there. For HPC portal/NHR users we provide a template that can be added to the local .ssh/config.
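If you connect over IPv4 from outside FAU, a minimal .ssh/config entry along the following lines routes the connection through cshpc.rrze.fau.de via ProxyJump. This is only a sketch and not the official HPC-portal template; the Host aliases and the placeholder account name are assumptions you have to adapt.

# ~/.ssh/config -- minimal sketch; replace <hpc_account> with your own account name
Host cshpc
    HostName cshpc.rrze.fau.de
    User <hpc_account>

Host alex
    HostName alex.nhr.fau.de
    User <hpc_account>
    # hop over the dialog server when connecting from outside the FAU networks
    ProxyJump cshpc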
SSH public host keys of the (externally) reachable hosts
SSH public host keys of alex.nhr.fau.de (as of 11/2021)
ssh-dss AAAAB3NzaC1kc3MAAACBAO/DMbHuyYO6vWXgoeFgaVXFIbg6vldW3ViGOJSd/yVopqhB/fxdp4z1SioML9YOSNepr58xpgoFXFpM+DgRgwcIMBYbV3CeyPYoF4ZAvVwkQLGZh5zmn1Zxd6U3B49aZaEYnItRO6VKGW/Bm6cKY3H+FW5NUa8u+CQOjbjCmixBAAAAFQDpdsCURZAgCd8durljTJHF2AMR+wAAAIAWxlbOXYcMdgmYWE7Af3CyKysbaC1whHNiWOK3v4b0HEZ3CWQe50rrZWDzTKyal0AkncghPMusz5hqZCbZC3DrAParSTwk8RGXsbRm6O/cF3JBKP6IhIBvc8kEVaqFeyDuFwMXwzwQU8x4esAkIu+GDCiCADlhiGSf2Uw6pEds+gAAAIEAgVxOFD9eFM+pDMSw/NlyVXSVA512uC4/JnrHDfY+6SjhuOcfd5JOWjDNYxKO0xPruj0H/TpAI+h90/yUHUff9F/g8rPg9S55DtsUyJHY8B9mm7/mKnJfcT68EBheH00Vl4yGLFu6q4mmfwoHVSoV7QikVj5vOlmWOVMjJen8NKc= alex.nhr.fau.de ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCIUVUt/y9dOrXP3aZ3meUF8s77/d+sk/F31tMnw2TNL4mk6J5Ylk2SOtDL7GTCrxmj3/RXMrrPCKO8FDJR2SzE= alex.nhr.fau.de ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINPXIVMupI341xGq6Gb5agSqurqTSssyBORWKx3wAU0p alex.nhr.fau.de ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDMOn7p0mPhTeZndNjnLIF7RKeA+WaXz4vJ0lFEo7cpXV9I5AmbduM/GkzEdGNAvVgWmcYtW3R53R23c1eikSFFx6aUaK1rb0kp2SYlh+JUvXRLIg+oIK47Do3lQ7qDas1Q7U9wssHr1wrs5g6dsQj+v7UFJcCAqcAz4KfxaJrG8MkYpI0P38TSe3p39+ObDv+NoBKobHgR9kyYGx5tgLC8YFakBoBkgUJvgIEVBsSz4InPQfZjFchw31+wYgeuQykLA7OpE3kHbPv8WlXf+n9Rt0fguGJnLJcGT1WzdeG1Y7njDC6mj92pNUoLr8KvoE7Qq/i5Wt3PAOWP4/lUywpbPVPso5z8h6vo99mhdg3N/zs8sL5jEfCWGyGAvoxGI91JxDBFE9GJTNwI6nrFx9Qb2lw8JUnO6L/yPj2dBKd3zdAgikg6Wh8NqA0Gb9RbWGk6zidsO1y8mvrg9y1r20MXkFYsHMMrcslym2yvRVj2zJeLOPDA8S4knsY9UGudE8E= alex.nhr.fau.de
fingerprints of the SSH public host keys of alex.nhr.fau.de (as of 11/2021)
1024 SHA256:4f7EsRXG9U6L7ZnGfQYV+IFLsF1l7wfGrf9zGZMgl7A alex.nhr.fau.de (DSA)
256 SHA256:0lZv5WzJGvZkdP+zGZY9bKhPucKyhLQIkJzsC9y0T00 alex.nhr.fau.de (ECDSA)
256 SHA256:53K9MoZ920hbooWthNUv84ubES6kpxjkVSiy0kcoYc8 alex.nhr.fau.de (ED25519)
3072 SHA256:kA0Or+7QAuRikKp6MzDQNeAxg1j4NV/1hbkp9IlKJXw alex.nhr.fau.de (RSA)
SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)
ssh-dss AAAAB3NzaC1kc3MAAACBAO2L8+7bhJm7OvvJMcdGSJ5/EaxvX5RRzE9RrB8fx5H69ObkqC6Baope4rOS9/+2gtnm8Q3gZ5QkostCiKT/Wex0kQQUmKn3fx6bmtExLq8YwqoRXRmNTjBIuyZuZH9w/XFK36MP63p/8h7KZXvkAzSRmNVKWzlsAg5AcTpLSs3ZAAAAFQCD0574+lRlF0WONMSuWeQDRFM4vwAAAIEAz1nRhBHZY+bFMZKMjuRnVzEddOWB/3iWEpJyOuyQWDEWYhAOEjB2hAId5Qsf+bNhscAyeKgJRNwn2KQMA2kX3O2zcfSdpSAGEgtTONX93XKkfh6JseTiFWos9Glyd04jlWzMbwjdpWvwlZjmvPI3ATsv7bcwHji3uA75PznVUikAAACBANjcvCxlW1Rjo92s7KwpismWfcpVqY7n5LxHfKRVqhr7vg/TIhs+rAK1XF/AWxyn8MHt0qlWxnEkbBoKIO5EFTvxCpHUR4TcHCx/Xkmtgeq5jWZ3Ja2bGBC3b47bHHNdDJLU2ttXysWorTXCoSYH82jr7kgP5EV+nPgwDhIMscpk cshpc.rrze.fau.de ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNVzp97t3CxlHtUiJ5ULqc/KLLH+Zw85RhmyZqCGXwxBroT+iK1Quo1jmG6kCgjeIMit9xQAHWjS/rxrlI10GIw= cshpc.rrze.fau.de ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPSIFF3lv2wTa2IQqmLZs+5Onz1DEug8krSrWM3aCDRU cshpc.rrze.fau.de 1024 35 135989634870042614980757742097308821255254102542653975453162649702179684202242220882431712465065778248253859082063925854525619976733650686605102826383502107993967196649405937335020370409719760342694143074628619457902426899384188195801203193251135968431827547590638365453993548743041030790174687920459410070371 cshpc.rrze.fau.de ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAs0wFVn1PN3DGcUtd/JHsa6s1DFOAu+Djc1ARQklFSYmxdx5GNQMvS2+SZFFa5Rcw+foAP9Ks46hWLo9mOjTV9AwJdOcSu/YWAhh+TUOLMNowpAEKj1i7L1Iz9M1yrUQsXcqDscwepB9TSSO0pSJAyrbuGMY7cK8m6//2mf7WSxc= cshpc.rrze.fau.de
fingerprints of the SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)
1024 SHA256:A82eA7py46zE/TrSTCRYnJSW7LZXY16oOBxstJF3jxU cshpc.rrze.fau.de (DSA)
256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU cshpc.rrze.fau.de (ECDSA)
256 SHA256:is52MRsxMgxHFn58o0ZUh8vCzIuE2gYanmhrxdy0rC4 cshpc.rrze.fau.de (ED25519)
1024 SHA256:Za1mKhTRFDXUwn7nhPsWc7py9a6OHqS2jin01LJC3ro cshpc.rrze.fau.de (RSA)
While it is possible to ssh directly to a compute node, users are only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically. If you have multiple batch jobs running on the same node, the ssh process will be added to the cgroup of one automatically selected job and only the GPUs of that job will be visible.
The login nodes can access the Internet through NAT; the compute nodes cannot!
Software environment
The login and compute nodes run AlmaLinux 8 (which is basically Red Hat Enterprise Linux 8 without the support).
The login shell for all users on Alex is always bash and cannot be changed.
As on many other HPC systems, environment modules are used to facilitate access to software packages. Type “module avail” to get a list of available packages. Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager; this includes the CUDA toolkit (module “cuda”) and the cuDNN library (module “cudnn”). Only the Nvidia device driver is installed as part of the operating system.
General notes on how to use certain software on our systems (including in some cases sample job scripts) can be found on the Special applications, and tips & tricks pages. Specific notes on how some software provided via modules on the Alex cluster has been compiled can be found in the following accordion:
Intel tools (compiler, MPI, MKL, TBB)
The modules intel (and the Spack-internal intel-oneapi-compilers) provide the legacy Intel compilers icc, icpc, and ifort as well as the new LLVM-based ones (icx, icpx, dpcpp, ifx).
Recommended compiler flags are: -O3 -mavx2 -mfma
The modules intelmpi (and the Spack-internal intel-oneapi-mpi) provide Intel MPI. To use the legacy Intel compilers with Intel MPI, you just have to use the appropriate wrappers with the Intel compiler names, i.e. mpiicc, mpiicpc, mpiifort. To use the new LLVM-based Intel compilers with Intel MPI, you have to specify them explicitly, i.e. use mpiicc -cc=icx, mpiicpc -cxx=icpx, or mpiifort -fc=ifx. The execution of mpicc, mpicxx, and mpif90 results in using the GNU compilers.
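As a hedged illustration of the wrapper usage described above (the module names are those listed here; the source file app.c and the output names are placeholders), a compile line with the recommended flags could look like this:

# load the compiler and MPI modules described above
module load intel intelmpi

# legacy classic Intel compiler via the Intel MPI wrapper, with the recommended flags
mpiicc -O3 -mavx2 -mfma -o app_classic app.c

# LLVM-based Intel compiler, selected explicitly through the same wrapper
mpiicc -cc=icx -O3 -mavx2 -mfma -o app_llvm app.c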
The modules mkl and tbb (and the Spack-internal intel-oneapi-mkl and intel-oneapi-tbb) provide Intel MKL and TBB. Use Intel’s MKL link line advisor to figure out the appropriate command line for linking with MKL. Intel MKL also includes drop-in wrappers for FFTW3.
Alex has AMD processors; thus, Intel MKL might not give optimal performance in all cases, but it usually still delivers better performance than most other mathematical libraries. In previous versions of Intel MKL, setting the environment variables MKL_DEBUG_CPU_TYPE=5 and MKL_CBWR=AUTO improved performance on AMD processors. This no longer works with recent MKL versions; see also https://github.com/tensorflow/tensorflow/issues/49113 and https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html. NHR@FAU does not promote these workarounds; however, if you nevertheless follow them by setting LD_PRELOAD, do not forget to still set MKL_CBWR=AUTO.
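For orientation only, a common sequential LP64 link line on Linux (the pattern the MKL link line advisor typically produces; verify it for your compiler and MKL version, app.c is a placeholder) looks like this:

# $MKLROOT is set when the mkl module is loaded
icx -O3 -mavx2 -mfma -o app app.c \
    -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
    -lpthread -lm -ldl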
Further Intel tools may be added in the future.
The Intel modules on Fritz, Alex, and the Slurm-based TinyGPU/TinyFAT behave differently than on the older RRZE systems: (1) The intel64 module has been renamed to intel and no longer automatically loads intel-mpi and mkl. (2) intel-mpi/VERSION-intel and intel-mpi/VERSION-gcc have been unified into intel-mpi/VERSION. The compiler is selected by the wrapper name, e.g. mpicc = GCC, mpiicc = Intel; mpif90 = GFortran; mpiifort = Intel.
Nvidia compilers (CUDA and formerly PGI)
The CUDA compilers are part of the cuda modules.
The Nvidia (formerly PGI) compilers are part of the nvhpc modules.
Multi-Process Service (MPS daemon)
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
Using MPS with single-GPU jobs
# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d

# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 &
./a.out -param 3 &
./a.out -param 4 &
wait

# stop the MPS daemon
echo quit | nvidia-cuda-mps-control
Using MPS with multi-GPU jobs
# set necessary environment variables and start the MPS daemon
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
echo "starting mps server for $GPU"
export CUDA_VISIBLE_DEVICES=$GPU
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log-${GPU}.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
done
# do your work - you may need to set CUDA_MPS_PIPE_DIRECTORY correctly per process!!
...
# cleanup MPS
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
echo "stopping mps server for $GPU"
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
echo 'quit' | nvidia-cuda-mps-control
done
See also http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html and https://stackoverflow.com/questions/36015005/cuda-mps-servers-fail-to-start-on-workstation-with-multiple-gpus.
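As a hedged sketch of the “set CUDA_MPS_PIPE_DIRECTORY correctly per process” remark in the loop above (./a.out is again only a placeholder for your application), each client process has to point at the pipe directory of the MPS server that serves its GPU:

# launch one worker per GPU; each client connects to "its" MPS server via the pipe directory
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
    CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID ./a.out &
done
wait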
Open MPI
Using srun instead of mpirun is recommended; a usage sketch follows the list below. Open MPI is built using Spack:
- with the compiler mentioned in the module name; the corresponding compiler will be loaded as a dependency when the Open MPI module is loaded
- with support for CUDA (cuda/11.5 as of 11/2021)
- without support for thread-multiple
- with fabrics=ucx
- with support for Slurm as scheduler (and internal PMIx of Open MPI)
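A minimal usage sketch under these assumptions (the exact module name/version is a placeholder, check module avail openmpi; ./mpi_cuda_application is a placeholder binary):

# inside a Slurm batch job
module load openmpi/4.1.1-gcc-cuda   # placeholder name; pick a version from "module avail openmpi"
# srun is the recommended launcher; it picks up tasks, cores, and GPUs from the allocation
srun ./mpi_cuda_application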
Python, conda environments, Tensorflow, and Pytorch
Do not rely on the Python installation from the operating system. Use our python modules instead. These installations will be updated in place from time to time. We can add further packages from the Miniconda distribution as needed.
You can modify the Python environment as follows:
Set the location where pip and conda install packages to $WORK; see Python and Jupyter for details. By default, packages will be installed in $HOME, which has limited capacity.
Extend the base environment:
$ pip install --user <packages>
Create a new environment of your own:
$ conda create -n <environment_name> <packages>
Clone and modify the base environment:
$ conda create --name myclone --clone base
$ conda install --name myclone new_package
See also https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html.
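A hedged example of keeping a conda environment on $WORK rather than in $HOME (the path $WORK/conda/myenv and the package list are placeholders; the Python and Jupyter page describes the officially supported configuration):

# load one of the provided python modules first
module load python
# create the environment below $WORK to spare the limited $HOME quota
conda create --prefix $WORK/conda/myenv python=3.9 numpy
# activate it via its path
conda activate $WORK/conda/myenv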
We also provide some specialized Python modules, e.g., python/pytorch-1.10py3.9 and python/tensorflow-2.7.0py3.9 (as of Jan 2022).
If you do not use these modules, note that cuda and cudnn are separate modules and that the login nodes do not have a GPU installed.
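Since the login nodes have no GPU, a quick sanity check such as the following has to run inside a batch allocation (a sketch; the module name is one of the examples above, and the partition and walltime are arbitrary):

# request one A40 interactively, then verify that PyTorch sees the GPU
salloc --gres=gpu:a40:1 --partition=a40 --time=00:10:00
module load python/pytorch-1.10py3.9
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"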
For security reasons, it is not recommended to run TensorBoard on a multi-user system. TensorBoard does not come with any means of access control, and anyone with access to the multi-user system can attach to your TensorBoard port and act as you! (It might only take some effort to find the port if you do not use the default port.) There is nothing NHR@FAU can do to mitigate these security issues. (We patched the preinstalled TensorBoard version according to https://github.com/tensorflow/tensorboard/pull/5570 so that using a hash is enforced.) Even the hint about --host localhost in https://github.com/tensorflow/tensorboard/issues/260#issuecomment-471737166 does not help on a multi-user system. The suggestion from https://github.com/tensorflow/tensorboard/issues/267#issuecomment-671820015 does not help either on a multi-user system.
Instead, we recommend using TensorBoard on your local machine with the HPC file system mounted (e.g. via sshfs).
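A hedged sketch of this local-TensorBoard workflow (the local mount point and the remote log path are placeholders; sshfs and TensorBoard must be installed on your own machine):

# on your local machine: mount your remote work directory
mkdir -p ~/alex-work
sshfs <hpc_account>@alex.nhr.fau.de:/path/to/your/workdir ~/alex-work
# run TensorBoard locally on the mounted logs, bound to localhost only
tensorboard --logdir ~/alex-work/experiment/logs --host localhost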
Arm DDT
Arm DDT is a powerful parallel debugger. NHR@FAU holds a license for 32 processes and 4 GPUs.
Amber
NHR@FAU holds a “compute center license” of Amber 20 and 22; thus, Amber is generally available to everyone for non-profit use, i.e. academic research.
Amber usually delivers the most economic performance if only one GPGPU is used. The correct PMEMD binary then is pmemd.cuda.
The amber/20p12-at21p11-ompi-gnu-cuda11.5 module from 11/2021 contains the additional bug fix discussed in http://archive.ambermd.org/202110/0210.html / http://archive.ambermd.org/202110/0218.html.
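A hedged single-GPU job sketch (the module name is the one from above; the Amber input, topology, and coordinate file names are the usual placeholders and have to be replaced by your own):

#!/bin/bash -l
#SBATCH --gres=gpu:a40:1
#SBATCH --partition=a40
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load amber/20p12-at21p11-ompi-gnu-cuda11.5

# single-GPU PMEMD run; mdin, prmtop, and inpcrd are placeholder file names
pmemd.cuda -O -i mdin -p prmtop -c inpcrd -o mdout -r restrt -x mdcrd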
Gromacs
We provide Gromacs versions without and with PLUMED. Gromacs (and PLUMED) are built using Spack.
Gromacs usually delivers the most economic performance if only one GPGPU is used together with the thread-MPI implementation of Gromacs (no mpirun needed; the number of processes is specified directly via the gmx mdrun command-line argument -ntmpi). Therefore, a “real” MPI version of Gromacs is only provided together with PLUMED. In that case the binary name is gmx_mpi, and it must be started with srun or mpirun like any other MPI program.
Do not start gmx mdrun with the option -v. The verbose output will only create extra-large Slurm stdout files, and your jobs will suffer if the NFS servers are under high load. There is also very limited use in continuously seeing in the stdout when the job is expected to reach the specified number of steps.
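A hedged single-GPU thread-MPI sketch (the module name, the input file topol.tpr, and the GPU-offload choices are placeholders; check module avail gromacs and adjust to your case):

#!/bin/bash -l
#SBATCH --gres=gpu:a40:1
#SBATCH --partition=a40
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load gromacs   # placeholder; pick a concrete version

# one thread-MPI rank with 16 OpenMP threads (the cores granted per A40 GPU),
# non-bonded interactions offloaded to the GPU
gmx mdrun -ntmpi 1 -ntomp 16 -nb gpu -s topol.tpr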
LAMMPS
The modules lammps/20201029-gcc10.3.0-openmpi-mkl-cuda and lammps/20211027-gcc10.3.0-openmpi-mkl-cuda have been compiled using GCC 10.3.0, Intel oneAPI MKL, Open MPI 4.1.1, and with
- GPU package API: CUDA; GPU package precision: mixed; for sm_80
- KOKKOS package API: CUDA OpenMP Serial; KOKKOS package precision: double; for sm_80
- Installed packages
- for 20201029: ASPHERE BODY CLASS2 COLLOID COMPRESS CORESHELL DIPOLE GPU GRANULAR KIM KOKKOS KSPACE LATTE MANYBODY MC MISC MOLECULE MPIIO PERI POEMS PYTHON QEQ REPLICA RIGID SHOCK SNAP SPIN SRD USER-ATC USER-H5MD USER-LB USER-MEAMC USER-MISC USER-NETCDF USER-OMP USER-REAXC VORONOI
- for 20211027: ASPHERE BODY CLASS2 COLLOID COMPRESS CORESHELL DIPOLE GPU GRANULAR KIM KOKKOS KSPACE LATTE MANYBODY MC MISC MOLECULE MPIIO PERI POEMS PYTHON QEQ REPLICA RIGID SHOCK SPIN SRD VORONOI
The LAMMPS binary is called just lmp.
Run module avail lammps to see all currently installed LAMMPS modules. Allocate an interactive job and run mpirun -np 1 lmp -help to see which LAMMPS packages have been included in a specific build.
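A hedged sketch of running the GPU package on one A40 (in.lj is a placeholder input; -sf and -pk are the standard LAMMPS suffix and package switches):

#!/bin/bash -l
#SBATCH --ntasks=16
#SBATCH --gres=gpu:a40:1
#SBATCH --partition=a40
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load lammps/20211027-gcc10.3.0-openmpi-mkl-cuda

# 16 MPI ranks sharing one GPU via the GPU package; in.lj is a placeholder input
srun lmp -sf gpu -pk gpu 1 -in in.lj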
NAMD
NAMD comes with a license which prohibits us from “just installing it so that everyone can use it”. We therefore need individual users to print and sign the NAMD license. Subsequently, we will set the permissions accordingly.
At the moment, we provide the official pre-built Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration) binary.
VASP
VASP comes with a license which prohibits us from “just installing it so that everyone can use it”. We have to individually check each VASP user.
At the moment we provide VASP 6.2.3 and 6.3.0 to eligible users in two different variants:
- vasp/6.x.y-nccl – NCCL stands for Nvidia Collective Communication Library and is basically a library for direct GPU-to-GPU communication. However, NCCL only allows one MPI rank per GPU. In 6.2.1 you can disable NCCL via the input file, but sadly the test suite will still fail.
- vasp/6.x.y-nonccl – in certain cases, one MPI rank per GPU is not enough to saturate a single A100. When you use multiple ranks per GPU, you should also use the so-called MPS server. See “Multi-Process Service (MPS daemon)” above on how to start MPS even in the case of multiple GPUs.
VASP 6.3.0 has been compiled with the new HDF5 support.
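For the nonccl variant with several MPI ranks per GPU, a hedged job sketch combining the MPS recipe from above with a VASP module could look as follows (the module name, the choice of four ranks, and the vasp_std binary name are assumptions to be checked against the installed module):

#!/bin/bash -l
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --ntasks=4
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load vasp/6.3.0-nonccl   # placeholder; check "module avail vasp"

# start MPS so that several ranks can share the single A100 efficiently
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d

# four ranks on one GPU is only an example; benchmark your own case
srun vasp_std

# stop the MPS daemon
echo quit | nvidia-cuda-mps-control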
Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, yet support for self-installed software cannot be granted. We can only provide software centrally which is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to make use of the packages we already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias); you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).
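A brief usage sketch (fftw is just an arbitrary example package; spack info, spack install, and spack load are standard Spack commands):

# make the central Spack installation and its presets available
module load user-spack

# inspect and install an additional package into $WORK/USER-SPACK
spack info fftw
spack install fftw

# make the self-installed package usable in the current shell
spack load fftw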
You can also bring your own environment in a container using Singularity/Apptainer.
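A minimal container sketch (the image my_image.sif and the command inside it are placeholders; --nv makes the Nvidia driver and the allocated GPUs visible inside the container):

# inside a batch job on a GPU node
apptainer exec --nv my_image.sif python train.py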
File Systems
The following table summarizes the available file systems and their features. It is only an excerpt from the description of the HPC file system.
Further details will follow.
Quota in $HOME is very limited as snapshots are taken every 30 minutes. Put simulation data into $WORK! Do not rely on the specific path of $WORK, as this may change over time when your work directory is relocated to a different NFS server.
Batch processing
As with all production clusters at RRZE, resources are controlled through a batch system. The front ends can be used for compiling and very short serial test runs which do not require a GPU, but everything else has to go through the batch system to the cluster.
Alex uses SLURM as a batch system. Please see our general batch system description for further details.
For every batch job, you have to specify the number of GPUs that should be allocated to your job. For single-node jobs, the compute nodes are not allocated exclusively but are shared among several jobs – the GPUs themselves are always granted exclusively. Resources are granted on a per-GPU basis. The corresponding share of the resources of the host system (CPU cores, RAM) is automatically allocated: for each A40 you get 16 CPU cores and 60 GB RAM on the host assigned; for each A100 you get 16 CPU cores and 120 GB RAM on the host assigned.
If your application is able to use more than one node and its corresponding GPUs efficiently, multi-node jobs are available on demand. In this case, the nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.
| Partition | min – max walltime | min – max GPUs | --gres (with # being the number of requested GPUs) | automatically assigned cores | automatically assigned host memory | Comments |
|---|---|---|---|---|---|---|
| a40 | 0 – 24:00:00 (max. 6h for interactive jobs) | 1-8 | --gres=gpu:a40:# | 16 per GPU | 60 GB per GPU | Jobs run on a node with Nvidia A40 GPGPUs; the GPGPUs are exclusive but the node may be shared and jobs are confined to their cgroup. |
| a100 | 0 – 24:00:00 (max. 6h for interactive jobs) | 1-8 | --gres=gpu:a100:# | 16 per GPU | 120 GB per GPU | Jobs run on a node with Nvidia A100 GPGPUs; the GPGPUs are exclusive but the node may be shared and jobs are confined to their cgroup. |
Multi-node jobs are only available on demand for NHR projects, not within the free Tier3/FAU access. If you run an NHR project and are interested, please contact us.
Example Slurm Batch Scripts
For the most common use cases, examples are provided below. Please see our general batch system description for further details. Note that these scripts possibly have to be adapted to your specific application and use case!
Interactive job (single-node)
Interactive jobs can be requested by using salloc
instead of sbatch
and specifying the respective options on the command line.
The following will give you an interactive shell on one of the A40 nodes for one hour:
salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00
Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!
MPI parallel job (single-node)
In this example, the executable will be run using 16 MPI processes (i.e. one per physical core) for a total job walltime of 6 hours. The job automatically allocates one A40 GPU and the corresponding share of CPUs and main memory (16 cores and 60 GB RAM).
#!/bin/bash -l
#
#SBATCH --ntasks=16
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:a40:1
#SBATCH --partition=a40
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load XXX

srun ./cuda_application
Hybrid MPI/OpenMP job (single-node)
In this example, one A100 GPU is allocated. The executable will be run using 2 MPI processes with 8 OpenMP threads each for a total job walltime of 6 hours. 16 cores are allocated in total and each OpenMP thread is running on a physical core.
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved with the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load XXX

# cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./cuda_application
OpenMP job (single-node)
In this example, the executable will be run using 16 OpenMP threads for a total job walltime of 6 hours. One A100 GPU and the corresponding 16 cores are allocated in total.
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved with the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
#!/bin/bash -l
#
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=06:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load XXX

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./cuda_application
Multi-node Job (available on demand for NHR projects)
In this case, your application has to be able to use more than one node and its corresponding GPUs at the same time. The nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.
Adjust the options --nodes and --ntasks-per-node to your application. Since the nodes are allocated exclusively, there is usually no need to specify --cpus-per-task for hybrid OpenMP/MPI applications. However, the correct value of $OMP_NUM_THREADS has to be set explicitly since OpenMP is not Slurm-aware!
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:a100:8
#SBATCH --partition=a100
#SBATCH --qos=a100multi
#SBATCH --time=01:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load XXX

srun ./cuda_application
Attach to a running job
On the frontend node, the following steps are necessary:
- Check on which node the job is running with squeue.
- If you have only one job running on a node, you can use ssh <nodename> to connect to it. You will be placed in the allocation of the job.
- If you have multiple jobs running on a node, use srun --jobid=<jobid> --overlap --pty /bin/bash to attach to a specific job.
Attaching to a running job can be used e.g. to check GPU utilization via nvidia-smi. For more information on nvidia-smi and GPU profiling, see Working with NVIDIA GPUs.
Further Information
AMD EPYC 7713 “Milan” Processor
Each node has two processor chips. The specs per processor chip are as follows:
- # of CPU Cores: 64
- # of Threads: 128 – hyperthreading (SMT) is disabled on Alex for security reasons; thus, threads and physical cores are identical
- Max. Boost Clock: Up to 3.675 GHz
- Base Clock: 2.0 GHz
- Default TDP: 225W; AMD Configurable TDP (cTDP): 225-240W
- Total L3 Cache: 256MB
- System Memory Type: DDR4 @ 3,200 MHz
- Memory Channels: 8 – these can be arranged in 1-4 ccNUMA domains (“NPS” setting); Alex is running with NPS=4
- Theoretical per Socket Mem BW: 204.8 GB/s
Specs of an Nvidia A40 vs. A100 GPGPU
| | A40 | A100 (SXM) |
|---|---|---|
| GPU architecture | Ampere; SM_86, compute_86 | Ampere; SM_80, compute_80 |
| GPU memory | 48 GB GDDR6 with ECC (ECC disabled on Alex) | 40 GB HBM2 / 80 GB HBM2 |
| Memory bandwidth | 696 GB/s | 1,555 GB/s / 2,039 GB/s |
| Interconnect interface | PCIe Gen4 31.5 GB/s (bidirectional) | NVLink: 600 GB/s |
| CUDA Cores (Ampere generation) | 10,752 (84 SMs) | 6,912 (108 SMs) |
| RT Cores (2nd generation) | 84 | |
| Tensor Cores (3rd generation) | 336 | 432 |
| FP64 TFLOPS (non-Tensor) | 0.5 | 9.7 |
| FP64 Tensor TFLOPS | | 19.5 |
| Peak FP32 TFLOPS (non-Tensor) | 37.4 | 19.5 |
| Peak TF32 Tensor TFLOPS | 74.8 | 156 |
| Peak FP16 Tensor TFLOPS with FP16 Accumulate | 149.7 | 312 |
| Peak BF16 Tensor TFLOPS with FP32 Accumulate | 149.7 | 312 |
| RT Core performance TFLOPS | 73.1 | ? |
| Peak INT8 Tensor TOPS / Peak INT4 Tensor TOPS | 299.3 / 598.7 | 624 / 1,248 |
| Max power consumption | 300 W | 400 W |
| Price | $$ | $$$$ |
A40 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a40/proviz-print-nvidia-a40-datasheet-us-nvidia-1469711-r8-web.pdf (11/2021).
A100 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf (11/2021).
Nvidia A40 GPGPU nodes

The Nvidia A40 GPGPUs (like the GeForce RTX 3080 consumer cards) belong to the Ampere generation. The native architecture is SM_86 (compute_86).
All eight A40 GPGPUs of a node are connected to two PCIe switches. Thus, there is only limited bandwidth to the host system and also between the GPGPUs.
“Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.” (according to https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32)
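Following that recommendation, a hedged nvcc compile line that produces code for both the A100 (sm_80) and the A40 (sm_86) could look like this (kernel.cu and the output name are placeholders):

# build for A100 (sm_80) and A40 (sm_86) in one fat binary
nvcc -O3 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -o app kernel.cu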
Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 CPU Affinity  NUMA Affinity
GPU0    X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS    SYS    0-63          0
GPU1    NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   SYS    SYS    0-63          0
GPU2    NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   SYS    SYS    0-63          0
GPU3    NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   SYS    SYS    0-63          0
GPU4    SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE   NODE   64-127        1
GPU5    SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  NODE   NODE   64-127        1
GPU6    SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  PHB    PHB    64-127        1
GPU7    SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     NODE   NODE   64-127        1
mlx5_0  SYS   SYS   SYS   SYS   NODE  NODE  PHB   NODE  X      PIX    25 GBE
mlx5_1  SYS   SYS   SYS   SYS   NODE  NODE  PHB   NODE  PIX    X      (25 GbE not connected)

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode (current setting of Alex):
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 CPU Affinity  NUMA Affinity
GPU0    X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    SYS    48-63         3
GPU1    SYS   X     SYS   SYS   SYS   SYS   SYS   SYS   SYS    SYS    32-47         2
GPU2    SYS   SYS   X     SYS   SYS   SYS   SYS   SYS   SYS    SYS    16-31         1
GPU3    SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS    SYS    0-15          0
GPU4    SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS    SYS    112-127       7
GPU5    SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS    SYS    96-111        6
GPU6    SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   PHB    PHB    80-95         5
GPU7    SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS    SYS    64-79         4
mlx5_0  SYS   SYS   SYS   SYS   SYS   SYS   PHB   SYS   X      PIX    25 GBE
mlx5_1  SYS   SYS   SYS   SYS   SYS   SYS   PHB   SYS   PIX    X      (25 GbE not connected)

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
Nvidia A100 GPGPU nodes

The Nvidia A100 GPGPUs belong to the Ampere generation. The native architecture is SM_80 (compute_80).
All four or eight A100 GPGPUs of a node are directly connected with each other through an NVSwitch providing 600 GB/s GPU-to-GPU bandwidth for each GPGPU.
Topology of the quad A100 nodes according to nvidia-smi topo -m; no 25 GbE / HDR200 cards yet; AMD Rome processor in NPS=2 mode:
        GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0    X     NV4   NV4   NV4   32-63         1
GPU1    NV4   X     NV4   NV4   0-31          0
GPU2    NV4   NV4   X     NV4   96-127        3
GPU3    NV4   NV4   NV4   X     64-95         2

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  NV#  = Connection traversing a bonded set of # NVLinks
Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity  NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB    NODE   NODE   SYS    0-63          0
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB    NODE   NODE   SYS    0-63          0
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  NODE   PXB    PXB    SYS    0-63          0
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  NODE   PXB    PXB    SYS    0-63          0
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS    SYS    SYS    NODE   64-127        1
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS    SYS    SYS    NODE   64-127        1
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS    SYS    SYS    PXB    64-127        1
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS    SYS    SYS    PXB    64-127        1
mlx5_0  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   X      NODE   NODE   SYS    HDR200
mlx5_1  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   X      PIX    SYS    25 GbE
mlx5_2  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   PIX    X      SYS    (25 GbE not connected)
mlx5_3  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS    SYS    SYS    X      HDR200

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  NV#  = Connection traversing a bonded set of # NVLinks
Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode (current setting of Alex):
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity  NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB    SYS    SYS    SYS    48-63         3
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB    SYS    SYS    SYS    48-63         3
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  SYS    PXB    PXB    SYS    16-31         1
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  SYS    PXB    PXB    SYS    16-31         1
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS    SYS    SYS    SYS    112-127       7
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS    SYS    SYS    SYS    112-127       7
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS    SYS    SYS    PXB    80-95         5
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS    SYS    SYS    PXB    80-95         5
mlx5_0  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X      SYS    SYS    SYS    HDR200
mlx5_1  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS    X      PIX    SYS    25 GbE
mlx5_2  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   SYS    PIX    X      SYS    (25 GbE not connected)
mlx5_3  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS    SYS    SYS    X      HDR200

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Since 2023-04-28, SVM and IOMMU/VT-d are disabled in the BIOS to better work with Multi Node NCCL.