Alex GPGPU cluster (NHR+Tier3)

FAU’s Alex cluster (system integrator: Megware) is a high-performance compute resource with Nvidia GPGPU accelerators and a partially high-speed interconnect. It is intended for single- and multi-GPGPU workloads, e.g. from molecular dynamics or machine learning. Alex serves both as FAU’s basic Tier3 resource and as an NHR project resource.

  • 2 front end nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 512 GB of RAM, and a 100 GbE connection to RRZE’s network backbone, but no GPGPUs.
  • 20 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 1,024 GB of DDR4-RAM, eight Nvidia A100 (each 40 GB HBM2 @ 1,555 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), two HDR200 Infiniband HCAs, 25 GbE, and 14 TB on local NVMe SSDs.
  • 15 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 2,048 GB of DDR4-RAM, eight Nvidia A100 (each 80 GB HBM2 @ 2,039 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), two HDR200 Infiniband HCAs, 25 GbE, and 14 TB on local NVMe SSDs.
  • 38 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 512 GB of DDR4-RAM, eight Nvidia A40 (each with 48 GB GDDR6 @ 696 GB/s; 37.42 TFlop/s in FP32), 25 GbE, and 7 TB on local NVMe SSDs.

In total, there are 304 Nvidia A40, 160 Nvidia A100/40GB, and 120 Nvidia A100/80GB GPGPUs. The Nvidia A40 GPGPUs have a very high single-precision floating-point performance (even higher than an A100!) and are much less expensive than Nvidia A100 GPGPUs. Workloads that only require single-precision floating-point operations, like many molecular dynamics applications, should therefore target the Nvidia A40 GPGPUs.

Alex complements RRZE’s TinyGPU cluster. Alex addresses high-end GPGPU workloads, while TinyGPU mainly comes with consumer GPUs of different generations which nevertheless provide an unbeatable price-performance ratio for single-precision floating-point applications that require only little GPU memory. TinyGPU currently also includes 8 nodes with a total of 32 Nvidia A100/40GB GPGPUs; these nodes may be moved to Alex at a later point in time.

On 160 Nvidia A100/40GB GPGPUs, a LINPACK performance of 1.73 PFlop/s was measured in January 2022.
On 160 Nvidia A100/40GB plus 96 Nvidia A100/80GB GPGPUs (i.e. 32 nodes), a LINPACK performance of 2.938 PFlop/s was measured in May 2022, resulting in place 184 of the June 2022 Top500 list and place 17 in the June 2022 Green500.
On 160 Nvidia A100/40GB plus 120 Nvidia A100/80GB GPGPUs (i.e. 35 nodes), a LINPACK performance of 3.24 PFlop/s was measured in October 2022, resulting in place 174 of the November 2022 Top500 list and place 33 in the November 2022 Green500.

The name “Alex” is a play on the name of FAU’s early benefactor Alexander, Margrave of Brandenburg-Ansbach (1736-1806).

Alex has been financed by:

  • German Research Foundation (DFG) as part of INST 90/1171-1 (440719683),
  • NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively),
  • seven A100 nodes are dedicated to HS Coburg as part of the BMBF proposal “HPC4AAI” within the call “KI-Nachwuchs@FH”,
  • one A100 node is financed by and dedicated to an external group from Erlangen,
  • and financial support of FAU to strengthen HPC activities.

This page provides information on the following topics:

  • Access, User Environment, File Systems
    • Access to the machine
    • Software environment
    • File systems
    • Batch processing
  • Further Information
    • Technical data: AMD EPYC 7713 “Milan” processor
    • Specs of an Nvidia A40 vs. A100 GPGPU
    • Technical data: Nvidia A40 GPGPU nodes
    • Technical data: Nvidia A100 GPGPU nodes

Access, User Environment, and File Systems

Access to the machine

Note that FAU HPC accounts are not automatically enabled for Tier3 access to Alex. To request Tier3 access to Alex, you need to work on a project with extended demands that are not feasible on TinyGPU but still below the NHR thresholds. You have to provide proof of that extended demand and of the efficiency of your jobs, and give a short description of what you want to do there: https://hpc.fau.de/tier3-access-to-alex/.

The rules for NHR access are described on our page on NHR application rules.

Users can connect to alex.nhr.fau.de (keep the “nhr” instead of “rrze” in mind!) and will be randomly routed to one of the two front ends. All systems in the cluster, including the front ends, have private IPv4 addresses in the 10.28.52.0/23 range and IPv6 addresses in the 2001:638:a000:3952::/64 range. They can normally only be accessed directly from within the FAU networks. There is one exception: if your internet connection supports IPv6, you can directly ssh to the front ends (but not to the compute nodes). Otherwise, if you need access from outside of FAU, you usually have to connect to the dialog server cshpc.rrze.fau.de first and then ssh to alex.nhr.fau.de from there. For HPC portal/NHR users we provide a template that can be added to the local .ssh/config.
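A minimal sketch of such an entry (this is not the official template; the host alias and the account placeholder are illustrative):

Host alex
    HostName alex.nhr.fau.de
    User <your_hpc_account>
    ProxyJump cshpc.rrze.fau.de

With an entry like this, "ssh alex" transparently hops via the dialog server; if you connect directly over IPv6, the ProxyJump line is not needed.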

SSH public host keys of the (externally) reachable hosts

SSH public host keys of alex.nhr.fau.de (as of 11/2021)

ssh-dss AAAAB3NzaC1kc3MAAACBAO/DMbHuyYO6vWXgoeFgaVXFIbg6vldW3ViGOJSd/yVopqhB/fxdp4z1SioML9YOSNepr58xpgoFXFpM+DgRgwcIMBYbV3CeyPYoF4ZAvVwkQLGZh5zmn1Zxd6U3B49aZaEYnItRO6VKGW/Bm6cKY3H+FW5NUa8u+CQOjbjCmixBAAAAFQDpdsCURZAgCd8durljTJHF2AMR+wAAAIAWxlbOXYcMdgmYWE7Af3CyKysbaC1whHNiWOK3v4b0HEZ3CWQe50rrZWDzTKyal0AkncghPMusz5hqZCbZC3DrAParSTwk8RGXsbRm6O/cF3JBKP6IhIBvc8kEVaqFeyDuFwMXwzwQU8x4esAkIu+GDCiCADlhiGSf2Uw6pEds+gAAAIEAgVxOFD9eFM+pDMSw/NlyVXSVA512uC4/JnrHDfY+6SjhuOcfd5JOWjDNYxKO0xPruj0H/TpAI+h90/yUHUff9F/g8rPg9S55DtsUyJHY8B9mm7/mKnJfcT68EBheH00Vl4yGLFu6q4mmfwoHVSoV7QikVj5vOlmWOVMjJen8NKc= alex.nhr.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCIUVUt/y9dOrXP3aZ3meUF8s77/d+sk/F31tMnw2TNL4mk6J5Ylk2SOtDL7GTCrxmj3/RXMrrPCKO8FDJR2SzE= alex.nhr.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINPXIVMupI341xGq6Gb5agSqurqTSssyBORWKx3wAU0p alex.nhr.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDMOn7p0mPhTeZndNjnLIF7RKeA+WaXz4vJ0lFEo7cpXV9I5AmbduM/GkzEdGNAvVgWmcYtW3R53R23c1eikSFFx6aUaK1rb0kp2SYlh+JUvXRLIg+oIK47Do3lQ7qDas1Q7U9wssHr1wrs5g6dsQj+v7UFJcCAqcAz4KfxaJrG8MkYpI0P38TSe3p39+ObDv+NoBKobHgR9kyYGx5tgLC8YFakBoBkgUJvgIEVBsSz4InPQfZjFchw31+wYgeuQykLA7OpE3kHbPv8WlXf+n9Rt0fguGJnLJcGT1WzdeG1Y7njDC6mj92pNUoLr8KvoE7Qq/i5Wt3PAOWP4/lUywpbPVPso5z8h6vo99mhdg3N/zs8sL5jEfCWGyGAvoxGI91JxDBFE9GJTNwI6nrFx9Qb2lw8JUnO6L/yPj2dBKd3zdAgikg6Wh8NqA0Gb9RbWGk6zidsO1y8mvrg9y1r20MXkFYsHMMrcslym2yvRVj2zJeLOPDA8S4knsY9UGudE8E= alex.nhr.fau.de

fingerprints of the SSH public host keys of alex.nhr.fau.de (as of 11/2021)

1024 SHA256:4f7EsRXG9U6L7ZnGfQYV+IFLsF1l7wfGrf9zGZMgl7A alex.nhr.fau.de (DSA)
256  SHA256:0lZv5WzJGvZkdP+zGZY9bKhPucKyhLQIkJzsC9y0T00 alex.nhr.fau.de (ECDSA)
256  SHA256:53K9MoZ920hbooWthNUv84ubES6kpxjkVSiy0kcoYc8 alex.nhr.fau.de (ED25519)
3072 SHA256:kA0Or+7QAuRikKp6MzDQNeAxg1j4NV/1hbkp9IlKJXw alex.nhr.fau.de (RSA)

SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

ssh-dss AAAAB3NzaC1kc3MAAACBAO2L8+7bhJm7OvvJMcdGSJ5/EaxvX5RRzE9RrB8fx5H69ObkqC6Baope4rOS9/+2gtnm8Q3gZ5QkostCiKT/Wex0kQQUmKn3fx6bmtExLq8YwqoRXRmNTjBIuyZuZH9w/XFK36MP63p/8h7KZXvkAzSRmNVKWzlsAg5AcTpLSs3ZAAAAFQCD0574+lRlF0WONMSuWeQDRFM4vwAAAIEAz1nRhBHZY+bFMZKMjuRnVzEddOWB/3iWEpJyOuyQWDEWYhAOEjB2hAId5Qsf+bNhscAyeKgJRNwn2KQMA2kX3O2zcfSdpSAGEgtTONX93XKkfh6JseTiFWos9Glyd04jlWzMbwjdpWvwlZjmvPI3ATsv7bcwHji3uA75PznVUikAAACBANjcvCxlW1Rjo92s7KwpismWfcpVqY7n5LxHfKRVqhr7vg/TIhs+rAK1XF/AWxyn8MHt0qlWxnEkbBoKIO5EFTvxCpHUR4TcHCx/Xkmtgeq5jWZ3Ja2bGBC3b47bHHNdDJLU2ttXysWorTXCoSYH82jr7kgP5EV+nPgwDhIMscpk cshpc.rrze.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNVzp97t3CxlHtUiJ5ULqc/KLLH+Zw85RhmyZqCGXwxBroT+iK1Quo1jmG6kCgjeIMit9xQAHWjS/rxrlI10GIw= cshpc.rrze.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPSIFF3lv2wTa2IQqmLZs+5Onz1DEug8krSrWM3aCDRU cshpc.rrze.fau.de
1024 35 135989634870042614980757742097308821255254102542653975453162649702179684202242220882431712465065778248253859082063925854525619976733650686605102826383502107993967196649405937335020370409719760342694143074628619457902426899384188195801203193251135968431827547590638365453993548743041030790174687920459410070371 cshpc.rrze.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAs0wFVn1PN3DGcUtd/JHsa6s1DFOAu+Djc1ARQklFSYmxdx5GNQMvS2+SZFFa5Rcw+foAP9Ks46hWLo9mOjTV9AwJdOcSu/YWAhh+TUOLMNowpAEKj1i7L1Iz9M1yrUQsXcqDscwepB9TSSO0pSJAyrbuGMY7cK8m6//2mf7WSxc= cshpc.rrze.fau.de

fingerprints of the SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

1024 SHA256:A82eA7py46zE/TrSTCRYnJSW7LZXY16oOBxstJF3jxU cshpc.rrze.fau.de (DSA)
256  SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU cshpc.rrze.fau.de (ECDSA)
256  SHA256:is52MRsxMgxHFn58o0ZUh8vCzIuE2gYanmhrxdy0rC4 cshpc.rrze.fau.de (ED25519)
1024 SHA256:Za1mKhTRFDXUwn7nhPsWc7py9a6OHqS2jin01LJC3ro cshpc.rrze.fau.de (RSA)

While it is possible to ssh directly to a compute node, users are only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically. If you have multiple batch jobs running on the same node, the ssh process will be added to the cgroup of one automatically selected job and only the GPUs of that job will be visible.

The login nodes can access the Internet through NAT; the compute nodes cannot!

Software environment

The login and compute nodes run AlmaLinux 8 (which is basically Red Hat Enterprise Linux 8 without the support).

The login shell for all users on Alex is always bash and cannot be changed.

As on many other HPC systems, environment modules are used to facilitate access to software packages. Type “module avail” to get a list of available packages. Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager; this includes the CUDA toolkit (module “cuda”) and the cuDNN library (module “cudnn”). Only the Nvidia device driver is installed as part of the operating system.
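For illustration, a typical sequence on a login node might look like this (only the module names mentioned above are used; exact versions will differ):

module avail                     # list the currently visible packages
module load 000-all-spack-pkgs   # pick one of the available versions to expose further Spack packages
module load cuda cudnn           # CUDA toolkit and cuDNN library
module list                      # show what is currently loaded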

General notes on how to use certain software on our systems (including, in some cases, sample job scripts) can be found on the Special applications, and tips & tricks pages. Specific notes on how some of the software provided via modules on the Alex cluster has been compiled can be found in the following sections:

Intel tools (compiler, MPI, MKL, TBB)

Intel oneAPI is installed in the “Free User” edition via Spack.

The module intel (and the Spack-internal intel-oneapi-compilers) provides the legacy Intel compilers icc, icpc, and ifort as well as the new LLVM-based ones (icx, icpx, dpcpp, ifx).

Recommended compiler flags are:  -O3 -mavx2 -mfma

The module intelmpi (and the Spack-internal intel-oneapi-mpi) provides Intel MPI. To use the legacy Intel compilers with Intel MPI, just use the appropriate wrappers with the Intel compiler names, i.e. mpiicc, mpiicpc, mpiifort. To use the new LLVM-based Intel compilers with Intel MPI, you have to specify them explicitly, i.e. use mpiicc -cc=icx, mpiicpc -cxx=icpx, or mpiifort -fc=ifx. The execution of mpicc, mpicxx, and mpif90 results in using the GNU compilers.
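For example (flags taken from the recommendation above; source file names are placeholders):

# legacy Intel compilers via the Intel MPI wrappers
mpiicc   -O3 -mavx2 -mfma -c solver.c
mpiifort -O3 -mavx2 -mfma -c solver.f90
# the new LLVM-based compilers have to be requested explicitly
mpiicc   -cc=icx -O3 -mavx2 -mfma -c solver.c
mpiifort -fc=ifx -O3 -mavx2 -mfma -c solver.f90
# mpicc/mpicxx/mpif90 fall back to the GNU compilers
mpicc -O3 -c solver.c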

The modules mkl and tbb (and the Spack-internal intel-oneapi-mkl and intel-oneapi-tbb) provide Intel MKL and TBB. Use Intel’s MKL link line advisor to figure out the appropriate command line for linking with MKL. Intel MKL also includes drop-in wrappers for FFTW3.
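As a hedged example only (this assumes the -qmkl convenience flag of recent Intel compilers; for anything non-trivial, use the link line advisor mentioned above):

# link a Fortran code against the sequential MKL; file names are placeholders
mpiifort -O3 -mavx2 -mfma solver.f90 -qmkl=sequential -o solver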

Alex has AMD processors; thus, Intel MKL might not give optimal performance in all cases – but it usually still delivers better performance than most other mathematical libraries. In previous versions of Intel MKL, setting the environment variables MKL_DEBUG_CPU_TYPE=5 and MKL_CBWR=AUTO improved performance on AMD processors. This no longer works with recent MKL versions; see also https://github.com/tensorflow/tensorflow/issues/49113 and https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html. NHR@FAU does not promote these workarounds; however, if you nevertheless follow them by setting LD_PRELOAD, do not forget to still set MKL_CBWR=AUTO.

Further Intel tools may be added in the future.

The Intel modules on Fritz, Alex, and the Slurm-based TinyGPU/TinyFat behave differently than on the older RRZE systems: (1) The intel64 module has been renamed to intel and no longer automatically loads intel-mpi and mkl. (2) intel-mpi/VERSION-intel and intel-mpi/VERSION-gcc have been unified into intel-mpi/VERSION. The compiler is selected by the wrapper name, e.g. mpicc = GCC, mpiicc = Intel; mpif90 = GFortran; mpiifort = Intel.

Nvidia compilers (CUDA and formerly PGI)

The CUDA compilers are part of the cuda modules.

The Nvidia (formerly PGI) compilers are part of the nvhpc modules.

Multi-Process Service (MPS daemon)

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.

Using MPS with single-GPU jobs

# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 & 
./a.out -param 3 & 
./a.out -param 4 & 
wait
# stop the MPS daemon
echo quit | nvidia-cuda-mps-control

Using MPS with multi-GPU jobs

# set necessary environment variables and start the MPS daemon
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
    echo "starting mps server for $GPU"
    export CUDA_VISIBLE_DEVICES=$GPU
    export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
    export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log-${GPU}.$SLURM_JOB_ID
    nvidia-cuda-mps-control -d
done
# do your work - you may need to set CUDA_MPS_PIPE_DIRECTORY correctly per process!!
...
# cleanup MPS
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
    echo "stopping mps server for $GPU"
    export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
    echo 'quit' | nvidia-cuda-mps-control
done

See also http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html and https://stackoverflow.com/questions/36015005/cuda-mps-servers-fail-to-start-on-workstation-with-multiple-gpus.

Open MPI

Open MPI is the default MPI for the Alex cluster. Usage of srun instead of mpirun is recommended.

Open MPI is built using Spack:

  • with the compiler mentioned in the module name; the corresponding compiler will be loaded as a dependency when the Open MPI module is loaded
  • with support for CUDA (cuda/11.5 as of 11/2021)
  • without support for thread-multiple
  • with fabrics=ucx
  • with support for Slurm as scheduler (and internal PMIx of Open MPI)

Python, conda environments, Tensorflow, and Pytorch

Do not rely on the Python installation from the operating system. Use our python modules instead. These installations will be updated in place from time to time. We can add further packages from the Miniconda distribution as needed.

You can modify the Python environment as follows:

Set the location where pip and conda install packages to $WORK; see Python and Jupyter for details. By default, packages will be installed in $HOME, which has limited capacity.
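A minimal sketch of such a redirection (the directory names under $WORK are only an example, not a prescribed layout):

export PYTHONUSERBASE=$WORK/python                    # target for "pip install --user"
conda config --prepend envs_dirs $WORK/conda/envs     # where new conda environments are created
conda config --prepend pkgs_dirs $WORK/conda/pkgs     # conda package cache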

Extend the base environment
$ pip install --user <packages>

Create a new one of your own
$ conda create -n <environment_name> <packages>

Clone and modify this environment

$ conda create --name myclone --clone base
$ conda install --name myclone new_package

See also https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html.

We also provide some specialized Python modules, e.g., python/pytorch-1.10py3.9 and python/tensorflow-2.7.0py3.9 (as of Jan 2022).

If you do not use these modules, note that cuda and cudnn are separate modules and that the login nodes do not have a GPU installed.

It is not recommended for security reasons to run TensorBoard on a multi-user system. TensorBoard does not come with any means of access control, and anyone with access to the multi-user system can attach to your TensorBoard port and act as you! (It might only take some effort to find the port if you do not use the default port.) There is nothing NHR@FAU can do to mitigate these security issues. Even the hint about --host localhost in https://github.com/tensorflow/tensorboard/issues/260#issuecomment-471737166 does not help on a multi-user system. The suggestion from https://github.com/tensorflow/tensorboard/issues/267#issuecomment-671820015 does not help either on a multi-user system. We patched the preinstalled TensorBoard version according to https://github.com/tensorflow/tensorboard/pull/5570 so that using a hash is enforced.

Instead, we recommend using TensorBoard on your local machine with the HPC file system mounted (e.g. via sshfs).
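A minimal sketch of that workflow on your local machine (mount point and log directory are placeholders; sshfs and TensorBoard must be installed locally):

mkdir -p ~/alex-work
sshfs alex.nhr.fau.de:/path/to/your/workdir ~/alex-work
tensorboard --logdir ~/alex-work/runs --host localhost
# when done: fusermount -u ~/alex-work   (Linux) or umount ~/alex-work (macOS)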

Arm DDT

Arm DDT is a powerful parallel debugger. NHR@FAU holds a license for 32 processes and 4 GPUs.

Amber

NHR@FAU holds a “compute center license” of Amber 20 and 22; thus, Amber is generally available to everyone for non-profit use, i.e. for academic research.

Amber usually delivers the most economic performance if only one GPGPU is used. The correct PMEMD binary then is pmemd.cuda.

The amber/20p12-at21p11-ompi-gnu-cuda11.5 module from 11/2021 contains the additional bug fix discussed in http://archive.ambermd.org/202110/0210.html / http://archive.ambermd.org/202110/0218.html.

Gromacs

We provide Gromacs versions without and with PLUMED. Gromacs (and PLUMED) are built using Spack.

Gromacs usually delivers the most economic performance if only one GPGPU is used together with the thread-MPI implementation of Gromacs (no mpirun needed; the number of processes is specified directly via the gmx mdrun command-line argument -ntmpi). Therefore, a “real” MPI version of Gromacs is only provided together with PLUMED. In that case the binary name is gmx_mpi and it must be started with srun or mpirun like any other MPI program.

Do not start gmx mdrun with the option -v. The verbose output will only create extra-large Slurm stdout files, and your jobs will suffer if the NFS servers are under high load. There is also very limited use in constantly seeing in the stdout when the job is expected to reach the specified number of steps.
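Putting these notes together, a single-GPU thread-MPI run could look like the following sketch (the -deffnm value md is just a placeholder; the thread count should match your job's per-GPU core allocation):

gmx mdrun -ntmpi 1 -ntomp 16 -deffnm md   # one thread-MPI rank on one GPU; no mpirun/srun, no -v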

LAMMPS

The modules lammps/20201029-gcc10.3.0-openmpi-mkl-cuda and lammps/20211027-gcc10.3.0-openmpi-mkl-cuda have been compiled using GCC 10.3.0, Intel oneAPI MKL, Open MPI 4.1.1, and with

  • GPU package API: CUDA; GPU package precision: mixed; for sm_80
  • KOKKOS package API: CUDA OpenMP Serial; KOKKOS package precision: double; for sm_80
  • Installed packages
    • for 20201029: ASPHERE BODY CLASS2 COLLOID COMPRESS CORESHELL DIPOLE GPU GRANULAR KIM KOKKOS KSPACE LATTE MANYBODY MC MISC MOLECULE MPIIO PERI POEMS PYTHON QEQ REPLICA RIGID SHOCK SNAP SPIN SRD USER-ATC USER-H5MD USER-LB USER-MEAMC USER-MISC USER-NETCDF USER-OMP USER-REAXC VORONOI
    • for 20211027: ASPHERE BODY CLASS2 COLLOID COMPRESS CORESHELL DIPOLE GPU GRANULAR KIM KOKKOS KSPACE LATTE MANYBODY MC MISC MOLECULE MPIIO PERI POEMS PYTHON QEQ REPLICA RIGID SHOCK SPIN SRD VORONOI

The LAMMPS binary is called just lmp.

Run module avail lammps to see all currently installed LAMMPS modules. Allocate an interactive job and run mpirun -np 1 lmp -help to see which LAMMPS packages have been included in a specific build.
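As an illustrative sketch (the input file name is a placeholder; -sf and -pk are the standard LAMMPS suffix/package switches for the GPU package mentioned above):

srun lmp -sf gpu -pk gpu 1 -in in.melt   # one GPU, GPU package with mixed precision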

NAMD

NAMD comes with a license which prohibits us from simply installing it for everyone to use. We therefore need individual users to print and sign the NAMD license. Subsequently, we will set the permissions accordingly.

At the moment, we provide the official pre-built Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration) binary.

VASP

VASP comes with a license which prohibits us from simply installing it for everyone to use. We have to check each VASP user individually.

At the moment we provide two different module variants of VASP 6.2.3 and 6.3.0 to eligible users:

  • vasp/6.x.y-nccl – NCCL stands for Nvidia Collective Communication Library and is basically a library for direct GPU-to-GPU communication. However, NCCL only allows one MPI rank per GPU. In 6.2.1 you can disable NCCL via the input file, but sadly the test suite will still fail.
  • vasp/6.x.y-nonccl – in certain cases, one MPI rank per GPU is not enough to saturate a single A100. When you use multiple ranks per GPU, you should also use the so-called MPS server. See “Multi-Process Service (MPS daemon)” above on how to start MPS even in the case of multiple GPUs.

VASP 6.3.0 has been compiled with the new HDF5 support.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but support for self-installed software cannot be granted. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to make use of the packages we already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias), you will inherit the pre-sets we defined for certain packages (e.g. Open MPI to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).
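A minimal sketch of that workflow (the package name is a placeholder):

module load user-spack
spack spec <package>      # check how the package would be concretized and what would be reused
spack install <package>   # builds into $WORK/USER-SPACK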

You can also bring your own environment in a container using Singularity (nowadays called Apptainer). However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Nvidia drivers from the host will automatically be mounted into your container. All file systems will also be available in the container by default. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different home directory, e.g. -H $HOME/my-container-home.
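For example (image and script names are placeholders; --nv makes the host's Nvidia driver available inside the container, -H sets the alternative home directory mentioned above):

singularity exec --nv -H $HOME/my-container-home my-image.sif python3 train.py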

File Systems

The following table summarizes the available file systems and their features. It is only an excerpt from the description of the HPC file system.

Further details will follow.

Quota in $HOME is very limited as snapshots are taken every 30 minutes. Put simulation data in $WORK! Do not rely on the specific path of $WORK, as it may change over time when your work directory is relocated to a different NFS server.

Batch processing

As with all production clusters at RRZE, resources are controlled through a batch system. The front ends can be used for compiling and very short serial test runs which do not require a GPU, but everything else has to go through the batch system to the cluster.

Alex uses SLURM as a batch system. Please see our general batch system description for further details.

For every batch job, you have to specify the number of GPUs that should be allocated to your job. For single-node jobs, the compute nodes are not allocated exclusively but are shared among several jobs – the GPUs themselves are always granted exclusively.  Resources are granted on a per-GPU basis. The corresponding share of the resources of the host system (CPU cores, RAM) is automatically allocated: for each A40 you get 16 CPU cores and 60 GB RAM on the host assigned; for each A100 you get 16 CPU cores and 120 GB RAM on the host assigned.

If your application is able to use more than one node and its corresponding GPUs efficiently, multi-node jobs are available on demand. In this case, the nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.

Partitions on the Alex GPGPU cluster (preliminary definition):

  • a40: walltime 0 – 24:00:00 (max. 6 h for interactive jobs); 1–8 GPUs per job, requested via --gres=gpu:a40:# (# = number of requested GPUs); automatically assigned per GPU: 16 cores and 60 GB host memory. Jobs run on a node with Nvidia A40 GPGPUs; the GPGPUs are exclusive, but the node may be shared and jobs are confined to their cgroup.
  • a100: walltime 0 – 24:00:00 (max. 6 h for interactive jobs); 1–8 GPUs per job, requested via --gres=gpu:a100:# (# = number of requested GPUs); automatically assigned per GPU: 16 cores and 120 GB host memory. Jobs run on a node with Nvidia A100 GPGPUs; the GPGPUs are exclusive, but the node may be shared and jobs are confined to their cgroup.

Multi-node jobs are only available on demand for NHR projects, not within the free Tier3/FAU access. If you are an NHR project and interested, contact us.

Example Slurm Batch Scripts

For the most common use cases, examples are provided below. Please see our general batch system description for further details. Note that these scripts possibly have to be adapted to your specific application and use case!

 

Interactive job (single-node)

Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.

The following will give you an interactive shell on one of the A40 nodes for one hour:

salloc --gres=gpu:a40:1 --partition=a40 --time=01:00:00

Note that settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

MPI parallel job (single-node)

In this example, the executable will be run using 16 MPI processes (i.e. one per physical core) for a total job walltime of 6 hours. The job allocates one A40 GPU and the corresponding share of CPUs and main memory (16 cores and 60 GB RAM) automatically.

#!/bin/bash -l
#
#SBATCH --ntasks=16
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:a40:1
#SBATCH --partition=a40
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

module load XXX
srun ./cuda_application

Hybrid MPI/OpenMP job (single-node)

In this example, one A100 GPU is allocated. The executable will be run using 2 MPI processes with 8 OpenMP threads each for a total job walltime of 6 hours. 16 cores are allocated in total and each OpenMP thread is running on a physical core.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=06:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

module load XXX
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./cuda_application

OpenMP job (single-node)

In this example, the executable will be run using 16 OpenMP threads for a total job walltime of 6 hours. One A100 GPU and the corresponding 16 cores are allocated in total.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l 
#
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:a100:1 
#SBATCH --partition=a100
#SBATCH --time=06:00:00 
#SBATCH --export=NONE 

unset SLURM_EXPORT_ENV 

module load XXX

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./cuda_application

Multi-node Job (available on demand for NHR projects)

In this case, your application has to be able to use more than one node and its corresponding GPUs at the same time. The nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.

Adjust the options --nodes and --ntasks-per-node to your application. Since the nodes are allocated exclusively, there is usually no need to specify --cpus-per-task for hybrid OpenMP/MPI applications. However, the correct value of $OMP_NUM_THREADS has to be set explicitly since OpenMP is not Slurm-aware!

 

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:a100:8
#SBATCH --partition=a100
#SBATCH --qos=a100multi
#SBATCH --time=01:00:00
#SBATCH --export=NONE 

unset SLURM_EXPORT_ENV

module load XXX
srun ./cuda_application

Attach to a running job

On the frontend node, the following steps are necessary:

  1. Check on which node the job is running with squeue.
  2. If you have only one job running on a node, you can use ssh <nodename> to connect to it. You will be placed in the allocation of the job.
  3. If you have multiple jobs running on a node, use srun --jobid=<jobid> --overlap --pty /bin/bash to attach to a specific job.

Attaching to a running job can be used e.g. to check GPU utilization via nvidia-smi. For more information on nvidia-smi and GPU profiling, see Working with NVIDIA GPUs.
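For example, once attached you could run the following (the refresh interval is arbitrary):

nvidia-smi               # one-off snapshot of GPU utilization and memory usage
watch -n 10 nvidia-smi   # refresh every 10 seconds; quit with Ctrl-C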

Further Information

AMD EPYC 7713 “Milan” Processor

Each node has two processor chips. The specs per processor chip are as follows:

  • # of CPU Cores: 64
  • # of Threads: 128 – hyperthreading (SMT) is disabled on Alex for security reasons; thus, threads and physical cores are identical
  • Max. Boost Clock: Up to 3.675 GHz
  • Base Clock: 2.0 GHz
  • Default TDP: 225W; AMD Configurable TDP (cTDP): 225-240W
  • Total L3 Cache: 256MB
  • System Memory Type: DDR4 @ 3,200 MHz
  • Memory Channels: 8 – these can be arranged in 1-4 ccNUMA domains (“NPS” setting); Alex is running with NPS=4
  • Theoretical per Socket Mem BW: 204.8 GB/s

Specs of an Nvidia A40 vs. A100 GPGPU

                                              A40                                            A100 (SXM)
GPU architecture                              Ampere; SM_86, compute_86                      Ampere; SM_80, compute_80
GPU memory                                    48 GB GDDR6 with ECC (ECC disabled on Alex)    40 GB HBM2 / 80 GB HBM2
Memory bandwidth                              696 GB/s                                       1,555 GB/s / 2,039 GB/s
Interconnect interface                        PCIe Gen4, 31.5 GB/s (bidirectional)           NVLink: 600 GB/s
CUDA Cores (Ampere generation)                10,752 (84 SMs)                                6,912 (108 SMs)
RT Cores (2nd generation)                     84                                             –
Tensor Cores (3rd generation)                 336                                            432
FP64 TFLOPS (non-Tensor)                      0.5                                            9.7
FP64 Tensor TFLOPS                            –                                              19.5
Peak FP32 TFLOPS (non-Tensor)                 37.4                                           19.5
Peak TF32 Tensor TFLOPS                       74.8                                           156
Peak FP16 Tensor TFLOPS with FP16 Accumulate  149.7                                          312
Peak BF16 Tensor TFLOPS with FP32 Accumulate  149.7                                          312
RT Core performance TFLOPS                    73.1                                           ?
Peak INT8 Tensor TOPS                         299.3                                          624
Peak INT4 Tensor TOPS                         598.7                                          1,248
Max power consumption                         300 W                                          400 W
Price                                         $$                                             $$$$

A40 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a40/proviz-print-nvidia-a40-datasheet-us-nvidia-1469711-r8-web.pdf (11/2021).

A100 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf (11/2021).

Nvidia A40 GPGPU nodes

[Photo: an open A40 node]

The Nvidia A40 GPGPUs (like the Geforce RTX 3080 consumer cards) belong to the Ampere generation. The native architecture is SM86 or SM_86, compute_86.

All eight A40 GPGPUs of a node are connected to two PCIe switches. Thus, there is only limited bandwidth to the host system and also between the GPGPUs.

“Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.” (according to https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32)

Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode:

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7   mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0   X    NODE NODE NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU1   NODE X    NODE NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU2   NODE NODE X    NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU3   NODE NODE NODE X    SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU4   SYS  SYS  SYS  SYS  X    NODE NODE NODE   NODE    NODE   64-127       1
GPU5   SYS  SYS  SYS  SYS  NODE X    NODE NODE   NODE    NODE   64-127       1
GPU6   SYS  SYS  SYS  SYS  NODE NODE X    NODE   PHB     PHB    64-127       1
GPU7   SYS  SYS  SYS  SYS  NODE NODE NODE X      NODE    NODE   64-127       1
mlx5_0 SYS  SYS  SYS  SYS  NODE NODE PHB  NODE   X       PIX    25 GBE
mlx5_1 SYS  SYS  SYS  SYS  NODE NODE PHB  NODE   PIX     X      (25 GbE not connected)

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge

Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode (current setting of Alex):

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7   mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0   X    SYS  SYS  SYS  SYS  SYS  SYS  SYS    SYS     SYS    48-63        3
GPU1   SYS  X    SYS  SYS  SYS  SYS  SYS  SYS    SYS     SYS    32-47        2
GPU2   SYS  SYS  X    SYS  SYS  SYS  SYS  SYS    SYS     SYS    16-31        1
GPU3   SYS  SYS  SYS  X    SYS  SYS  SYS  SYS    SYS     SYS     0-15        0
GPU4   SYS  SYS  SYS  SYS  X    SYS  SYS  SYS    SYS     SYS   112-127       7
GPU5   SYS  SYS  SYS  SYS  SYS  X    SYS  SYS    SYS     SYS    96-111       6
GPU6   SYS  SYS  SYS  SYS  SYS  SYS  X    SYS    PHB     PHB    80-95        5
GPU7   SYS  SYS  SYS  SYS  SYS  SYS  SYS  X      SYS     SYS    64-79        4
mlx5_0 SYS  SYS  SYS  SYS  SYS  SYS  PHB  SYS    X       PIX    25 GBE
mlx5_1 SYS  SYS  SYS  SYS  SYS  SYS  PHB  SYS    PIX     X      (25 GbE not connected)

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge

Nvidia A100 GPGPU nodes

[Photos: an open A100 node and the GPU board of an open A100 node]

The Nvidia A100 GPGPUs belong to the Ampere generation. The native architecture is SM80 or SM_80, compute_80.

All four or eight A100 GPGPUs of a node are directly connected with each other through an NVSwitch providing 600 GB/s GPU-to-GPU bandwidth for each GPGPU.

Topology of the quad A100 nodes according to nvidia-smi topo -m; no 25 GbE / HDR200 cards yet; AMD Rome processor in NPS=2 mode

     GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X    NV4  NV4  NV4  32-63        1
GPU1 NV4  X    NV4  NV4   0-31        0
GPU2 NV4  NV4  X    NV4  96-127       3
GPU3 NV4  NV4  NV4  X    64-95        2

Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
NV#  = Connection traversing a bonded set of # NVLinks

Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0   X    NV12 NV12 NV12 NV12 NV12 NV12 NV12  PXB     NODE   NODE   SYS      0-63        0
GPU1   NV12 X    NV12 NV12 NV12 NV12 NV12 NV12  PXB     NODE   NODE   SYS      0-63        0
GPU2   NV12 NV12 X    NV12 NV12 NV12 NV12 NV12  NODE    PXB    PXB    SYS      0-63        0
GPU3   NV12 NV12 NV12 X    NV12 NV12 NV12 NV12  NODE    PXB    PXB    SYS      0-63        0
GPU4   NV12 NV12 NV12 NV12 X    NV12 NV12 NV12  SYS     SYS    SYS    NODE    64-127       1
GPU5   NV12 NV12 NV12 NV12 NV12 X    NV12 NV12  SYS     SYS    SYS    NODE    64-127       1
GPU6   NV12 NV12 NV12 NV12 NV12 NV12 X    NV12  SYS     SYS    SYS    PXB     64-127       1
GPU7   NV12 NV12 NV12 NV12 NV12 NV12 NV12 X     SYS     SYS    SYS    PXB     64-127       1
mlx5_0 PXB  PXB  NODE NODE SYS  SYS  SYS  SYS   X       NODE   NODE   SYS    HDR200
mlx5_1 NODE NODE PXB  PXB  SYS  SYS  SYS  SYS   NODE    X      PIX   SYS     25 GbE
mlx5_2 NODE NODE PXB  PXB  SYS  SYS  SYS  SYS   NODE    PIX    X     SYS     (25 GbE not connected)
mlx5_3 SYS  SYS  SYS  SYS  NODE NODE PXB  PXB   SYS     SYS    SYS   X       HDR200

Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
NV#  = Connection traversing a bonded set of # NVLinks

Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode (current setting of Alex):

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0   X    NV12 NV12 NV12 NV12 NV12 NV12 NV12  PXB     SYS    SYS    SYS    48-63       3
GPU1   NV12 X    NV12 NV12 NV12 NV12 NV12 NV12  PXB     SYS    SYS    SYS    48-63       3
GPU2   NV12 NV12 X    NV12 NV12 NV12 NV12 NV12  SYS     PXB    PXB    SYS    16-31       1
GPU3   NV12 NV12 NV12 X    NV12 NV12 NV12 NV12  SYS     PXB    PXB    SYS    16-31       1
GPU4   NV12 NV12 NV12 NV12 X    NV12 NV12 NV12  SYS     SYS    SYS    SYS    112-127     7
GPU5   NV12 NV12 NV12 NV12 NV12 X    NV12 NV12  SYS     SYS    SYS    SYS    112-127     7
GPU6   NV12 NV12 NV12 NV12 NV12 NV12 X    NV12  SYS     SYS    SYS    PXB    80-95       5
GPU7   NV12 NV12 NV12 NV12 NV12 NV12 NV12 X     SYS     SYS    SYS    PXB    80-95       5
mlx5_0 PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS   X       SYS    SYS    SYS    HDR200
mlx5_1 SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS   SYS     X      PIX    SYS    25 GbE
mlx5_2 SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS   SYS     PIX    X      SYS    (25 GbE not connected)
mlx5_3 SYS  SYS  SYS  SYS  SYS  SYS  PXB  PXB   SYS     SYS    SYS    X      HDR200

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

 
