Alex GPGPU cluster (NHR+Tier3)

FAU’s Alex cluster (system integrator: Megware) is a high-performance compute resource with Nvidia GPGPU accelerators and a high-speed interconnect on part of the nodes. It is intended for single- and multi-GPGPU workloads, e.g. from molecular dynamics or machine learning. Alex serves both as FAU’s basic Tier3 resource and as an NHR project resource.

  • 2 front end nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB Shared L3 cache per chip, 512 GB of RAM, and 100 GbE connection to RRZE’s network backbone but no GPGPUs.
  • NOT YET PART OF ALEX BUT STILL IN TINYGPU – 8 GPGPU nodes, each with two AMD EPYC 7662 “Rome” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 512 GB of DDR4-RAM, four Nvidia A100 (each 40 GB HBM2 @ 1,555 GB/s; DGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), one HDR200 Infiniband HCA, 25 GbE, and 6 TB on local NVMe SSDs. (During 2021, these nodes have been part of TinyGPU.)
  • 20 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB Shared L3 cache per chip, 1,024 GB of DDR4-RAM, eight Nvidia A100 (each 40 GB HBM2 @ 1,555 GB/s; HGX board with NVLink; 9.7 TFlop/s in FP64 or 19.5 TFlop/s in FP32), two HDR200 Infiniband HCAs, 25 GbE, and 14 TB on local NVMe SSDs.
  • 38 GPGPU nodes, each with two AMD EPYC 7713 “Milan” processors (64 cores per chip) running at 2.0 GHz with 256 MB shared L3 cache per chip, 512 GB of DDR4-RAM, eight Nvidia A40 (each with 48 GB GDDR6 @ 696 GB/s; 37.4 TFlop/s in FP32), 25 GbE, and 7 TB on local NVMe SSDs.

In total there are 192 Nvidia A100 and 304 Nvidia A40 GPGPUs. The Nvidia A40 GPGPUs have a very high single-precision floating-point performance and are much less expensive than Nvidia A100 GPGPUs. Workloads that only require single-precision floating-point operations, like many molecular dynamics applications, should therefore target the Nvidia A40 GPGPUs. Alex complements RRZE’s TinyGPU cluster: Alex addresses high-end GPGPU workloads, while TinyGPU mainly comes with consumer GPUs of different generations which nevertheless provide an unbeatable price-performance ratio for single-precision applications that require only little GPU memory.

The name “Alex” is a play on the name of FAU’s early benefactor Alexander, Margrave of Brandenburg-Ansbach (1736-1806).

Alex is not yet ready for use!

All documentation is preliminary and subject to change.

This website shows information regarding the following topics:

Access, User Environment, and File Systems

Access to the machine

Note that access to Alex is not yet open. If you want to be among the first to get access to Alex once early operation starts, you need to provide a short description of what you want to do there: https://hpc.fau.de/early-adopter-alex/

Users can connect to alex.nhr.fau.de (keep the “nhr” instead of “rrze” in mind!) and will be randomly routed to one of the two front ends. All systems in the cluster, including the front ends, have private IPv4 addresses in the 10.28.52.0/23 range and IPv6 addresses in the 2001:638:a000:3952::/64 range. They can normally only be accessed directly from within the FAU networks. There is one exception: if your internet connection supports IPv6, you can directly ssh to the front ends (but not to the compute nodes). Otherwise, if you need access from outside of FAU, you usually have to connect to the dialog server cshpc.rrze.fau.de first and then ssh to alex.nhr.fau.de from there.
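
For example, the two-hop login from outside of FAU looks like this (yourusername is a placeholder for your HPC account name):

ssh yourusername@cshpc.rrze.fau.de   # dialog server, reachable from the internet
ssh alex.nhr.fau.de                  # hop on to one of the Alex front ends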

SSH public host keys of alex.nhr.fau.de (as of 11/2021)

ssh-dss AAAAB3NzaC1kc3MAAACBAO/DMbHuyYO6vWXgoeFgaVXFIbg6vldW3ViGOJSd/yVopqhB/fxdp4z1SioML9YOSNepr58xpgoFXFpM+DgRgwcIMBYbV3CeyPYoF4ZAvVwkQLGZh5zmn1Zxd6U3B49aZaEYnItRO6VKGW/Bm6cKY3H+FW5NUa8u+CQOjbjCmixBAAAAFQDpdsCURZAgCd8durljTJHF2AMR+wAAAIAWxlbOXYcMdgmYWE7Af3CyKysbaC1whHNiWOK3v4b0HEZ3CWQe50rrZWDzTKyal0AkncghPMusz5hqZCbZC3DrAParSTwk8RGXsbRm6O/cF3JBKP6IhIBvc8kEVaqFeyDuFwMXwzwQU8x4esAkIu+GDCiCADlhiGSf2Uw6pEds+gAAAIEAgVxOFD9eFM+pDMSw/NlyVXSVA512uC4/JnrHDfY+6SjhuOcfd5JOWjDNYxKO0xPruj0H/TpAI+h90/yUHUff9F/g8rPg9S55DtsUyJHY8B9mm7/mKnJfcT68EBheH00Vl4yGLFu6q4mmfwoHVSoV7QikVj5vOlmWOVMjJen8NKc= alex.nhr.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCIUVUt/y9dOrXP3aZ3meUF8s77/d+sk/F31tMnw2TNL4mk6J5Ylk2SOtDL7GTCrxmj3/RXMrrPCKO8FDJR2SzE= alex.nhr.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINPXIVMupI341xGq6Gb5agSqurqTSssyBORWKx3wAU0p alex.nhr.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDMOn7p0mPhTeZndNjnLIF7RKeA+WaXz4vJ0lFEo7cpXV9I5AmbduM/GkzEdGNAvVgWmcYtW3R53R23c1eikSFFx6aUaK1rb0kp2SYlh+JUvXRLIg+oIK47Do3lQ7qDas1Q7U9wssHr1wrs5g6dsQj+v7UFJcCAqcAz4KfxaJrG8MkYpI0P38TSe3p39+ObDv+NoBKobHgR9kyYGx5tgLC8YFakBoBkgUJvgIEVBsSz4InPQfZjFchw31+wYgeuQykLA7OpE3kHbPv8WlXf+n9Rt0fguGJnLJcGT1WzdeG1Y7njDC6mj92pNUoLr8KvoE7Qq/i5Wt3PAOWP4/lUywpbPVPso5z8h6vo99mhdg3N/zs8sL5jEfCWGyGAvoxGI91JxDBFE9GJTNwI6nrFx9Qb2lw8JUnO6L/yPj2dBKd3zdAgikg6Wh8NqA0Gb9RbWGk6zidsO1y8mvrg9y1r20MXkFYsHMMrcslym2yvRVj2zJeLOPDA8S4knsY9UGudE8E= alex.nhr.fau.de

fingerprints of the SSH public host keys of alex.nhr.fau.de (as of 11/2021)

1024 SHA256:4f7EsRXG9U6L7ZnGfQYV+IFLsF1l7wfGrf9zGZMgl7A alex.nhr.fau.de (DSA)
256 SHA256:0lZv5WzJGvZkdP+zGZY9bKhPucKyhLQIkJzsC9y0T00 alex.nhr.fau.de (ECDSA)
256 SHA256:53K9MoZ920hbooWthNUv84ubES6kpxjkVSiy0kcoYc8 alex.nhr.fau.de (ED25519)
3072 SHA256:kA0Or+7QAuRikKp6MzDQNeAxg1j4NV/1hbkp9IlKJXw alex.nhr.fau.de (RSA)

SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

ssh-dss AAAAB3NzaC1kc3MAAACBAO2L8+7bhJm7OvvJMcdGSJ5/EaxvX5RRzE9RrB8fx5H69ObkqC6Baope4rOS9/+2gtnm8Q3gZ5QkostCiKT/Wex0kQQUmKn3fx6bmtExLq8YwqoRXRmNTjBIuyZuZH9w/XFK36MP63p/8h7KZXvkAzSRmNVKWzlsAg5AcTpLSs3ZAAAAFQCD0574+lRlF0WONMSuWeQDRFM4vwAAAIEAz1nRhBHZY+bFMZKMjuRnVzEddOWB/3iWEpJyOuyQWDEWYhAOEjB2hAId5Qsf+bNhscAyeKgJRNwn2KQMA2kX3O2zcfSdpSAGEgtTONX93XKkfh6JseTiFWos9Glyd04jlWzMbwjdpWvwlZjmvPI3ATsv7bcwHji3uA75PznVUikAAACBANjcvCxlW1Rjo92s7KwpismWfcpVqY7n5LxHfKRVqhr7vg/TIhs+rAK1XF/AWxyn8MHt0qlWxnEkbBoKIO5EFTvxCpHUR4TcHCx/Xkmtgeq5jWZ3Ja2bGBC3b47bHHNdDJLU2ttXysWorTXCoSYH82jr7kgP5EV+nPgwDhIMscpk cshpc.rrze.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNVzp97t3CxlHtUiJ5ULqc/KLLH+Zw85RhmyZqCGXwxBroT+iK1Quo1jmG6kCgjeIMit9xQAHWjS/rxrlI10GIw= cshpc.rrze.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPSIFF3lv2wTa2IQqmLZs+5Onz1DEug8krSrWM3aCDRU cshpc.rrze.fau.de
1024 35 135989634870042614980757742097308821255254102542653975453162649702179684202242220882431712465065778248253859082063925854525619976733650686605102826383502107993967196649405937335020370409719760342694143074628619457902426899384188195801203193251135968431827547590638365453993548743041030790174687920459410070371 cshpc.rrze.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAs0wFVn1PN3DGcUtd/JHsa6s1DFOAu+Djc1ARQklFSYmxdx5GNQMvS2+SZFFa5Rcw+foAP9Ks46hWLo9mOjTV9AwJdOcSu/YWAhh+TUOLMNowpAEKj1i7L1Iz9M1yrUQsXcqDscwepB9TSSO0pSJAyrbuGMY7cK8m6//2mf7WSxc= cshpc.rrze.fau.de

fingerprints of the SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

1024 SHA256:A82eA7py46zE/TrSTCRYnJSW7LZXY16oOBxstJF3jxU root@wtest05 (DSA)
256 SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU root@cshpc (ECDSA)
256 SHA256:is52MRsxMgxHFn58o0ZUh8vCzIuE2gYanmhrxdy0rC4 root@cshpc (ED25519)
1024 SHA256:Za1mKhTRFDXUwn7nhPsWc7py9a6OHqS2jin01LJC3ro root@wtest05 (RSA)

While it is possible to ssh directly to a compute node, users are only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically. If you have multiple batch jobs running on the same node, the ssh process will be added to the cgroup of one automatically selected job and only the GPUs of that job will be visible.

Software environment

The login and compute nodes run AlmaLinux 8 (which is basically Red Hat Enterprise Linux 8 without the support).

The login shell for all users on Alex is always bash and cannot be changed.

As on many other HPC systems, environment modules are used to facilitate access to software packages. Type “module avail” to get a list of available packages. Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack” as an enhanced HPC package manager; this includes the CUDA toolkit and the cuDNN library. Only the Nvidia device driver is installed as part of the operating system.
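
A typical module workflow looks like this (the module names below are examples; check “module avail” for what is actually installed):

module avail                    # list the packages visible by default
module load 000-all-spack-pkgs  # example: make all Spack-installed packages visible (pick one of the available versions)
module avail                    # now shows additional packages
module load cuda                # example: load the CUDA toolkit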

General notes on how to use certain software on our systems (including in some cases sample job scripts) can be found on the Special applications, and tips & tricks pages. Specific notes on how some software provided via modules on the Alex cluster has been compiled can be found in the following accordion:

Intel oneAPI is installed in the “Free User” edition via Spack.

The modules intel and intel-oneapi-compilers provide the legacy Intel compilers icc, icpc, and ifort as well as the new LLVM-based ones (icx, icpx, dpcpp, ifx).

The modules intelmpi and intel-oneapi-mpi provide Intel MPI. To use the legacy Intel compilers with Intel MPI, just use the appropriate wrappers with the Intel compiler names, i.e. mpiicc, mpiicpc, mpiifort. To use the new LLVM-based Intel compilers with Intel MPI, you have to specify them explicitly, i.e. use mpiicc -cc=icx, mpiicpc -cxx=icpx, or mpiifort -fc=ifx. Running mpicc, mpicxx, or mpif90 results in using the GNU compilers.
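
A short sketch of the wrapper usage described above (hello.c and hello.f90 are placeholder source files):

module load intel intelmpi

mpiicc -O2 -o hello hello.c              # Intel MPI with the legacy icc compiler
mpiicc -cc=icx -O2 -o hello hello.c      # Intel MPI with the LLVM-based icx compiler
mpiifort -fc=ifx -O2 -o hello hello.f90  # Intel MPI with the LLVM-based ifx compiler
mpicc -O2 -o hello hello.c               # Intel MPI with the GNU compilers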

The modules mkl, tbb, intel-oneapi-mkl, and intel-oneapi-tbb provide Intel MKL and TBB. Use Intel’s MKL link line advisor to figure out the appropriate command line for linking with MKL. The Intel MKL also includes drop-in wrappers for FFTW3.

Further Intel tools may be added in the future.

The CUDA compilers are part of the cuda modules.

The Nvidia (formerly PGI) compilers are part of the nvhpc modules.

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.

TODO – more details

MPS_DIR=$(mktemp -d -u mps-XXXXXXXX)
# startup MPS
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
    echo starting mps server for $GPU
    mkdir ${MPS_DIR}-$GPU
    mkdir ${MPS_DIR}-log-$GPU
    export CUDA_VISIBLE_DEVICES=$GPU
    export CUDA_MPS_PIPE_DIRECTORY=${MPS_DIR}-$GPU
    export CUDA_MPS_LOG_DIRECTORY=${MPS_DIR}-log-$GPU
    nvidia-cuda-mps-control -d
done
# do your work
...
# cleanup MPS
for GPU in `nvidia-smi --format=csv,noheader --query-gpu=uuid`; do
    echo stopping mps server for $GPU
    export CUDA_MPS_PIPE_DIRECTORY=${MPS_DIR}-$GPU
    echo 'quit' | nvidia-cuda-mps-control
    rm -rf ${MPS_DIR}-$GPU
    rm -rf ${MPS_DIR}-log-$GPU
done

Open MPI is the default MPI for the Alex cluster. Usage of srun instead of mpirun is recommended. (TO BE CONFIRMED)

Open MPI is built using Spack:

  • with the compiler mentioned in the module name; the corresponding compiler will be loaded as a dependency when the Open MPI module is loaded
  • with support for CUDA (cuda/11.5 as of 11/2021)
  • without support for thread-multiple
  • with fabrics=ofi
  • with support for Slurm as scheduler (and internal PMIx of Open MPI)

TBD

Amber is currently only available to eligible groups. We’ll upgrade to a compute center license in 2022 to make Amber generally available.

Amber usually delivers the most economic performance if only one GPGPU is used. The correct PMEMD binary then is pmemd.cuda.

The amber/20p12-at21p11-ompi-gnu-cuda11.5 module from 11/2021 contains the additional bug fix discussed in http://archive.ambermd.org/202110/0210.html / http://archive.ambermd.org/202110/0218.html.
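
A minimal single-GPU job script sketch for Amber; the gres/walltime lines follow the preliminary batch configuration described below, and the input/output file names are placeholders:

#!/bin/bash
#SBATCH --gres=gpu:a40:1      # one GPGPU is usually the most economic choice for Amber
#SBATCH --time=10:00:00

module load amber/20p12-at21p11-ompi-gnu-cuda11.5

# single-GPU PMEMD run; md.in, prmtop, inpcrd etc. are placeholder file names
pmemd.cuda -O -i md.in -o md.out -p prmtop -c inpcrd -r restrt -x mdcrd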

We provide Gromacs versions without and with PLUMED. Gromacs (and PLUMED) are built using Spack.

Gromacs usually delivers the most economic performance if only one GPGPU is used together with the thread-MPI implementation of Gromacs (no mpirun needed; the number of processes is specified directly using the gmx mdrun command line argument -ntmpi); see the sketch below. Therefore, a “real” MPI version of Gromacs is only provided together with PLUMED. In that case, the binary name is gmx_mpi and it must be started with srun or mpirun like any other MPI program.
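
A sketch of a single-GPU thread-MPI run (the module name and thread counts are examples; adjust them to the installed modules and to the cores assigned to your job):

module load gromacs/2021.3     # example module name; see “module avail gromacs”
# one thread-MPI rank on one GPU, 16 OpenMP threads (placeholder values)
gmx mdrun -ntmpi 1 -ntomp 16 -deffnm md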

At the moment, two different Gromacs modules are provided for 2021.3: one using FFTW3 and one using Intel MKL, as indicated in the module names. We are not yet sure if there is a performance difference for most real-world inputs. One module may disappear in the future without further notice.

The modules lammps/20211027* have been compiled using GCC 10.3.0, Intel oneAPI MKL, Open MPI 4.1.1, and with

  • GPU package API: CUDA; GPU package precision: mixed; for sm_80 (module suffix -a100) or sm_86 (module suffix -a40)
  • KOKKOS package API: CUDA OpenMP Serial; KOKKOS package precision: double; for sm_80 or sm_86
  • Installed packages: ASPHERE BODY CLASS2 COLLOID COMPRESS CORESHELL DIPOLE GPU GRANULAR KIM KOKKOS KSPACE LATTE MANYBODY MC MISC MOLECULE MPIIO PERI POEMS PYTHON QEQ REPLICA RIGID SHOCK SPIN SRD VORONOI

NAMD comes with a license which prohibits us from “just installing it so that everyone can use it”. We therefore need individual users to print and sign the NAMD license. Subsequently, we will set the permissions accordingly.

At the moment, we provide the official pre-built Linux-x86_64-multicore-CUDA (NVIDIA CUDA acceleration) binary.

VASP comes with a license which prohibits us from “just installing it so that everyone can use it”. We have to individually check each VASP user.

At the moment we provide two different VASP 6.2.3 modules to eligible users:

  • vasp/6.2.3-nccl – NCCL stands for Nvidia Collective Communication Library and is basically a library for direct GPU-to-GPU communication. However, NCCL only allows one MPI rank per GPU. In 6.2.1 you can disable NCCL via the input file, but sadly the test suite will still fail.
  • vasp/6.2.3-nonccl – in certain cases, one MPI rank per GPU is not enough to saturate a single A100. When you use multiple ranks per GPU, you should also use the so-called MPS server. See “Multi-Process Service (MPS daemon)” above on how to start MPS even in the case of multiple GPUs.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but support for self-installed software cannot be provided. We can only provide software centrally if it is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to make use of the packages we already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias); you will inherit the presets we defined for certain packages (e.g. Open MPI to work with Slurm), but you’ll install everything into your own directories ($WORK/USER-SPACK).
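
A short sketch of the user-spack workflow (the package spec is only an example):

module load user-spack
spack find                  # list packages that are already provided centrally
spack install fftw@3.3.10   # example spec; reuses matching central builds and installs the rest into $WORK/USER-SPACK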

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The Nvidia drivers from the host will automatically be mounted into your container. All file systems will also be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.
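
A usage sketch (my-image.sif and the command inside the container are placeholders):

# run a command inside a container image, using a separate directory as the container’s $HOME
singularity exec -H $HOME/my-container-home my-image.sif python3 train.py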

File Systems

The following table summarizes the available file systems and their features. It is only an excerpt from the description of the HPC file system.

Further details will follow once Alex is open for users.

Batch processing

As with all production clusters at RRZE, resources are controlled through a batch system. The front ends can be used for compiling and very short serial test runs which do not require a GPU, but everything else has to go through the batch system to the cluster.

Alex uses SLURM as a batch system. Please see our general batch system description for further details.

The granularity of batch allocations is individual GPGPUs, i.e. GPGPUs are never shared.

The following queues are available on this cluster:

Details on the queue configuration will follow once Alex is open for users.

Partitions on the Alex GPGPU cluster (preliminary definition)
Partition  min – max walltime  min – max GPUs  --gres (# = number of GPUs requested)  assigned host memory  availability    Comments
any        0 – 24:00:00        1 – 8           --gres=gpu:#                           60 GB per GPU         always          Jobs run either on a node with Nvidia A40 or A100 GPGPUs.
a40        0 – 24:00:00        1 – 8           --gres=gpu:a40:#                       60 GB per GPU         always          Jobs run on a node with Nvidia A40 GPGPUs; the GPGPUs are exclusive but the node may be shared and jobs are confined to their cgroup.
a100       0 – 24:00:00        1 – 8           --gres=gpu:a100:#                      120 GB per GPU        always          Jobs run on a node with Nvidia A100 GPGPUs; the GPGPUs are exclusive but the node may be shared and jobs are confined to their cgroup.
a100multi  0 – 24:00:00        16 – 32         --gres=gpu:a100:#                      1000 GB per node      on demand only  Multi-node jobs on Nvidia A100; the requested number of GPUs must be a multiple of 8. Nodes and GPGPUs are exclusive.
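
A minimal job script sketch based on the preliminary partition table above; the exact partition names and --gres syntax may still change before Alex opens for users:

#!/bin/bash
#SBATCH --gres=gpu:a40:2      # request two A40 GPGPUs (preliminary syntax, see table above)
#SBATCH --partition=a40       # may become optional once gres-based routing is implemented
#SBATCH --time=06:00:00       # must stay within the 24:00:00 walltime limit

./my_gpu_application          # placeholder; replace with your actual application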

TO BE DECIDED – the “any” partition might not be necessary as multiple partitions can be requested in Slurm out of the box.

NOT IMPLEMENTED YET – There will be routing based on the --gres=gpu:* specification. Thus, it should not be necessary to explicitly specify a partition via --partition=....

NOT IMPLEMENTED YET – Single GPU jobs of up to 2 hours will automatically take advantage of resources reserved for short running jobs.

Multi-node jobs will not be supported initially. Contact us if you would be interested.

Further Information

AMD EPYC 7713 “Milan” Processor

Each node has two processor chips. The specs per processor chip are as follows:

  • # of CPU Cores: 64
  • # of Threads: 128 – hyperthreading is disabled on Alex for security reasons; thus, threads and physical cores are identical
  • Max. Boost Clock: Up to 3.675 GHz
  • Base Clock: 2.0 GHz
  • Default TDP: 225W; AMD Configurable TDP (cTDP): 225-240W
  • Total L3 Cache: 256MB
  • System Memory Type: DDR4 @ 3,200 MHz
  • Memory Channels: 8 – these can be arranged in 1-4 ccNUMA domains (“NPS” setting); Alex is running with NPS=4 (TBD)
  • Theoretical per Socket Mem BW: 204.8 GB/s

Specs of an Nvidia A40 vs. A100 GPGPU

                                              A40                                           A100 (SXM)
GPU architecture                              Ampere; SM_86, compute_86                     Ampere; SM_80, compute_80
GPU memory                                    48 GB GDDR6 with ECC (ECC disabled on Alex)   40 GB HBM2
Memory bandwidth                              696 GB/s                                      1,555 GB/s
Interconnect interface                        PCIe Gen4 31.5 GB/s (bidirectional)           NVLink: 600 GB/s
CUDA Cores (Ampere generation)                10,752                                        6,912
RT Cores (2nd generation)                     84                                            –
Tensor Cores (3rd generation)                 336                                           432
FP64 TFLOPS (non-Tensor)                      –                                             9.7
FP64 Tensor TFLOPS                            –                                             19.5
Peak FP32 TFLOPS (non-Tensor)                 37.4                                          19.5
Peak TF32 Tensor TFLOPS                       74.8                                          156
Peak FP16 Tensor TFLOPS with FP16 Accumulate  149.7                                         312
Peak BF16 Tensor TFLOPS with FP32 Accumulate  149.7                                         312
RT Core performance TFLOPS                    73.1                                          ?
Peak INT8 Tensor TOPS                         299.3                                         624
Peak INT4 Tensor TOPS                         598.7                                         1,248
Max power consumption                         300 W                                         400 W
Price                                         $$                                            $$$$

A40 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a40/proviz-print-nvidia-a40-datasheet-us-nvidia-1469711-r8-web.pdf (11/2021).

A100 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf (11/2021).

Nvidia A40 GPGPU nodes

Photo of an A40 node

Photo of an open A40 node.

The Nvidia A40 GPGPUs (like the Geforce RTX 3080 consumer cards) belong to the Ampere generation. The native architecture is SM86 or SM_86, compute_86.

All eight A40 GPGPUs of a node are connected to two PCIe switches. Thus, there is only limited bandwidth to the host system and also between the GPGPUs.

“Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.” (according to https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32)
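
For example, a CUDA source file can be compiled for both GPU generations in Alex by generating native code for sm_80 (A100) and sm_86 (A40); this is a generic sketch, and my_kernel.cu is a placeholder source file:

module load cuda
# build a fat binary with native code for both A100 (sm_80) and A40 (sm_86)
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -O3 -o my_kernel my_kernel.cu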

Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode:

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7   mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0   X    NODE NODE NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU1   NODE X    NODE NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU2   NODE NODE X    NODE SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU3   NODE NODE NODE X    SYS  SYS  SYS  SYS    SYS     SYS     0-63        0
GPU4   SYS  SYS  SYS  SYS  X    NODE NODE NODE   NODE    NODE   64-127       1
GPU5   SYS  SYS  SYS  SYS  NODE X    NODE NODE   NODE    NODE   64-127       1
GPU6   SYS  SYS  SYS  SYS  NODE NODE X    NODE   PHB     PHB    64-127       1
GPU7   SYS  SYS  SYS  SYS  NODE NODE NODE X      NODE    NODE   64-127       1
mlx5_0 SYS  SYS  SYS  SYS  NODE NODE PHB  NODE   X       PIX    25 GbE
mlx5_1 SYS  SYS  SYS  SYS  NODE NODE PHB  NODE   PIX     X      (25 GbE not connected)

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge

Topology of the octo A40 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode:

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7   mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0   X    SYS  SYS  SYS  SYS  SYS  SYS  SYS    SYS     SYS    48-63        3
GPU1   SYS  X    SYS  SYS  SYS  SYS  SYS  SYS    SYS     SYS    32-47        2
GPU2   SYS  SYS  X    SYS  SYS  SYS  SYS  SYS    SYS     SYS    16-31        1
GPU3   SYS  SYS  SYS  X    SYS  SYS  SYS  SYS    SYS     SYS     0-15        0
GPU4   SYS  SYS  SYS  SYS  X    SYS  SYS  SYS    SYS     SYS   112-127       7
GPU5   SYS  SYS  SYS  SYS  SYS  X    SYS  SYS    SYS     SYS    96-111       6
GPU6   SYS  SYS  SYS  SYS  SYS  SYS  X    SYS    PHB     PHB    80-95        5
GPU7   SYS  SYS  SYS  SYS  SYS  SYS  SYS  X      SYS     SYS    64-79        4
mlx5_0 SYS  SYS  SYS  SYS  SYS  SYS  PHB  SYS    X       PIX    25 GbE
mlx5_1 SYS  SYS  SYS  SYS  SYS  SYS  PHB  SYS    PIX     X      (25 GbE not connected)

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge

Nvidia A100 GPGPU nodes

The Nvidia A100 GPGPUs belong to the Ampere generation. The native architecture is SM80 or SM_80, compute_80.

All four or eight A100 GPGPUs of a node are directly connected with each other through an NVSwitch providing 600 GB/s GPU-to-GPU bandwidth for each GPGPU.

Topology of the quad A100 nodes according to nvidia-smi topo -m; no 25 GbE / HDR200 cards yet; AMD Rome processor in NPS=2 mode

     GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X    NV4  NV4  NV4  32-63        1
GPU1 NV4  X    NV4  NV4   0-31        0
GPU2 NV4  NV4  X    NV4  96-127       3
GPU3 NV4  NV4  NV4  X    64-95        2

Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
NV#  = Connection traversing a bonded set of # NVLinks

Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=1 mode

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0   X    NV12 NV12 NV12 NV12 NV12 NV12 NV12  PXB     NODE   NODE   SYS      0-63        0
GPU1   NV12 X    NV12 NV12 NV12 NV12 NV12 NV12  PXB     NODE   NODE   SYS      0-63        0
GPU2   NV12 NV12 X    NV12 NV12 NV12 NV12 NV12  NODE    PXB    PXB    SYS      0-63        0
GPU3   NV12 NV12 NV12 X    NV12 NV12 NV12 NV12  NODE    PXB    PXB    SYS      0-63        0
GPU4   NV12 NV12 NV12 NV12 X    NV12 NV12 NV12  SYS     SYS    SYS    NODE    64-127       1
GPU5   NV12 NV12 NV12 NV12 NV12 X    NV12 NV12  SYS     SYS    SYS    NODE    64-127       1
GPU6   NV12 NV12 NV12 NV12 NV12 NV12 X    NV12  SYS     SYS    SYS    PXB     64-127       1
GPU7   NV12 NV12 NV12 NV12 NV12 NV12 NV12 X     SYS     SYS    SYS    PXB     64-127       1
mlx5_0 PXB  PXB  NODE NODE SYS  SYS  SYS  SYS   X       NODE   NODE   SYS    HDR200
mlx5_1 NODE NODE PXB  PXB  SYS  SYS  SYS  SYS   NODE    X      PIX   SYS     25 GbE
mlx5_2 NODE NODE PXB  PXB  SYS  SYS  SYS  SYS   NODE    PIX    X     SYS     (25 GbE not connected)
mlx5_3 SYS  SYS  SYS  SYS  NODE NODE PXB  PXB   SYS     SYS    SYS   X       HDR200

Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
NV#  = Connection traversing a bonded set of # NVLinks

Topology of the octo A100 nodes according to nvidia-smi topo -m; AMD Milan processor in NPS=4 mode

       GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0   X    NV12 NV12 NV12 NV12 NV12 NV12 NV12  PXB     SYS    SYS    SYS    48-63       3
GPU1   NV12 X    NV12 NV12 NV12 NV12 NV12 NV12  PXB     SYS    SYS    SYS    48-63       3
GPU2   NV12 NV12 X    NV12 NV12 NV12 NV12 NV12  SYS     PXB    PXB    SYS    16-31       1
GPU3   NV12 NV12 NV12 X    NV12 NV12 NV12 NV12  SYS     PXB    PXB    SYS    16-31       1
GPU4   NV12 NV12 NV12 NV12 X    NV12 NV12 NV12  SYS     SYS    SYS    SYS    112-127     7
GPU5   NV12 NV12 NV12 NV12 NV12 X    NV12 NV12  SYS     SYS    SYS    SYS    112-127     7
GPU6   NV12 NV12 NV12 NV12 NV12 NV12 X    NV12  SYS     SYS    SYS    PXB    80-95       5
GPU7   NV12 NV12 NV12 NV12 NV12 NV12 NV12 X     SYS     SYS    SYS    PXB    80-95       5
mlx5_0 PXB  PXB  SYS  SYS  SYS  SYS  SYS  SYS   X       SYS    SYS    SYS    HDR200
mlx5_1 SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS   SYS     X      PIX    SYS    25 GbE
mlx5_2 SYS  SYS  PXB  PXB  SYS  SYS  SYS  SYS   SYS     PIX    X      SYS    (25 GbE not connected)
mlx5_3 SYS  SYS  SYS  SYS  SYS  SYS  PXB  PXB   SYS     SYS    SYS    X      HDR200

Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks