Fritz parallel cluster (NHR+Tier3)

FAU’s Fritz cluster (system integrator: Megware) is a high-performance compute resource with a high-speed interconnect, i.e., a parallel computer. It is intended for multi-node parallel workloads. Fritz serves both as FAU’s basic Tier3 resource and as a resource for NHR projects.

  • 4 front end nodes with the same CPUs as the compute nodes but 512 GB of RAM and a 100 GbE connection to RRZE’s network backbone.
  • 1 visualization node with the same CPUs as the compute nodes but 1024 GB of RAM, one NVIDIA A16 GPU, 30 TB of local NVMe SSD storage, and a 100 GbE connection to RRZE’s network backbone. Contact us if you have a need for remote visualization!
  • 944 compute nodes with direct liquid cooling (DLC), each with two Intel Xeon Platinum 8360Y “Ice Lake” processors (36 cores and 54 MB shared L3 cache per chip) running at a base frequency of 2.4 GHz, and 256 GB of DDR4-RAM.
  • Lustre-based parallel file system with a capacity of about 3.5 PB and an aggregated parallel I/O bandwidth of > 20 GB/s.
  • Blocking HDR100 InfiniBand with up to 100 Gbit/s bandwidth per link and direction. There are islands with 64 nodes (i.e. 4,608 cores). The blocking factor between islands is 1:4. Note: a subset of nodes (racks 8-11) is still missing InfiniBand due to the world-wide shortage of IT supplies.
  • Measured LINPACK performance of 1.84 PFlop/s on 512 nodes in April 2022 and 2.233 PFlop/s on 612 nodes in May 2022. The full-system HPL still has to be measured; expect 3+ PFlop/s.

The name “Fritz” is a play on the name of FAU’s founder Friedrich, Margrave of Brandenburg-Bayreuth (1711-1763).

Fritz has been financed by:

  • German Research Foundation (DFG) as part of INST 90/1171-1 (440719683),
  • NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively),
  • and financial support from FAU to strengthen HPC activities.

This page provides information on the following topics:

  • Access, User Environment, File Systems
    • Access to the machine
    • File systems
    • Batch processing
  • Further Information
    • technical data of the Intel Xeon Platinum 8360Y “Ice Lake” processor
    • network topology
    • direct liquid cooling (DLC) of the compute nodes

Access, User Environment, and File Systems

Access to the machine

Note that FAU HPC accounts are not automatically enabled for Tier3 access to Fritz. To request Tier3 access, you additionally need to provide a short description of what you intend to do there: https://hpc.fau.de/tier3-access-to-fritz/.

The rules for NHR access are described on our page on NHR application rules.

Users can connect to fritz.nhr.fau.de (keep the “nhr” instead of “rrze” in mind!) and will be randomly routed to one of the four front ends. All systems in the cluster, including the front ends, have private IPv4 addresses in the 10.28.64.0/20 range and IPv6 addresses in the 2001:638:a000:3964::/64 range. They can normally only be accessed directly from within the FAU networks. There is one exception: if your internet connection supports IPv6, you can ssh directly to the front ends (but not to the compute nodes). Otherwise, if you need access from outside of FAU, you usually have to connect to the dialog server cshpc.rrze.fau.de first and then ssh to fritz.nhr.fau.de from there.
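
A minimal sketch of this two-step login from outside the FAU networks (“username” is a placeholder for your HPC account name):

# on your local machine, outside the FAU networks
ssh username@cshpc.rrze.fau.de
# on cshpc, hop on to Fritz
ssh username@fritz.nhr.fau.de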

SSH public host keys of the (externally) reachable hosts

SSH public host keys of fritz.nhr.fau.de (as of 11/2021)
ssh-dss AAAAB3NzaC1kc3MAAACBAJj7qaKbgbgtUCB8iXOzCY8Vs0ZbdnlZsNDdKZct3vOzt2B4js2yBEGs0Fsmvjy88ro33TeI1JjMsnNP6T4wmeNPIGLkUvtX2fPcLL3c9WbVkeOf3R2b5VMRIJ+l3rVgwvHihBJFcgaAQO/mB75hdtzY6Pk5cReVYR2uidD3HkKtAAAAFQC1cIYQ0jUUXnJeeMG/t/8muhFaxQAAAIBrGWK0GhIAFu9cSy/pVXiJP7rJrSvsDGGdZ1HgB7GEjeN4O9ZFZS+vj2m4ij0LjOVtulf9LDJR56fA+3Jpjcu32/L7IPCm5/nqJd7An9/xt8D+tUPhOZfRugol9f6tV/oDRI3Y7rMDjChpjpkuN9bP2vshveHLlA0WB9Lqdgu2fgAAAIBKS/RFirbOnuP38OJ6mTXLeSlNsLEs+zW+vHhL5a08MXrAUQHYUwZplH2bNQpMyeRH55UoRJC0XDHpJzW8yafcwpO6k7uL1CWi3Gnhya9EbX2GIe8cYRrhYhcO+0M8UrVXKksVVWyAfkXZsIjTQCEcsCNhl0no5xS0/yOB6b6WzQ== fritz.nhr.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBBxc984fY6EkjlFQBFTtu/9X9EolCSz1OHNzaa8VWBj5TxV9GF8RTBXJ6why2AdK3dVrv+Qyko+X5vsMMflEiRc= fritz.nhr.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILqv3FDYom0c4HgfCzLw9Ts2PE0GYqWaaOrM9EfQxvTI fritz.nhr.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDLCrYuIIKhA+F4hktnR4VKAZ44J6CWIMfC9mCei80YZ0294kFwpYQg3RYRLHSyL6XqLgxaZN2kFm0V0NpUpEYSP2V9eWpuyfeB6a3M1I8yy4rDagHgWuHYuX2fSm8uwfndnJJ6hV1xfZuoZrZJIMkdy8qBl4y1cxn8G6CS1KEFkeJp7wMuIdIruFbJa5eQXVgAxaqKPQYRldpK8c1OAByfQv9cBXF53cNZhtlwkUes6/PqNyU1aIodfahdYh6mxn/4Rzy+NMD0YS066P8xWP1n+bsTBpZ51pH7qTIiW1yRKFmAeFvkWVnS6N5qIwJnzB3J7DRUue9h1EhW4HCo6CEX3GCOt0kuV4ax0JgYO/Lz0cUTdDcgkWOpVtQ+WyLNUech+TsOREn19QjaK9QRriBOvBcNCnbpBXHZSqfOYGB6uggkVjyDPI6S5pclt544ie6pklAOSzrha5CLnzD4U8oVuqhFHteO39qXpbvxkUDuNsDf9t8K5fmWgCXXWtJSVhE= fritz.nhr.fau.de

fingerprints of the SSH public host keys of fritz.nhr.fau.de (as of 11/2021)

1024 SHA256:1jZOatgzkjn7G3b1/K48T+cQ8MUB60oU7j+CK5FWId8 fritz.nhr.fau.de (DSA)
256  SHA256:2NzGq1vTO//RNtVLEGrp5yMyPHmtAw6nguSxcFUHHWU fritz.nhr.fau.de (ECDSA)
256  SHA256:5km4SnsTbyBG6gX1y11imEEU8QKP8EbrqFNPce1eEU4 fritz.nhr.fau.de (ED25519)
3072 SHA256:p7+HzbUyVSjYh2hx2nQcrIuWnZZKJhnNUoI9kz+Q4yw fritz.nhr.fau.de (RSA)

SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

ssh-dss AAAAB3NzaC1kc3MAAACBAO2L8+7bhJm7OvvJMcdGSJ5/EaxvX5RRzE9RrB8fx5H69ObkqC6Baope4rOS9/+2gtnm8Q3gZ5QkostCiKT/Wex0kQQUmKn3fx6bmtExLq8YwqoRXRmNTjBIuyZuZH9w/XFK36MP63p/8h7KZXvkAzSRmNVKWzlsAg5AcTpLSs3ZAAAAFQCD0574+lRlF0WONMSuWeQDRFM4vwAAAIEAz1nRhBHZY+bFMZKMjuRnVzEddOWB/3iWEpJyOuyQWDEWYhAOEjB2hAId5Qsf+bNhscAyeKgJRNwn2KQMA2kX3O2zcfSdpSAGEgtTONX93XKkfh6JseTiFWos9Glyd04jlWzMbwjdpWvwlZjmvPI3ATsv7bcwHji3uA75PznVUikAAACBANjcvCxlW1Rjo92s7KwpismWfcpVqY7n5LxHfKRVqhr7vg/TIhs+rAK1XF/AWxyn8MHt0qlWxnEkbBoKIO5EFTvxCpHUR4TcHCx/Xkmtgeq5jWZ3Ja2bGBC3b47bHHNdDJLU2ttXysWorTXCoSYH82jr7kgP5EV+nPgwDhIMscpk cshpc.rrze.fau.de
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBNVzp97t3CxlHtUiJ5ULqc/KLLH+Zw85RhmyZqCGXwxBroT+iK1Quo1jmG6kCgjeIMit9xQAHWjS/rxrlI10GIw= cshpc.rrze.fau.de
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPSIFF3lv2wTa2IQqmLZs+5Onz1DEug8krSrWM3aCDRU cshpc.rrze.fau.de
1024 35 135989634870042614980757742097308821255254102542653975453162649702179684202242220882431712465065778248253859082063925854525619976733650686605102826383502107993967196649405937335020370409719760342694143074628619457902426899384188195801203193251135968431827547590638365453993548743041030790174687920459410070371 cshpc.rrze.fau.de
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAs0wFVn1PN3DGcUtd/JHsa6s1DFOAu+Djc1ARQklFSYmxdx5GNQMvS2+SZFFa5Rcw+foAP9Ks46hWLo9mOjTV9AwJdOcSu/YWAhh+TUOLMNowpAEKj1i7L1Iz9M1yrUQsXcqDscwepB9TSSO0pSJAyrbuGMY7cK8m6//2mf7WSxc= cshpc.rrze.fau.de

fingerprints of the SSH public host keys of cshpc.rrze.fau.de (as of 11/2021)

1024 SHA256:A82eA7py46zE/TrSTCRYnJSW7LZXY16oOBxstJF3jxU cshpc.rrze.fau.de (DSA)
256  SHA256:wFaDywle3yJvygQ4ZAPDsi/iSBTaF6Uoo0i0z727aJU cshpc.rrze.fau.de (ECDSA)
256  SHA256:is52MRsxMgxHFn58o0ZUh8vCzIuE2gYanmhrxdy0rC4 cshpc.rrze.fau.de (ED25519)
1024 SHA256:Za1mKhTRFDXUwn7nhPsWc7py9a6OHqS2jin01LJC3ro cshpc.rrze.fau.de (RSA)

While it is possible to ssh directly to a compute node, users are only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically.

Software environment

The login and compute nodes run AlmaLinux 8 (which is basically Red Hat Enterprise Linux 8 without the support).

The login shell for all users on Fritz is always bash and cannot be changed.

As on many other HPC systems, environment modules are used to facilitate access to software packages. Type “module avail” to get a list of available packages. Even more packages will become visible once one of the 000-all-spack-pkgs modules has been loaded. Most of the software is installed using “Spack“ as an enhanced HPC package manager.
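
A short sketch of this workflow (the exact name/version of the 000-all-spack-pkgs module may differ; check module avail):

# list the modules visible by default
module avail
# make additional Spack-built packages visible
module load 000-all-spack-pkgs
module avail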

General notes on how to use certain software on our systems (including, in some cases, sample job scripts) can be found on the Special applications, and tips & tricks pages. Specific notes on how some software provided via modules on the Fritz cluster has been compiled can be found in the following accordion:

Intel tools (compiler, MPI, MKL, TBB)

Intel oneAPI is installed in the “Free User” edition via Spack.

The modules intel (and the Spack-internal intel-oneapi-compilers) provide the legacy Intel compilers icc, icpc, and ifort as well as the new LLVM-based ones (icx, icpx, dpcpp, ifx).

Recommended compiler flags are: TBD
-O3 will not generate AVX/AVX2/AVX512 code!

The modules intelmpi (and the Spack-internal intel-oneapi-mpi) provide Intel MPI. To use the legacy Intel compilers with Intel MPI, just use the corresponding wrappers with the Intel compiler names, i.e. mpiicc, mpiicpc, mpiifort. To use the new LLVM-based Intel compilers with Intel MPI, you have to specify them explicitly, i.e. use mpiicc -cc=icx, mpiicpc -cxx=icpx, or mpiifort -fc=ifx. Executing mpicc, mpicxx, or mpif90 results in using the GNU compilers.
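
A minimal sketch of these wrapper invocations (the compiler flags shown are only placeholders, since the recommended flags are still TBD):

module load intel intelmpi
# legacy Intel compiler via the classic wrapper
mpiicc -O3 -o app_classic app.c
# new LLVM-based Intel compiler, selected explicitly
mpiicc -cc=icx -O3 -o app_llvm app.c
# GNU compiler via the generic wrapper
mpicc -O3 -o app_gnu app.c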

The modules mkl and tbb (and the Spack-internal intel-oneapi-mkl and intel-oneapi-tbb) provide Intel MKL and TBB. Use Intel’s MKL link line advisor to figure out the appropriate command line for linking with MKL. The Intel MKL also includes drop-in wrappers for FFTW3.

Further Intel tools may be added in the future.

The Intel modules on Fritz, Alex, and the Slurm-based TinyGPU/TinyFat clusters behave differently than on the older RRZE systems: (1) The intel64 module has been renamed to intel and no longer automatically loads intel-mpi and mkl. (2) intel-mpi/VERSION-intel and intel-mpi/VERSION-gcc have been unified into intel-mpi/VERSION. The compiler is selected by the wrapper name, e.g. mpicc = GCC, mpiicc = Intel; mpif90 = GFortran, mpiifort = Intel.

GNU compiler (gcc/g++/gfortran)

The GNU compilers are available in the version coming with the operating system (currently 8.5.0) as well as via modules (currently versions 9.4, 10.3, and 11.2).

Recommended compiler flags are: TBD

-O3 will not generate AVX/AVX2/AVX512 code!
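
A minimal sketch, assuming a GCC module named gcc (the module name and the architecture flag are assumptions, since the recommended flags are still TBD):

module load gcc/11.2
# -O3 alone does not enable AVX/AVX2/AVX-512; request the target architecture explicitly
gcc -O3 -march=icelake-server -fopenmp -o app app.c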

Open MPI

Open MPI is the default MPI for the Fritz cluster (TO BE CONFIRMED). Usage of srun instead of mpirun is recommended. (TO BE CONFIRMED)

Open MPI is built using Spack:

  • with the compiler mentioned in the module name; the corresponding compiler will be loaded as a dependency when the Open MPI module is loaded
  • without support for thread-multiple
  • with fabrics=ucx
  • with support for Slurm as scheduler (and internal PMIx of Open MPI)
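
A minimal usage sketch (the exact Open MPI module name is an assumption; check module avail):

module load openmpi
mpicc -O3 -o mpi_app mpi_app.c
# inside a Slurm job, launching with srun is recommended (see above)
srun ./mpi_app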

Python and conda environments

TBD

GDB - GNU Project debugger

When using gdb -p <pid> (or the equivalent attach <pid> command in gdb) to attach to a process running in a Slurm job, you might encounter errors or warnings about executable and library files that cannot be opened. Such issues will also prevent symbols from being resolved correctly, making debugging very difficult.

The reason is that processes in a Slurm job get a slightly different view of the file system mounts (using a so-called namespace). When you want to attach GDB to a running process and use SSH to log into the node where the process is running, the gdb process will not be in the same namespace, so GDB has trouble directly accessing the binary (and its libraries) you are trying to debug.

The workaround is to use a slightly different method for attaching to the process:

  1. $ gdb <executable>
  2. (gdb) set sysroot /
  3. (gdb) attach <pid>

(Thanks to our colleagues at SURFsara for figuring this out!)

Arm DDT

Arm DDT is a powerful parallel debugger. NHR@FAU holds a license for 32 processes.

Amber

NHR@FAU holds a “compute center license” of Amber; thus, Amber is generally available to everyone for non-profit use, i.e. for academic research.

Amber usually delivers the most economic performance using GPGPUs. Thus, the Alex GPGPU cluster might be a better choice.

Gromacs

We provide Gromacs versions without and with PLUMED. Gromacs (and PLUMED) are built using Spack.

Gromacs often delivers the most economic performance if GPGPUs are used. Thus the Alex GPGPU cluster might be a better choice.

If running on Fritz, it is mandatory in most cases to optimize the number of PME processes experimentally. Note: “pme_tune” requires further work, as a non-MPI binary has to be used.

TODO: How to exactly run gmx pme_tune …

Do not start gmx mdrun with the option -v. The verbose output only creates extra-large Slurm stdout files, and your jobs will suffer if the NFS servers are under high load. There is also very little benefit in continuously seeing in stdout when the job is expected to reach the specified number of steps.
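
A minimal sketch of a Gromacs run inside a Slurm job script (the module name, the MPI binary name gmx_mpi, and the input file topol.tpr are assumptions):

module load gromacs
# do not add -v (see note above); topol.tpr is a placeholder input file
srun gmx_mpi mdrun -s topol.tpr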

LAMMPS

The module lammps/20211027-gcc11.2.0-ompi-mkl has been compiled using GCC 11.2.0, Open MPI 4.1.1, and Intel oneAPI MKL with the following CMake options:
  • -DBUILD_SHARED_LIBS:BOOL=ON -DLAMMPS_EXCEPTIONS:BOOL=OFF -DBUILD_MPI=ON -DBUILD_OMP:BOOL=ON -DPKG_OPENMP=ON -DPKG_GPU=OFF -DBUILD_LIB=ON -DWITH_JPEG:BOOL=ON -DWITH_PNG:BOOL=ON -DWITH_FFMPEG:BOOL=ON -DPKG_ASPHERE=ON -DPKG_BODY=ON -DPKG_CLASS2=ON -DPKG_COLLOID=ON -DPKG_COMPRESS=ON -DPKG_CORESHELL=ON -DPKG_DIPOLE=ON -DPKG_GRANULAR=ON -DPKG_KSPACE=ON -DPKG_KOKKOS=ON -DPKG_LATTE=ON -DPKG_MANYBODY=ON -DPKG_MC=ON -DPKG_MEAM=OFF -DPKG_MISC=ON -DPKG_MLIAP=OFF -DPKG_MOLECULE=ON -DPKG_MPIIO=ON -DPKG_OPT=OFF -DPKG_PERI=ON -DPKG_POEMS=ON -DPKG_PYTHON=ON -DPKG_QEQ=ON -DPKG_REPLICA=ON -DPKG_RIGID=ON -DPKG_SHOCK=ON -DPKG_SNAP=ON -DPKG_SPIN=ON -DPKG_SRD=ON -DPKG_USER-ATC=ON -DPKG_USER-ADIOS=OFF -DPKG_USER-AWPMD=OFF -DPKG_USER-BOCS=OFF -DPKG_USER-CGSDK=OFF -DPKG_USER-COLVARS=OFF -DPKG_USER-DIFFRACTION=OFF -DPKG_USER-DPD=OFF -DPKG_USER-DRUDE=OFF -DPKG_USER-EFF=OFF -DPKG_USER-FEP=OFF -DPKG_USER-H5MD=ON -DPKG_USER-LB=ON -DPKG_USER-MANIFOLD=OFF -DPKG_USER-MEAMC=ON -DPKG_USER-MESODPD=OFF -DPKG_USER-MESONT=OFF -DPKG_USER-MGPT=OFF -DPKG_USER-MISC=ON -DPKG_USER-MOFFF=OFF -DPKG_USER-NETCDF=ON -DPKG_USER-OMP=ON -DPKG_USER-PHONON=OFF -DPKG_USER-PLUMED=OFF -DPKG_USER-PTM=OFF -DPKG_USER-QTB=OFF -DPKG_USER-REACTION=OFF -DPKG_USER-REAXC=ON -DPKG_USER-SDPD=OFF -DPKG_USER-SMD=OFF -DPKG_USER-SMTBQ=OFF -DPKG_USER-SPH=OFF -DPKG_USER-TALLY=OFF -DPKG_USER-UEF=OFF -DPKG_USER-YAFF=OFF -DPKG_VORONOI=ON -DPKG_KIM=ON -DFFT=MKL -DEXTERNAL_KOKKOS=ON
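
A minimal usage sketch for this module (the binary name lmp and the input file are assumptions):

module load lammps/20211027-gcc11.2.0-ompi-mkl
# in.melt is a placeholder input script
srun lmp -in in.melt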

NAMD

NAMD comes with a license which prohibits us from “just installing it so that everyone can use it”. We therefore need individual users to print and sign the NAMD license. Subsequently, we will set the permissions accordingly.

TODO – no module yet

At the moment, we provide the official pre-built Linux-x86_64-multicore binary.

VASP

VASP comes with a license which prohibits us from “just installing it so that everyone can use it”. We have to check each VASP user individually.

At the moment we provide VASP 6.3.0 modules to eligible users. The module vasp6/6.3.0-hybrid-intel-impi-AVX2-with-addons includes DFTD4, libbeef, and sol_compat/VASPsol.

Feel free to compile software yourself in the versions and with the options you need. This is perfectly fine, but support for self-installed software cannot be granted. We can only provide software centrally which is of importance for multiple groups. If you want to use Spack for compiling additional software, you can load our user-spack module to make use of the packages we already built with Spack (if the concretization matches) instead of starting from scratch. Once user-spack is loaded, the command spack will be available (as an alias), you will inherit the pre-sets we defined for certain packages (e.g. Open MPI built to work with Slurm), but you will install everything into your own directories ($WORK/USER-SPACK).
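
A short sketch of this workflow (the package name fftw is only an example):

module load user-spack
# show how the package would be concretized, then install it into $WORK/USER-SPACK
spack spec fftw
spack install fftw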

You can also bring your own environment in a container using Singularity. However, building Singularity containers on the HPC systems themselves is not supported (as that would require root access). The InfiniBand drivers from the host are not mounted into your container. All file systems will be available by default in the container. In certain use cases it might be a good idea to avoid bind-mounting your normal $HOME directory with all its “dot directories” into the container, by explicitly specifying a different directory, e.g. -H $HOME/my-container-home.
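
A minimal sketch (the image file my-image.sif and the command inside the container are placeholders):

# use a separate container-home instead of your normal $HOME
singularity exec -H $HOME/my-container-home my-image.sif ./my_app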

File Systems

The following table summarizes the available file systems and their features. It is only an excerpt from the description of the HPC file system.

Further details will follow once Fritz is open for users.

Fritz has a parallel Lustre file system. As this file system still has to be rebuilt from scratch, users do not get a directory in /lustre yet by default. Contact us if you urgently need a directory there – but keep in mind that all data will be lost once the parallel file system is rebuilt.

Quota in $HOME is very limited, as snapshots are made every 30 minutes. Put simulation data in $WORK! Do not rely on the specific path of $WORK, as it may change over time when your work directory is relocated to a different NFS server.

Batch processing

As with all production clusters at RRZE, resources are controlled through a batch system. The front ends can be used for compiling and for very short serial test runs; everything else has to go through the batch system to the cluster.

Fritz uses SLURM as a batch system. Please see our general batch system description for further details.

The granularity of batch allocations is complete nodes, i.e. nodes are never shared. As a parallel computer, Fritz is not made for single-node jobs, since a lot of money was spent on the fast HDR100 interconnect.

 

Partitions on the Fritz cluster (preliminary definition)

Partition     min – max walltime   min – max nodes   Availability   Comments
singlenode    0 – 24:00:00         1                 always         Jobs run on nodes without InfiniBand; nodes are exclusive
multinode     0 – 24:00:00         1 – 32            on demand      Jobs run on nodes with InfiniBand; nodes are exclusive

The partition configuration is subject to change as long as Fritz is not ready for production use.

As long as a considerable number of nodes do not yet have their InfiniBand network card, single-node jobs are acceptable in the dedicated `singlenode` partition.

Interactive job (single-node)

Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:

salloc -N 1 --partition=singlenode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Interactive job (multi-node)

Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.

The following will give you four nodes with an interactive shell on the first node for one hour:

salloc -N 4 --partition=multinode --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

MPI parallel job (single-node)

In this example, the executable will be run on one node, using 72 MPI processes, i.e. one per physical core.

#!/bin/bash -l
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=72
#SBATCH --partition=singlenode
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV
module load XXX 

srun ./mpi_application

OpenMP job (single-node)

In this example, the executable will be run using 72 OpenMP threads (i.e. one per physical core) for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=72
#SBATCH --partition=singlenode
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
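# pin OpenMP threads to physical cores (as recommended above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true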
./openmp_application

Hybrid OpenMP/MPI job (single-node)

In this example, the executable will be run using 2 MPI processes with 36 OpenMP threads each (i.e. one per physical core) for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=36
#SBATCH --partition=singlenode
#SBATCH --time=1:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
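# pin OpenMP threads to physical cores (as recommended above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true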
srun ./hybrid_application

MPI parallel job (multi-node)

In this example, the executable will be run on four nodes, using 72 MPI processes per node, i.e. one per physical core.

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=72
#SBATCH --partition=multinode
#SBATCH --time=1:0:0
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

srun ./mpi_application

Hybrid OpenMP/MPI job (multi-node)

In this example, the executable will be run on four nodes with 2 MPI processes per node and 36 OpenMP threads per process (i.e. one per physical core) for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --partition=multinode
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=36
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
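# pin OpenMP threads to physical cores (as recommended above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true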
srun ./hybrid_application

Further Information

Intel Xeon Platinum 8360Y “IceLake” Processor

Hyperthreading (SMT) is disabled; sub-NUMA clustering (Cluster-on-Die, CoD) is activated. This results in 4 NUMA domains with 18 cores each per compute node.

The processor can be operated in 3 modes; in Fritz it’s running in its default mode with 36 cores and 250 W TDP.

Launch Date: Q2’21
Lithography: 10 nm
Total Cores (Threads): 36 (72 – SMT is disabled on Fritz)
Max Turbo Frequency (non-AVX code): 3.50 GHz (significantly lower for heavy AVX2/AVX-512 workloads)
Processor Base Frequency (non-AVX code): 2.40 GHz (significantly lower for heavy AVX2/AVX-512 workloads)
Last Level Cache (L3): 54 MB
# of UPI Links: 3
TDP: 250 W
Memory Channels & Memory Type: 8 channels DDR4 @ 3200 per socket (in Fritz: 16x 16 GB DDR4-3200 per node)
Instruction Set Extensions: Intel SSE4.2, Intel AVX, Intel AVX2, Intel AVX-512
# of AVX-512 FMA Units: 2

See https://ark.intel.com/content/www/us/en/ark/products/212459/intel-xeon-platinum-8360y-processor-54m-cache-2-40-ghz.html for full processor details.

Network topology

Fritz uses unmanaged 40-port Mellanox HDR switches. 10 HDR200 links per edge switch are connected to the spine level. Using splitter cables, 60 compute nodes are connected with HDR100 to each edge switch. This results in a 1:4 blocking factor of the fat tree. Each island with 60 nodes has a total of 4,320 cores. Slurm is aware of the topology, but minimizing the number of switches per job does not have a high priority.

Direct liquid cooling (DLC) of the compute nodes
