Index

Test cluster

The NHR@FAU test and benchmark cluster is an environment for porting software to new CPU architectures and running benchmark tests. It comprises a variety of nodes with different processors, clock speeds, memory speeds, memory capacity, number of CPU sockets, etc. There is no high-speed network, and MPI parallelization is restricted to one node. The usual NFS file systems are available.

This is a testing ground. Any job may be canceled without prior notice. For further information about proper usage, please contact NHR@FAU.

This is a quick overview of the systems including their host names (frequencies are nominal values) – NDA systems are not listed:

aurora1: Single Intel Xeon “Skylake” Gold 6126 CPU (12 cores + SMT) @ 2.60GHz.
Accelerators: 2x NEC Aurora “TSUBASA” 10B (48 GiB RAM)
broadep2: Dual Intel Xeon “Broadwell” CPU E5-2697 v4 (2x 18 cores + SMT) @ 2.30GHz, 128 GiB RAM
casclakesp2: Dual Intel Xeon “Cascade Lake” Gold 6248 CPU (2x 20 cores + SMT) @ 2.50GHz, 384 GiB RAM
euryale: Dual Intel Xeon “Broadwell” CPU E5-2620 v4 (2x 8 cores) @ 2.10GHz, 64 GiB RAM
Accelerator: AMD RX 6900 XT (16 GB)
genoa1: Dual AMD EPYC 9654 “Genoa” CPU (2x 96 cores + SMT) @ 2.40GHz, 768 GiB RAM
genoa2: Dual AMD EPYC 9354 “Genoa” CPU (2x 32 cores + SMT) @ 3.25GHz, 768 GiB RAM.
Accelerators:
– ~~NVIDIA A40 (48 GiB GDDR6)~~NVIDIA L40s (48 GiB GDDR6)
– NVIDIA L40 (48 GiB GDDR6)
gracehop1: ARM aarch64
gracesup1: ARM aarch64
hasep1: Dual Intel Xeon “Haswell” E5-2695 v3 CPU (2x 14 cores + SMT) @ 2.30GHz, 64 GiB RAM
icx32: Dual Intel Xeon “Icelake” Platinum 8358 CPU (2x 32 cores + SMT) @ 2.60GHz, 256 GiB RAM
icx36: Dual Intel Xeon “Icelake” Platinum 8360Y CPU (2x 36 cores + SMT) @ 2.40GHz, 256 GiB RAM
interlagos1: Dual AMD Opteron 6276 “Interlagos” CPU (2x 16 cores) @ 2.3 GHz, 64 GiB RAM.
Accelerator: AMD Radeon VII GPU (16 GiB HBM2)
ivyep1: Dual Intel Xeon “Ivy Bridge” E5-2690 v2 CPU (2x 10 cores + SMT) @ 3.00GHz, 64 GiB RAM
lukewarm: Dual ARM Ampere Altra Max M128-30 (2x 128 cores) @ 2.8 GHz, 512 GB RAM (DDR4-3200); ARM aarch64
medusa: Dual Intel Xeon “Cascade Lake” Gold 6246 CPU (2x 12 cores + SMT) @ 3.30GHz, 192 GiB RAM.
~~Accelerators:~~
~~– NVIDIA GeForce RTX 2070 SUPER (8 GiB GDDR6)~~
~~– NVIDIA GeForce RTX 2080 SUPER (8 GiB GDDR6)~~
~~– NVIDIA Quadro RTX 5000 (16 GiB GDDR6)~~
~~– NVIDIA Quadro RTX 6000 (24 GiB GDDR6)~~
optane1: Dual Intel Xeon “Ice Lake” Platinum 8362 CPU (2x 32 cores + SMT) @ 2.80 GHz, 256 GiB RAM, 1024 GiB Optane Memory
milan1: Dual AMD EPYC 7543 “Milan” CPU (32 cores + SMT) @ 2.8 GHz, 256 GiB RAM
Accelerator: AMD MI210 (64 GiB HBM2e)
naples1: Dual AMD EPYC 7451 “Naples” CPU (2x 24 cores + SMT) @ 2.3 GHz, 128 GiB RAM
phinally: Dual Intel Xeon “Sandy Bridge” CPU E5-2680 (8 cores + SMT) @ 2.70GHz, 64 GiB RAM
rome1: Single AMD EPYC 7452 “Rome” CPU (32 cores + SMT) @ 2.35 GHz, 128 GiB RAM
rome2: Dual AMD EPYC 7352 “Rome” CPU (24 cores + SMT) @ 2.3 GHz, 256 GiB RAM
Accelerators:
– AMD MI100 (32 GiB HBM2)
– AMD MI210 (64 GiB HBM2e)
skylakesp2: Intel Xeon “Skylake” Gold 6148 CPU (2x 20 cores + SMT) @ 2.40GHz, 96 GiB RAM
summitridge1: AMD Ryzen 7 1700X CPU (8 cores + SMT), 32 GiB RAM
warmup: Dual Cavium/Marvell “ThunderX2” (ARMv8) CN9980 (2x 32 cores + 4-way SMT) @ 2.20 GHz, 128 GiB RAM; ARM aarch64

Technical specifications of all more or less recent GPUs available at RRZE (either in the Testcluster or in TinyGPU):

	RAM	BW [GB/s]	Ref Clock [GHz]	Cores Shader/TMUs/ROPs	TDP [W]	SP [TFlop/s]	DP [TFlop/s]	Host	Host CPU (base clock frequency)
Nvidia Geforce GTX980	4 GB GDDR5	224	1,126	2.048/128/64	180	4,98	0,156	~~tg00x~~	~~Intel Xeon Nehalem X5550 (4 Cores, 2.67GHz)~~
Nvidia Geforce GTX1080	8 GB GDDR5	320	1,607	2.560/160/64	180	8,87	0,277	tg03x	Intel Xeon Broadwell E5-2620 v4 (8 C, 2.10GHz)
Nvidia Geforce GTX1080Ti	11 GB GDDR5	484	1,480	3.584/224/88	250	11,34	0,354	tg04x	Intel Xeon Broadwell E5-2620 v4 (2x 8 C, 2.10GHz)
Nvidia Geforce RTX2070Super	8 GB GDDR6	448	1,605	2.560/160/64	215	9,06	0,283	medusa	Intel Xeon Cascadelake Gold 6246 (2x 12 C, 3.30GHz)
Nvidia Quadro RTX5000	16 GB GDDR6	448	1,620	3.072/192/64	230	11,15	0,348	medusa	Intel Xeon Cascadelake Gold 6246 (2x 12 C, 3.30GHz)
Nvidia Geforce RTX2080Super	8 GB GDDR6	496	1,650	3.072/192/64	250	11,15	0,348	medusa	Intel Xeon Cascadelake Gold 6246 (2x 12 C, 3.30GHz)
Nvidia Geforce RTX2080Ti	11 GB GDDR6	616	1,350	4.352/272/88	250	13,45	0,420	tg06x	Intel Xeon Skylake Gold 6134 (2x 8 Cores + SMT, 3.20GHz)
Nvidia Quadro RTX6000	24 GB GDDR6	672	1,440	4608/288/96	260	16,31	0,510	medusa	Intel Xeon Cascadelake Gold 6246 (2x 12 C, 3.30GHz)
Nvidia Geforce RTX3080	10 GB, GDDR6X	760	1.440	8.704	320	29.77	0.465	tg08x	Intel Xeon IceLake Gold 6226R (2x 32 cores + SMT, 2.90GHz)
Nvidia Tesla V100 (PCIe, passive)	32 GB HBM2	900	1,245	5.120 Shader	250	14,13	7,066	tg07x	Intel Xeon Skylake Gold 6134 (2x 8 Cores + SMT, 3.20GHz)
Nvidia A40 (passiv)	48 GB GDDR6	696	1.305	10.752 Shader	300	37.42	1,169	genoa2	AMD Genoa 9354 (2x 32 cores + SMT, 3.25 GHz)
Nvidia A100 (SMX4/NVlink, passive)	40 GB HBM2	1.555	1.410	6.912 Shader	400	19,5	9.7	tg09x	AMD Rome 7662 (2x 64 Cores, 2.0GHz)
Nvidia L40 (passiv)	48 GB GDDR6	864	0.735	18.176 Shader	300	90.52	1.414	genoa2	AMD Genoa 9354 (2x 32 cores + SMT, 3.25 GHz)
AMD Instinct MI100 (PCIe Gen4, passive)	32 GB HBM2	1229	1,502	120 Compute Units / 7680 Cores	300	21,1	11,5	rome2	AMD Rome 7352 (2x 24 cores + SMT, 2.3 GHz)
AMD Radeon VII	16 GB HBM2	1,024	1,400	3,840/240/64	300	13.44	3.360	interlagos1	AMD Interlagos Opteron 6276
AMD Instinct MI210 (PCIe Gen4, passive)	64 GB HBM2e	1,638	1,000	104 Compute Units / 6,656 Cores	300	22.6	22.6	milan1, rome2	AMD Milan 7543 (2×32 cores + SMT, 2.8 GHz), AMD Rome 7352 (2x 24 cores + SMT, 2.3 GHz)

This website shows information regarding the following topics:

Access, User Environment, File Systems

Access, User Environment, and File Systems

Access to the machine

Note that access to the test cluster is restricted: If you want access to it, you will need to contact hpc@rrze. In order to get access to the NDA machines you have to provide a short (!) description of what you want to do there.

From within the FAU network, users can connect via SSH to the frontend
testfront.rrze.fau.de
If you need access from outside of FAU, you usually have to connect for example to the dialog server cshpc.rrze.fau.de first and then ssh to testfront from there.

While it is possible to ssh directly to a compute node, a user is only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically.

The login nodes and most of the compute nodes run Ubuntu 18.04. As on most other RRZE HPC systems, a modules environment is provided to facilitate access to software packages. Type “module avail” to get a list of available packages. Note that, depending on the node, the modules may be different due to the wide variety of architectures. Expect inconsistencies. In case of questions, contact hpc@rrze.

File Systems

The nodes have local hard disks of very different capacities and speeds. These are not production systems, so do not expect a production environment.

When connecting to the front end node, you’ll find yourself in your regular RRZE $HOME directory (/home/hpc/...). There are relatively tight quotas there, so it will most probably be too small for the inputs/outputs of your jobs. It however does offer a lot of nice features, like fine grained snapshots, so use it for “important” stuff, e.g. your job scripts, or the source code of the program you’re working on. See the HPC file system page for a more detailed description of the features and the other available file systems including, e.g., $WORK.

Batch processing

As with all production clusters at RRZE, resources are controlled through a batch system, SLURM in this case. Due to the broad spectrum of architectures in the test cluster, it is usually advisable to compile on the target node using an interactive SLURM job (see below).

There is a “work” queue and an “nda” queue, both with up to 24 hours of runtime. Access to the “nda” queue is restricted because the machines tied to this queue are pre-production hardware or otherwise special so that benchmark results must not be published without further consideration.

Batch jobs can be submitted on the frontend. The default job runtime is 10 minutes.

The currently available nodes can be listed using:

sinfo -o "%.14N %.9P %.11T %.4c %.8z %.6m %.35f"

To select a node, you can either use the host name or a feature name from sinfo:

sbatch --nodes=1 --constraint=featurename --time=hh:mm:ss --export=NONE jobscript
sbatch --nodes=1 --nodelist=hostname --time=hh:mm:ss --export=NONE jobscript

By default, SLURM exports the environment of the shell where the job was submitted. If this is not desired, use --export=NONE and unset SLURM_EXPORT_ENV. Otherwise, problems may arise on nodes that do not run Ubuntu.

Submitting an interactive job:

salloc --nodes=1 --nodelist=hostname --time=hh:mm:ss

For getting access to performance counter registers and other restricted parts of the hardware (so that likwid-perfctr and other LIKWID tools works as intended), use the constraint -C hwperf. The Linux kernel’s NUMA balancing feature can be turned off with -C numa_off. When the system should use huge pages transparently for the applications use -C thp_always to switch to always mode. For the specification of multiple constraints, combine them with & and proper quoting like -C "hwperf&thp_always".

Please see the batch system description for further details.

VASP

Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modeling, e.g. electronic structure calculations and quantum-mechanical molecular dynamics, from first principles.

Availability / Target HPC systems

VASP requires an individual license.

Notes

Parallelization and optimal performance:
- (try to) always use full nodes (PPN=20 for Meggie)
- NCORE=5/10 & PPN=20 results in optimal performance in almost all cases, in general NCORE should be a divisor of PPN
- OpenMP parallelization is supposed to supersede NCORE
- use kpar if possible
Compilation:
- use -Davoidalloc
- use Intel toolchain and MKL
- in case of very large jobs with high memory requirements add ‘ -heap-arrays 64’ to Fortran flags before compilation (only possible for Intel ifort)
Filesystems:
- Occasionally VASP user reported failing I/O on Meggie’s $FASTTMP (/lxfs), this might be a problem with Lustre and Fortran I/O. Please try to use the fix described here: https://github.com/RRZE-HPC/getcwd-autoretry-preload
- Since VASP does not do parallel MPI I/O, $WORK is more appropriate than $FASTTMP
- For medium sized jobs, even node local /dev/shm/ might be an option
Walltime limit:
- VASP can only be gracefully stopped by creating the file “STOPCAR” https://www.vasp.at/wiki/index.php/STOPCAR automatic creation is shown in the example scripts below

Sample job scripts

#! /bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --tasks-per-node=20
#SBATCH --time=24:00:00
#SBATCH --job-name=my-vasp
#SBATCH --mail-user=my.mail
#SBATCH --mail-type=ALL
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
#enter submit directory
cd $SLURM_SUBMIT_DIR
#load modules
module load intel
module load intelmpi
module load mkl
#set PPN and pinning
export PPN=20
export I_MPI_PIN=enable
#define executable:
VASP=/path-to-your-vasp-installation/vasp
#create STOPCAR with LSTOP 1800s before reaching walltimelimit
lstop=1800
#create STOPCAR with LABORT 600s before reaching walltimelimit
labort=600
#automatically detect how much time this batch job requested and adjust the 
# sleep accordingly 
TIMELEFT=$(squeue -j $SLURM_JOBID -o %L -h)
HHMMSS=${TIMELEFT#*-}
[ $HHMMSS != $TIMELEFT ] && DAYS=${TIMELEFT%-*} 
IFS=: read -r HH MM SS <<< $TIMELEFT
[ -z $SS ] && { SS=$MM; MM=$HH; HH=0 ; }
[ -z $SS ] && { SS=$MM; MM=0; }
#timer for STOP = .TRUE.
SLEEPTIME1=$(( ( ( ${DAYS:-0} * 24 + 10#${HH} ) * 60 + 10#${MM} ) * 60 + 10#$SS - $lstop ))
echo "Available runtime: ${DAYS:-0}-${HH:-0}:${MM:-0}:${SS}, sleeping for up to $SLEEPTIME1, thus reserving $lstop for clean stopping/saving results"
#timer for LABORT = .TRUE.
SLEEPTIME2=$(( ( ( ${DAYS:-0} * 24 + 10#${HH} ) * 60 + 10#${MM} ) * 60 + 10#$SS - $labort ))
echo "Available runtime: ${DAYS:-0}-${HH:-0}:${MM:-0}:${SS}, sleeping for up to $SLEEPTIME2, thus reserving $labort for clean stopping/saving results"
(sleep ${SLEEPTIME1} ; echo "LSTOP = .TRUE." > STOPCAR) &
lstoppid=!$
(sleep ${SLEEPTIME2} ; echo "LABORT = .TRUE." > STOPCAR) &
labortpid=!$
mpirun -ppn $PPN $VASP 
pkill -P $lstoppid
pkill -P $labortpid

#!/bin/bash -l
#SBATCH –nodes=2
#SBATCH –ntasks-per-node=4
#SBATCH –cpus-per-task=18
#SBATCH –partition=multinode
#SBATCH –time=01:00:00
#SBATCH –export=NONE

unset SLURM_EXPORT_ENV
module load vasp6/6.3.2-hybrid-intel-impi-AVX512

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo “OMP_NUM_THREADS=$OMP_NUM_THREADS”
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

srun /apps/vasp6/6.3.2-hybrid-intel-AVX512/bin/vasp_std >output_filename

Performance tests for VASP-6 on fritz

The calculations were performed using the binary file from module vasp6/6.3.2-hybrid-intel-impi-AVX512 for the ground state structure of sodium chloride, namely rocksalt, downloaded from The Materials Project. In order to enforce the same number of SCF iterations and ensure convergence, which in turn could be relevant to the tasks and calculations considered by VASP, we set NELMIN=26 and NELM=26.

System I:
- Single point calculations with PBE exchange-correlation functional
- Supercell containing 64 atoms
- 2x2x2 k-points
- ALGO=FAST, ENCUT=500, PREC=High, LREAL=Auto, LPLANE=True, NCORE=4, KPAR=4

Per-node speedup is defined as reference time divided by the product of the time of run and the number of nodes in each calculation i.e. T_ref /(T*nodes). T_ref is the time of calculations on one node with only MPI, higher is better.

System II:
- Single point calculations with PBE exchange-correlation functional
- Supercell containing 512 atoms
- No k-points
- ALGO=FAST, ENCUT=500, PREC=High, LREAL=Auto, LPLANE=True, NCORE=4

System III:
- Single point calculations with HSE06 exchange-correlation functional
- Supercell containing 64 atoms
- 2x2x2 k-points
- ALGO=Damped, TIME=0.4, ENCUT=500, PREC=High, LREAL=Auto, LPLANE=True, NCORE=4, KPAR=4
- Please note that in the hybrid OpenMP/MPI execution of VASP for HSE06 calculations, the default stack memory for OpenMP is insufficient and you should explicitly increase the value, otherwise your run might crash. The calculations in this section are run with “export OMP_STACKSIZE=500m” added to the submit script.

Further information

https://www.vasp.at/

Mentors

Dr. A. Ghasemi, NHR@FAU, hpc-support@fau.de
T. Klöffel, RRZE, hpc-support@fau.de
AG A. Görling (Chair of Theoretical Chemistry)

ANSYS Mechanical

ANSYS Mechanical is a computational structural mechanics software that makes it possible to solve structural engineering problems. It is available in two different software environments – ANSYS Workbench (the newer GUI-oriented environment) and ANSYS Mechanical APDL (sometimes called ANSYS Classic, the older MAPDL scripted environment).

Please note that the clusters do not come with any license. If you want to use ANSYS products on the HPC clusters, you have to have access to suitable licenses. These can be purchased directly from RRZE. To efficiently use the HPC resources, ANSYS HPC licenses are necessary.

Availability / Target HPC systems

Production jobs should be run on parallel HPC systems in batch mode. For simulations with high memory requirements, a single-node job on TinyFAT or woody can be used.

ANSYS Mechanical can also be used in interactive GUI mode via Workbench for serial pre-and/or post-processing on the login nodes. This should only be used to make quick simulation setup changes. It is NOT permitted to run computationally/memory-intensive ANSYS Mechanical simulations on login nodes.

Different versions of all ANSYS products are available via the modules system, which can be listed by module avail ansys. A special version can be loaded, e.g. by module load ansys/2022R2.

We mostly install the current versions automatically, but if something is missing, please contact hpc-support@fau.de.

Notes

Two different parallelization methods are available: shared-memory and distributed-memory parallelization.
Shared-memory parallelization: uses multiple cores on a single node; specify via ansys222 -smp -np N, default: N=2
Distributed-memory parallelization: uses multiple nodes; specify via ansys222 -dis -b -machines machine1:np:machine2:np:...

Sample job scripts

All job scripts have to contain the following information:

Resource definition for the queuing system (more details here)
Load ANSYS environment module
Generate a variable with the names of hosts of the current simulation run and specify the number of processes per host
Execute Mechanical with appropriate command line parameters (distributed memory run in batch mode)
Specify input and output file

#!/bin/bash -l
#SBATCH --job-name=ansys_mechanical
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV
# load environment module 
module load ansys/XXXX
# execute mechanical with command line parameters 
# Please insert here the correct version and your own input and output file with its correct name! 
ansysXXX -smp -np $SLURM_CPUS_PER_TASK < input.dat > output.out

#!/bin/bash -l
#SBATCH --job-name=ansys_mechanical
#SBATCH --nodes=2
#SBATCH --time=24:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV
# load environment module 
module load ansys/XXXX

# number of cores to use per node
PPN=20
# generate machine list, uses $PPN processes per node
NODELIST=$(for node in $( scontrol show hostnames $SLURM_JOB_NODELIST | uniq ); do echo -n "${node}:$PPN:"; done | sed 's/:$//')

# execute mechanical with command line parameters
# Please insert here the correct version and your own input and output file with its correct name!
ansysXXX -dis -b -machines $NODELIST < input.dat > output.out

Further information

Documentation is available within the application help manual. Further information is provided through the ANSYS Customer Portal for registered users.
More in-depth documentation is available at LRZ. Please note: not everything is directly applicable to HPC systems at RRZE!

Mentors

Dr.-Ing. Katrin Nusser, RRZE, hpc-support@fau.de
please volunteer!

IMD

IMD is a software package for classical molecular dynamics simulations. Several types of interactions are supported, such as central pair potentials, EAM potentials for metals, Stillinger-Weber and Tersoff potentials for covalent systems, and Gay-Berne potentials for liquid crystals. A rich choice of simulation options is available: different integrators for the simulation of the various thermodynamic ensembles, options that allow to shear and deform the sample during the simulation, and many more. There is no restriction on the number of particle types. (http://imd.itap.physik.uni-stuttgart.de/)

The latest versions of IMD are released under GPL-3.0.

Availability / Target HPC systems

IMD is currently not centrally installed but can be installed locally in the users’ home folders. Follow the instruction on http://imd.itap.physik.uni-stuttgart.de/userguide/compiling.html. While compiling at RRZE, first load the necessary modules (intel, intelmpi). It is recommended to clean the compilation before initiating a new compiling process, i.e. gmake clean. SpecifyIMDSYS=lima on any of RRZE’s clusters; however, only use the resulting binary on the cluster where you produced it, i.e. recompile again with IMDSYS=lima when moving to a different cluster.

If there is enough demand, RRZE might also provide a module for IMD.

Sample job scripts

#!/bin/bash -l
#
# allocate 4 nodes with 20 cores per node = 4*20 = 80 MPI tasks
#SBATCH --nodes=4
#SBATCH --tasks-per-node=20
#
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name 
#SBATCH --job-name=my-IMD
# do not export environment variables
#SBATCH --export=NONE
#
# first non-empty non-comment line ends SBATCH options

# do not export environment variables
unset SLURM_EXPORT_ENV
# jobs always start in submit directory

module load intel
module load intelmpi
# specify the full path of the IMD executable 
IMDCMD=$HOME/bin/imd_mpi_eam4point_fire_fnorm_homdef_stress_nbl_mono_hpo 

# input parameter file name 
PARAM=myJob.param 
# run 
srun $IMDCMD -p $PARAM

Further information

Mentors

hpc-support@fau.de
AG Bitzek (WW1 – I: General Materials Properties)

Quantum Espresso

Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.

Availability / Target HPC systems

parallel computers: main target machines
throughput cluster Woody: might be useful for small systems, manually distributed phonon calculations

Notes on parallelization in general

please note that QE has five command-line arguments that can be provided to the binary at run time: -nimage, -npools, -nband, -ntg, -ndiag (the shorthands, respectively: -ni, -nk, -nb, -nt, -nd). They can influence the run time considerably.
try to stick to one k-point / node
do not use Hyperthreading (disabled on most systems of NHR@FAU anyways)
- e.g. Emmy, OpenMPI (3.1): mpirun –report-bindings –bind-to core –map-by ppr:1:core
use image parallelization e.g. for NEB / phonon calculations via the use of “-ni”
ask for help with the parallelization of phonon calculation
use gamma point version (KPOINTS GAMMA) instead of KPOINTS AUTOMATIC
k-point parallelization
- 1 k-point per node .e.g. -nk #nnodes
- -nk must be a divisor of #MPI tasks
-nd for #bands > 500
-nt 2,5,10 as a last resort only, and if nr3 < #MPI tasks, nr3 is the third dimension of the FFT mesh

Sample job scripts

#!/bin/bash -l
#SBATCH –nodes=1
#SBATCH –ntasks-per-node=72
#SBATCH –partition=singlenode
#SBATCH –time=01:00:00
#SBATCH –export=NONE

unset SLURM_EXPORT_ENV
module load qe/7.1
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo “OMP_NUM_THREADS=$OMP_NUM_THREADS”

srun pw.x -i input.in >output_filename

#!/bin/bash -l
#SBATCH –nodes=8
#SBATCH –ntasks-per-node=4
#SBATCH –cpus-per-task=18
#SBATCH –partition=multinode
#SBATCH –time=01:00:00
#SBATCH –export=NONE

unset SLURM_EXPORT_ENV
module load qe/7.1

srun pw.x -i input.in >output_filename

Performance tests for Quantum Espresso 7.1 on fritz

We performed the calculations using the binary file from module qe/7.1 for the ground state structure of sodium chloride, namely rocksalt, downloaded from The Materials Project. All wave-function optimizations of our single-point runs were converged in 14 iterations without enforcing the number of SCF iterations. The calculations are performed at the level of PBE exchange-correlation functional with PAW file (downloaded from PseudoDojo ) which has nine valence electrons for sodium and seven for chlorine.

System:
- Single point calculations
- Supercell containing 512 atoms
- Gamma point k-points
- ecutwfc=36.0, ecutrho = 144.0, conv_thr = 1.0d-11, mixing_beta = 0.7
- None of the performance-related arguments (mentioned at the top of this page) was used. The program makes choices that may not be the most optimal ones. For example, the default choice of QE for our system regarding the sub-groups in the diagonalization was “scalapack distributed-memory algorithm (size of sub-group: 8* 8 procs)”, it is not an optimal setup. In our benchmark, we compare the relative run time for different combinations of MPI processes and OpenMP threads and a perfect choice of the QE performance parameters would be system dependent as well as being a complicated task. Therefore, we do not tune the performance-related parameters. Nevertheless, we encourage users to tune the five parameters in a production run, in particular, if it is a computationally demanding run or a large set of similar small-scale individual runs. Please note that the following graph should be considered as a qualitative behavior of the parallel performance of QE.

Further information

https://www.quantum-espresso.org/

Mentors

Dr. A. Ghasemi, NHR@FAU, hpc-support@fau.de
AG B. Meyer (Interdisciplinary Center for Molecular Materials)

CPMD

CPMD is a parallelized plane wave / pseudopotential implementation of Density Functional Theory, particularly designed for ab-initio molecular dynamics.

Availability / Target HPC systems

CPMD requires an individual license.

Notes

TBD

Sample job scripts

none yet; please volunteer!

Further information

http://www.cpmd.org/

Mentors

T. Klöffel, RRZE, hpc-support@fau.de
AG B. Meyer (Interdisciplinary Center for Molecular Materials)

WRF

WRF – Weather Research and Forecasting (WRF) is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting applications.

Availability / Target HPC systems

Meggie: wrf, ncl, nco, and ncview are available as modules.
Compilation of the packages and all their dependencies has been done in an automatized fashion using the SPACK framework, Intel compilers and Intel MPI.

Notes

WRF has been compiled in the MPI-only version (“dmpar”) using Intel MPI.

Sample job scripts

TBD

Further information

TBD

Mentors

AG Mölg (Professorship of Climatology, Dept. Geography, NatFak)

ANSYS Fluent

Fluent is a general-purpose Computational Fluid Dynamics (CFD) code developed by ANSYS. It is used for a wide range of engineering applications, as it provides a variety of physical models for turbulent flows, acoustics, Eulerian and Lagrangian multiphase flow modeling, radiation, combustion, and chemical reactions, and heat and mass transfer.

Availability / Target HPC systems

Different versions of all ANSYS products are available via the modules system, which can be listed by module avail ansys. A special version can be loaded, e.g. by module load ansys/2020R1.

We mostly install the current versions automatically, but if something is missing, please contact hpc-support@fau.de.

Production jobs should be run on the parallel HPC systems in batch mode.

ANSYS Fluent can also be used in interactive GUI mode for serial pre- and/or post-processing on the login nodes (Linux: SSH Option “-X”; Windows: using PuTTY and XMing for X11-forwarding). This should only be used to make quick simulation setup changes. However, most of these can also be done in batch mode, please refer to the documentation of the fluent-specific TUI (text user interface). Please be aware that ANSYS Fluent is loading the full mesh into the login node’s memory when you open a simulation file. You should do this only with comparable small cases. It is NOT permitted to run computationally intensive ANSYS Fluent simulation runs or serial/parallel post-processing sessions with large memory consumption on login nodes.

Alternatively, Fluent can be run interactively with GUI on TinyFat (for large main memory requirements) or on a compute node.

Getting started

The (graphical) Fluent launcher is started by typing

fluent

on the command line. Here, you have to specify the properties of the simulation run: 3D or 2D, single or double precision, meshing or solver mode, and serial or parallel mode. When using Fluent in a batch job, all these properties have to be specified on the command line, e.g.

fluent 3ddp -g -t 20 -cnf="$NODELIST"

This launches a 3D, double-precision simulation. For a 2D, single-precision simulation 2dsp has to be specified. By using the -g option, no GUI or graphics are launched. If your simulation should produce graphical output, e.g. plot of convergence history in PNG or JPG format, -gu -driver null has to be used instead.

The number of processes is defined by the -t option. This number corresponds to the number of physical CPU cores that should be used. Using also SMT threads is not recommended. The hostnames of the compute nodes and the number of processes to be launched on each node have to be specified in a host list via the -cnf option. Please refer to the sample script below for more information.

For more information about the available parameters, use fluent -help.

Journal files

In contrast to ANSYS CFX and other simulation tools, submitting the .cas file is not sufficient to run a simulation on a parallel cluster. For a proper simulation run using a batch job, a simple journal file (.jou) is required to specify the specific solution steps.

Such a basic journal file contains a number of so-called TUI commands to ANSYS Fluent (TUI = Text User Interface). Details these commands can be found in the ANSYS Fluent documentation, Part II: Solution Mode; Chapter 2: Text User Interface (TUI).

Every configuration that is done in the GUI also has a corresponding TUI command. You can, therefore, change the configuration of the simulation during the simulation run, for example by adjusting the solution time step after a specified number of iterations. A simple example journal file for a steady-state simulation is given below. Please note that running a transient simulation would require different commands for time integration. The same applies when re-starting the simulation from a previous run or initialization.

The journal file has to be specified at the time of the application launch with -i <journal-file>.

Notes

ANSYS Fluent does not consist of different pre-, solver, and postprocessing applications as e.g. ANSYS CFX. Everything is included in one single-windowed GUI.
The in-build Fluent post-processing can also be run in parallel mode. Normally, much fewer processes than for simulation runs are needed. However, do not use this on the login nodes!
We recommend writing automatic backup files (every 6 to 12 hours) for longer runs to be able to restart the simulation in case of a job or machine failure. This can be specified in ANSYS Fluent under Solution → Calculation Activities → Autosave Every Iterations.
Fluent cannot stop a simulation based on elapsed time. Therefore, you have to estimate the number of iterations which will fit into your desired runtime. The above auto-save can also be useful as a precaution. Also plan enough buffer time for writing the final output, depending on your application, this can take quite a long time!
Please note that for some versions (<2023R2), the default (Intel) MPI startup mechanism is not working on meggie and fritz. This will lead to the solver hanging without producing any output. Use the option -mpi=openmpi to prevent this.
GPU support: since porting of functionalities to GPU is still ongoing, always use the newest Ansys version available! In initial benchmarks, a ratio of 1:1 for number of GPUs to CPU processes was found to be ideal.

Sample job scripts

All job scripts have to contain the following information:

Resource definition for the queuing system (more details here)
Load ANSYS environment module
Generate a file with names of hosts of the current simulation run to tell Fluent on which nodes it should run (see example below)
Execute fluent with appropriate command line parameters (available options via fluent -help)
Specify ANSYS Fluent journal file (*.jou) as input; this is used to control the execution of the simulation since *.cas files do not contain any solver control information

#!/bin/bash -l
#SBATCH --job-name=myfluent
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --time=24:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# load environment module 
module load ansys/XXXX 

# generate node list 
NODELIST=$(for node in $( scontrol show hostnames $SLURM_JOB_NODELIST | uniq ); do echo -n "${node}:${SLURM_NTASKS_PER_NODE},"; done | sed 's/,$//')
# calculate the number of cores actually used 
CORES=$(( ${SLURM_JOB_NUM_NODES} * ${SLURM_NTASKS_PER_NODE} )) 

# execute fluent with command line parameters (in this case: 3D, double precision) 
# Please insert here your own .jou and .out file with their correct names! 
fluent 3ddp -g -t ${CORES} -mpi=openmpi -cnf="$NODELIST" -i fluent_batch.jou > outfile.out

#!/bin/bash -l
#SBATCH --job-name=myfluent
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=72
#SBATCH --time=24:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# load environment module 
module load ansys/XXXX 

# generate node list 
NODELIST=$(for node in $( scontrol show hostnames $SLURM_JOB_NODELIST | uniq ); do echo -n "${node}:${SLURM_NTASKS_PER_NODE},"; done | sed 's/,$//')
# calculate the number of cores actually used 
CORES=$(( ${SLURM_JOB_NUM_NODES} * ${SLURM_NTASKS_PER_NODE} )) 

# execute fluent with command line parameters (in this case: 3D, double precision) 
# Please insert here your own .jou and .out file with their correct names! 
fluent 3ddp -g -t ${CORES} -mpi=openmpi -cnf="$NODELIST" -i fluent_batch.jou > outfile.out

#!/bin/bash -l
#SBATCH --job-name=myfluent
#SBATCH --gres=gpu:a100:2
#SBATCH --time=24:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# load environment module 
module load ansys/2023R2


# execute fluent with command line parameters (in this case: 3D, double precision) 
# Please insert here your own .jou and .out file with their correct names! 
fluent 3ddp -g -t ${SLURM_GPUS_ON_NODE} -gpu -i fluent_batch.jou > outfile.out

;feel free to modify all subsequent lines to adapt them to your application case
;read case file
/file/read-case "./example-case.cas"

;initialization and start of steady state simulation

/solve/initialize/hyb-initialization
(format-time #f #f)
/solve/iterate 100
(format-time #f #f)

;write final output and exit
/file/write-case-data "./example-case-final.cas"

exit y

Further information

Documentation is available within the application help manual. Further information is provided through the ANSYS Customer Portal for registered users.
More in-depth documentation is available at LRZ. Please note: not everything is directly applicable to HPC systems at RRZE!

Mentors

Dr.-Ing. Katrin Nusser, RRZE, hpc-support@fau.de
please volunteer!

Matlab

MATLAB is a commercial software developed by MathWorks. It is used to solve mathematical problems and to visualize the results. It is mainly used for numerical calculations based on matrices.

Please note that the clusters do not come with any license. It also cannot be used with a personal license via the Matlab Campusvertrag. If you want to use Matlab, network licenses have to be activated for your chair.

Availability / Target HPC systems

Matlab can run either on a single CPU or on a single node by using multi-threading. Runs with more than one node are currently not supported.

For standalone-simulations, the following HPC systems are best suited:

throughput cluster Woody: best suited for smaller calculations
TinyFat: for calculations with large memory requirements

However, the best choice for a target HPC systems depends also on the location of your input data. For example, if you want to analyze large datasets which were generated by another simulation on meggie and are stored on its parallel file system, you should also use meggie for your Matlab simulations to avoid copying data.

Different versions of Matlab are available via the modules system. They may also vary between the clusters.

If you can’t see the modules but want to use them, please contact hpc-support@fau.de for activation.

Notes

If possible, run your calculation as a batch job (see example script below).
MATLAB can also be run interactively via GUI or command line. You can use an interactive job on the compute nodes for this.
Please do not use login nodes for production jobs!

Sample job scripts

#!/bin/bash -l

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=<number of cores>
#SBATCH --time=10:00:00 
#SBATCH --job-name=matlab 
#SBATCH --export=NONE 
unset SLURM_EXPORT_ENV

module load matlab/R201xx
matlab -nojvm -nodisplay -nosplash < my_matlab_script.m

Further information

https://www.mathworks.com/help/matlab/

https://de.mathworks.com/matlabcentral/

Mentors

please volunteer!

ORCA

ORCA is an ab initio quantum chemistry program package that contains modern electronic structure methods including density functional theory, many-body perturbation, coupled cluster, multireference methods, and semi-empirical quantum chemistry methods. Its main field of application is larger molecules, transition metal complexes, and their spectroscopic properties.

ORCA requires a license per individual or research group (cf. https://cec.mpg.de/orcadownload/ or the ORCA forum https://orcaforum.kofo.mpg.de/). Once you can proof that you are eligible, contact hpc-support@fau.de for activation of the ORCA module.

Availability / Target HPC systems

throughput cluster Woody and TinyFAT
owing to its limited scalability, ORCA is not suited for the parallel computers

New versions of ORCA are installed by RRZE upon request with low priority if the users provide the installation files.

Notes

orca has to be called with the full path otherwise parallel runs may fail.
The orca module will take care of loading an appropriate openmpi module, too.
ORCA often results in massive IO (“communication through files??”); thus, put temporary files into /dev/shm (RAM disk) or local scratch directory.

Sample job scripts

#!/bin/bash -l
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

cd $SLURM_SUBMIT_DIR

module add orca/5.0.3

### No mpirun required as ORCA starts the parallel processes internally as needed. 
### The number of processes is specified in the input file using '%pal nprocs # end' 

${ORCABASE}/orca orca.inp "optional openmpi arguments"

Further information

https://orcaforum.kofo.mpg.de/
note in the ORCA forum on improving MKL performance on AMD Epyc processors: https://orcaforum.kofo.mpg.de/viewtopic.php?f=8&t=3340&hilit=mkl&start=20
We recommend to not only set MKL_DEBUG_CPU_TYPE=5 but to also set MKL_CBWR=AUTO as environment variables (as long as ORCA still uses Intel MKL versions before 2020.1; these environment variables no longer work with newer MKL versions)

Mentors

please volunteer!

Index

Access, User Environment, and File Systems

Access to the machine

File Systems

Batch processing

Availability / Target HPC systems

Notes

Sample job scripts

parallel Intel MPI job on Meggie

Hybrid OpenMP/MPI job (multi-node) on Fritz

Performance tests for VASP-6 on fritz

Further information

Mentors

Availability / Target HPC systems

Notes

Sample job scripts

shared parallel job on woody

distributed parallel job on meggie

Further information

Mentors

Availability / Target HPC systems

Sample job scripts

parallel IMD job on Meggie

Further information

Mentors

Availability / Target HPC systems

Notes on parallelization in general

Sample job scripts

MPI job (single-node) on Fritz

Hybrid OpenMP/MPI job (multi-node) on Fritz

Performance tests for Quantum Espresso 7.1 on fritz

Further information

Mentors

Availability / Target HPC systems

Notes

Sample job scripts

Further information

Mentors

Availability / Target HPC systems

Notes

Sample job scripts

Further information

Mentors

Availability / Target HPC systems

Getting started

Journal files

Notes

Sample job scripts

parallel job on meggie

parallel job on fritz

GPU job on alex

example journal file for steady-state simulation

Further information

Mentors

Availability / Target HPC systems

Notes

Sample job scripts

serial job on Woody

Further information

Mentors

Availability / Target HPC systems

Notes

Sample job scripts

parallel orca on a Woody node

Further information

Mentors