TinyFat#

Memoryhog and the TinyFat cluster are intended for running serial or moderately parallel (OpenMP) applications that require large amounts of memory in one machine.

Hostnames	# nodes	CPUs and # cores per node	main memory per node	node-local SSD	Slurm partition
`memoryhog`	1	2 x Intel Xeon Platinum 8360Y ("Ice Lake"), 72 cores/144 threads @2.4GHz	2 TB	n/a	interactively accessible without batch job
`tf04x`	3	2 x Intel Xeon E5-2680 v4 ("Broadwell"), 28 cores/56 threads @2.4 GHz	512 GB	1 TB	`broadwell512`
`tf05x`	8	2 x Intel Xeon E5-2643 v4 ("Broadwell"), 12 cores/24 threads @3.4 GHz	256 GB	1 TB	`broadwell256`, `long256`
`tf06x`-`tf09x`	36	2 x AMD EPYC 7502 ("Rome", "Zen2"), 64 cores/128 threads	512 GB	3.5 TB	`work`

All nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.

Access to the machines#

TinyFat is only available to accounts that are part of the "Tier3 Grundversorgung", not to NHR project accounts.

To set up your SSH connection, see the documentation on configuring connection settings or on SSH in general.

Once SSH is configured, the shared frontend node for TinyGPU and TinyFat can be reached with:

ssh tinyx.nhr.fau.de

Software#

TinyFat runs Ubuntu 20.04 LTS.

All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.

For available software see:

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.

Containers, e.g. Docker, are supported via Apptainer.

Python, conda, conda environments#

Through the python module, an Anaconda installation is available. See our Python documentation for usage, initialization, and working with conda environments.

Compiler#

For a general overview of compilers, optimization flags, and targeting a certain CPU micro-architecture, see the compiler documentation.

The CPU types on the frontend node and in the partitions are different. The compilation flags should be adjusted according to the partition you plan to run your code on. See the following table for details.

On nodes of the work partition, non-optimal code might be generated for the AMD processors when Intel compilers with -march=native or -xHost are used.

Software compiled specifically for Intel processors might not run on the work partition, since the nodes have AMD CPUs.

The following table shows the compiler flags for targeting TinyFat's CPUs:

partition	microarchitecture	GCC/LLVM	Intel oneAPI/Classic
all	Zen2, Broadwell	`-mavx2 -mfma` or `-march=x86-64-v3`	`-mavx2 -mfma`
`work`	Zen2	`-march=znver2`	`-mavx2 -mfma`
`broadwell*`, `long256`	Broadwell	`-march=broadwell`	`-march=broadwell`
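
As a rough illustration of the table above, the following sketch maps a partition name to the GCC/LLVM flag listed for it. The function name `gcc_arch_flag` is hypothetical and not part of any NHR@FAU tooling; unknown partitions fall back to the portable `-march=x86-64-v3` setting from the `all` row.

```shell
# Hypothetical helper: print the GCC/LLVM architecture flag for a partition,
# following the compiler flag table above.
gcc_arch_flag() {
    case "$1" in
        work)                echo "-march=znver2" ;;
        broadwell*|long256)  echo "-march=broadwell" ;;
        *)                   echo "-march=x86-64-v3" ;;   # portable fallback
    esac
}

# Example use in a build (file names are placeholders):
# gcc -O3 $(gcc_arch_flag work) -o application application.c
echo "work:    $(gcc_arch_flag work)"
echo "long256: $(gcc_arch_flag long256)"
```

Binaries built with the fallback flag run on all TinyFat partitions, while the partition-specific flags tie the binary to that CPU type.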

Filesystems#

On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted. For details see the filesystems documentation.

Node local SSD `$TMPDIR`#

Data stored on $TMPDIR will be deleted when the job ends.

Each cluster node has a local SSD that is reachable under $TMPDIR.

For more information on how to use $TMPDIR see:

general documentation of $TMPDIR,
staging data, e.g. to speed up training,
sharing data among jobs on a node.

The capacity is 1 TB for broadwell* and long256 partition nodes and 3.5 TB for work partition nodes. The storage space of the SSD is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
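
A minimal staging sketch, assuming an input file and a results directory on a shared filesystem such as `$WORK`. The function name `stage_and_run` and all file names are hypothetical placeholders, and the line count stands in for the real application. Because `$TMPDIR` is wiped when the job ends, results are copied back before the function returns.

```shell
# Sketch of staging via the node-local SSD. All names are placeholders.
# Outside of a job $TMPDIR may be unset, so /tmp is used for dry runs.
stage_and_run() {
    input=$1      # input file on a shared filesystem, e.g. under $WORK
    outdir=$2     # directory on a shared filesystem for the results
    scratch="${TMPDIR:-/tmp}/stage.$$"

    mkdir -p "$scratch" "$outdir" || return 1
    cp "$input" "$scratch/" || return 1               # stage in
    (
        cd "$scratch" || exit 1
        # Placeholder for the real application: count the input lines.
        wc -l < "$(basename "$input")" > result.txt
    ) || return 1
    cp "$scratch/result.txt" "$outdir/" || return 1   # stage out before job end
    rm -rf "$scratch"                                 # free the shared SSD
}

# Example call inside a job script (paths are placeholders):
# stage_and_run "$WORK/input.dat" "$WORK/results"
```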

Batch processing#

Resources are controlled through the batch system Slurm.

The only exception is memoryhog, which can be used interactively without a batch job. Every HPC user can log in directly to memoryhog.rrze.fau.de to run their memory-intensive workloads.

Slurm commands are suffixed with `.tinyfat`#

The front end node tinyx.nhr.fau.de serves both the TinyGPU and the TinyFat cluster. To distinguish which cluster is targeted when a Slurm command is used, Slurm commands for TinyFat have the .tinyfat suffix.

This means instead of using:

srun use srun.tinyfat
salloc use salloc.tinyfat
sbatch use sbatch.tinyfat
sinfo use sinfo.tinyfat

These commands are equivalent to the unsuffixed Slurm commands called with the option --clusters=tinyfat.

When resubmitting jobs from TinyFat's compute nodes themselves, only use sbatch, i.e. without the .tinyfat suffix.
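
The two rules above can be folded into a small helper. This is only a sketch: the function name `tinyfat_cmd` is made up, and it assumes that a set `SLURM_JOB_ID` variable (which Slurm exports inside jobs) is a good enough indicator of running on a compute node.

```shell
# Hypothetical helper: print the Slurm command name to use for TinyFat.
# Inside a job (SLURM_JOB_ID set) the plain command is used;
# on the front end the .tinyfat suffix is appended.
tinyfat_cmd() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "$1"
    else
        echo "$1.tinyfat"
    fi
}

# Example: submit a (placeholder) job script from wherever this runs.
# "$(tinyfat_cmd sbatch)" job.sh
echo "would call: $(tinyfat_cmd sbatch)"
```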

Partitions#

Only single node jobs are allowed.

Compute nodes in the broadwell* and long256 partitions are allocated exclusively.

Compute nodes in the work partition are shared; however, requested resources are always granted exclusively. Batch allocations are made at the granularity of individual cores. For each requested core, 8 GB of main memory is allocated. If your application needs more memory, use the option --mem=<memory in MByte>. To request a whole node exclusively, use the --exclusive option.
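
To make the memory rule concrete, the sketch below computes the default allocation for a given core count and the `--mem` value for a larger request. The helper names are hypothetical, and converting GB to MByte with a factor of 1024 is an assumption about how the limit is counted.

```shell
# Memory granted by default on the work partition: 8 GB per requested core.
default_mem_gb() {
    echo $(( $1 * 8 ))
}

# Build the --mem option for a request given in GB.
# Assumption: 1 GB is counted as 1024 MByte.
mem_option() {
    echo "--mem=$(( $1 * 1024 ))"
}

echo "10 cores get $(default_mem_gb 10) GB by default"
echo "to request 200 GB instead: $(mem_option 200)"
```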

Partition	min – max walltime	min – max cores	exclusivity	memory per node	Slurm options
`work`	0 – 2:00:00 (1)	1 – 64	shared nodes	512 GB
`work` (default)	0 – 24:00:00	1 – 64	shared nodes	512 GB
`broadwell256`	0 – 24:00:00	12	exclusive nodes	256 GB	`-p broadwell256`
`broadwell512`	0 – 24:00:00	28	exclusive nodes	512 GB	`-p broadwell512`
`long256`	0 – 60:00:00	12	exclusive nodes	256 GB	`-p long256`

(1) nodes reserved for short jobs, assigned automatically
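
The walltime limits in the table can be checked before submitting. The following is a sketch with a hypothetical helper name, `max_walltime_hours`, and an arbitrary example requirement of 36 hours.

```shell
# Hypothetical helper: maximum walltime in hours per partition,
# taken from the partition table above.
max_walltime_hours() {
    case "$1" in
        work|broadwell256|broadwell512) echo 24 ;;
        long256)                        echo 60 ;;
        *)                              return 1 ;;   # unknown partition
    esac
}

# Example: fall back to long256 only when the job exceeds the work limit.
# Note that long256 nodes offer 12 cores and 256 GB.
hours_needed=36
if [ "$hours_needed" -gt "$(max_walltime_hours work)" ]; then
    echo "use: -p long256"
else
    echo "use the default work partition"
fi
```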

All nodes have SMT (a.k.a. hardware threads or hyper-threading) enabled. By default, only one task per physical core is scheduled. To use SMT, you have to specify --hint=multithread. See the batch job script examples below.

Using SMT / Hyperthreads#

Most modern architectures offer simultaneous multithreading (SMT), where each physical core of a CPU is presented as several virtual cores (a.k.a. hardware threads). This technique allows two instruction streams to run in parallel on one physical core.

On all TinyFat nodes, SMT is available. When specifying --cpus-per-task (e.g. for OpenMP jobs), SMT threads are automatically used. If you do not wish to use SMT threads but only physical cores, add the option --hint=nomultithread to sbatch, srun or salloc or use #SBATCH --hint=nomultithread inside your job script.

Pure MPI jobs automatically do not use SMT threads.

Interactive jobs#

Interactive jobs can be requested by using salloc.tinyfat instead of sbatch.tinyfat and specifying the respective options on the command line.

The environment from the calling shell, like loaded modules, will be inherited by the interactive job.

Interactive job (single-core)#

The following will give you an interactive shell on one node with one core and 8 GB RAM dedicated to you for one hour:

salloc.tinyfat -n 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Interactive job (multiple cores)#

The following will give you an interactive shell on one node with 10 physical cores and 80 GB RAM dedicated to you for one hour:

salloc.tinyfat --cpus-per-task=10 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Batch job script examples#

Serial job (single-core)#

In this example, the executable will be run using a single core for a total job walltime of 1 hour.

#!/bin/bash -l
#
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

./application

MPI parallel job (single-node)#

In this example, the executable will be run using 2 MPI processes. Each process is running on a physical core and SMT threads are not used.

#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --partition=work
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

srun --mpi=pmi2 ./application

Hybrid MPI/OpenMP (single-node)#

Warning

In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, leading to errors in the application start. This value has to be set manually via the variable SRUN_CPUS_PER_TASK.

In this example, the executable will be run on one node using 2 MPI processes with 8 OpenMP threads each (i.e. one per physical core) for a total job walltime of 6 hours. 16 cores are allocated in total and each OpenMP thread is running on a physical core. Hyperthreads are not used.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --time=6:00:00
#SBATCH --hint=nomultithread
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./hybrid_application

OpenMP job#

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved with the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

In this example, the executable will be run using 6 OpenMP threads (i.e. one per physical core) for a total job walltime of 4 hours.

#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
# do not use SMT threads 
#SBATCH --hint=nomultithread 
#SBATCH --time=4:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 

./application
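
The pinning variables mentioned at the start of this section are not set in the example script. Here is a sketch of how they could be exported before the application is started. The fallback to 6 threads is an assumption that only exists so the snippet also runs outside of a job.

```shell
# Pin OpenMP threads: one place per physical core, threads stay on their place.
export OMP_PLACES=cores
export OMP_PROC_BIND=true
# Inside a job Slurm sets SLURM_CPUS_PER_TASK; use 6 as a stand-in otherwise.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-6}"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS OMP_PLACES=$OMP_PLACES OMP_PROC_BIND=$OMP_PROC_BIND"
```

In a job script these lines would go after `unset SLURM_EXPORT_ENV` and before the call to the application.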

Attach to a running job#

See the general documentation on batch processing.
