TinyGPU cluster

The TinyGPU cluster is one of a group of small special-purpose clusters. TinyGPU is intended for running applications that utilize GPU accelerators. In general, the documentation for Woody applies; this page only lists the differences.

There are a number of different machines with different types of GPUs (mostly of consumer type) in TinyGPU:

Hostnames

tg031-tg037 (not all nodes generally available)
Nodes (GPUs): 7 (28)
CPUs and main memory: 2x Intel Xeon E5-2620v4 (“Broadwell”) @ 2.1 GHz = 16 cores, SMT off; 64 GB RAM
GPUs: 4x NVIDIA GTX 1080 (8 GB memory)
Local storage: SATA SSD (880 GB), available under /scratchssd
Properties: :gtx1080, :anygtx, :any1080, :cuda9, :cuda10

tg040-tg049 (not all nodes generally available)
Nodes (GPUs): 10 (40)
CPUs and main memory: 2x Intel Xeon E5-2620v4 (“Broadwell”) @ 2.1 GHz = 16 cores, SMT off; 64 GB RAM (one node has 128 GB RAM)
GPUs: 4x NVIDIA GTX 1080 Ti (11 GB memory)
Local storage: SATA SSD (1.8 TB), available under /scratchssd
Properties: :gtx1080ti, :anygtx, :any1080, :cuda9, :cuda10

tg060-tg06b (not all nodes generally available)
Nodes (GPUs): 12 (48)
CPUs and main memory: 2x Intel Xeon Gold 6134 (“Skylake”) @ 3.2 GHz = 16 cores with optional SMT; 96 GB RAM
GPUs: 4x NVIDIA RTX 2080 Ti (11 GB memory)
Local storage: SATA SSD (1.8 TB), available under /scratchssd
Properties: :rtx2080ti, :anyrtx, :cuda10
Note: use :smt to add the SMT threads to the cpuset of the job; the ppn value has to remain unchanged and specifies the number of physical cores only.

tg071-tg074 (not always generally available)
Nodes (GPUs): 4 (16)
CPUs and main memory: 2x Intel Xeon Gold 6134 (“Skylake”) @ 3.2 GHz = 16 cores with optional SMT; 96 GB RAM
GPUs: 4x NVIDIA Tesla V100 (32 GB memory)
Local storage: NVMe SSD (2.9 TB), available under /scratchssd
Properties: :v100, :anytesla, :cuda9, :cuda10
Note: use :smt to add the SMT threads to the cpuset of the job; the ppn value has to remain unchanged and specifies the number of physical cores only.

tg080-tg086 (not always generally available)
Nodes (GPUs): 7 (56)
CPUs and main memory: 2x Intel Xeon Gold 6226R @ 2.9 GHz = 32 cores with optional SMT; 384 GB RAM
GPUs: 8x NVIDIA GeForce RTX 3080 (10 GB memory; in PCIe3 x16 slots)
Local storage: SATA SSD (3.8 TB), available under /scratchssd
Note: these nodes use Slurm as batch queuing system and already run Ubuntu 20.04.

tg090-tg097 (not always generally available)
Nodes (GPUs): 8 (32)
CPUs and main memory: 2x AMD Rome 7662 @ 2.0 GHz = 128 cores, SMT off; 512 GB RAM
GPUs: 4x NVIDIA A100 SXM4/NVLink (40 GB memory)
Local storage: NVMe SSD (5.8 TB), available under /scratchssd
Note: these nodes use Slurm as batch queuing system and already run Ubuntu 20.04.

45 out of the 48 nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.

Access, User Environment, and File Systems

Access to the machines

TinyGPU does not have its own frontend node. Access to the system is granted through the woody frontend nodes via ssh. Please connect to

woody.rrze.uni-erlangen.de

and you will be routed to one of the frontends. While it is possible to ssh directly to a compute node, a user is only allowed to do this when they have a batch job running there. When all batch jobs of a user on a node have ended, all of their shells will be killed automatically.
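
For example, a login from the command line could look like this (the user name is a placeholder for your HPC account):

ssh yourusername@woody.rrze.uni-erlangen.de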

File Systems

The following table summarizes the available file systems and their features. Also, check the description of the HPC file systems.

File system overview for the Woody cluster
Mount point | Access via | Purpose | Technology, size | Backup | Data lifetime | Quota
/home/hpc | $HOME | Storage of source, input, important results | central servers | YES + Snapshots | Account lifetime | YES (restrictive)
/home/vault | $HPCVAULT | Mid- to long-term storage | central servers | YES + Snapshots | Account lifetime | YES
/home/woody | $WORK | Storage for small files | NFS | limited | Account lifetime | YES
/scratchssd | $TMPDIR | Temporary job data directory | node-local SSD, between 880 GB and 5.8 TB | NO | Job runtime | NO

Node-local storage $TMPDIR

Each node has at least 880 GB of local SSD capacity for temporary files available under $TMPDIR (also accessible via /scratchssd). All files in these directories will be deleted at the end of a job without any notification. Important data to be kept can be copied to a cluster-wide volume at the end of the job, even if the job is canceled by a time limit.

Please only use the node-local SSDs if you can really profit from them: like all consumer SSDs, they only support a limited number of writes, so by writing to them you “use them up”.
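
A minimal sketch of how a job script might use the node-local SSD (file and program names are placeholders):

cd $TMPDIR                     # work on the fast node-local SSD
cp $WORK/input.dat .           # stage input data from a cluster-wide volume
./my_program input.dat         # run the application (placeholder)
cp results.dat $WORK/          # copy results back before the job ends; $TMPDIR is wiped afterwards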

Batch Processing

All user jobs must be submitted to the cluster by means of the batch system. The submitted jobs are routed into a number of queues (depending on the requested resources, e.g. runtime) and sorted according to a priority scheme.

The nodes on TinyGPU currently use two different batch systems: tg03x, tg04x, tg06x, and tg07x still use Torque, whereas the newer nodes (tg08x, tg09x) have been switched to Slurm. Please see the batch system description for general information about the two batch systems. In the following, only the features specific to TinyGPU will be described.

Torque

To submit batch jobs to TinyGPU, you need to use qsub.tinygpu instead of the normal qsub-command on the Woody frontends.

If you do not request a specific GPU type, your job will run on any available node. If you want to request a specific GPU type, use the properties stated in the table above. For a more general selection, there are :anygtx, :any1080, and :anytesla (their meaning should be obvious) as well as :cuda9 and :cuda10, which indicate the supported CUDA versions. Thus, e.g. qsub.tinygpu -l nodes=1:ppn=4:any1080 [...].
You may request parts of a node, e.g. if you only need one GPU. For every 4 cores you request, you are also assigned one GPU. For obvious reasons, you are only allowed to request multiples of 4 cores. As an example, if you request qsub.tinygpu -l nodes=1:ppn=8:gtx1080 [...] you will get 8 cores, 2 of the GTX 1080 GPUs, and half of the main memory of one of the tg03x nodes. There is no dedicated share of the local HDD/SSD assigned.
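
A complete submission requesting a full GTX 1080 node (all 16 cores and thus all 4 GPUs) for 6 hours might look like this (a sketch; the job script name is a placeholder):

qsub.tinygpu -l nodes=1:ppn=16:gtx1080,walltime=06:00:00 job_script.sh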

Properties can also be used to request a certain CPU clock frequency. This is not something you will usually want to do, but it can be useful for certain kinds of benchmarking. Note that you cannot make the CPUs go any faster, only slower, as the default already is the turbo mode, which clocks the CPU as fast as it can (up to 3.2 GHz, depending on the requested configuration) without exceeding its thermal or power budget. So please do not use any of the following options unless you know what you’re doing. The available options are :noturbo to disable turbo mode and :fX.X to request a specific frequency; the available frequencies vary between the different node types.
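
As a sketch, a benchmark job could disable turbo mode on a full GTX 1080 node like this:

qsub.tinygpu -l nodes=1:ppn=16:gtx1080:noturbo [...]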

To request access to the hardware performance counters (i.e. to use likwid-perfctr), you have to add the property :likwid and request the full node. Otherwise, you will get the error message Access to performance monitoring registers locked from likwid-perfctr. The property is not required (and should also not be used) for other parts of the LIKWID suite, e.g. it is not required for likwid-pin.
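
For example, a full RTX 2080 Ti node with access to the performance counters could be requested like this (a sketch):

qsub.tinygpu -l nodes=1:ppn=16:rtx2080ti:likwid [...]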

To check the status of your jobs, use qstat.tinygpu instead of the normal qstat.

Slurm

Command wrappers similar to those for Torque also exist for Slurm. This means that jobs can be submitted from the woody frontend via sbatch.tinygpu. Other examples are srun.tinygpu, salloc.tinygpu, sinfo.tinygpu, and squeue.tinygpu. These commands are equivalent to using the option --clusters=tinygpu. When resubmitting jobs from the compute nodes themselves, use the commands without the .tinygpu suffix.
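
For example, the following two submissions from the woody frontend are equivalent (the job script name is a placeholder):

sbatch.tinygpu job_script.sh
sbatch --clusters=tinygpu job_script.sh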

In contrast to other clusters, the compute nodes are not allocated exclusively but are shared among several jobs; the GPUs, however, are granted exclusively. Resources are therefore allocated on a per-GPU basis. This means that it is sufficient to request the number of GPUs you want to use, e.g. with --gres=gpu:1. The corresponding amount of cores and memory is allocated automatically. To request a specific GPU type, use e.g. --gres=gpu:rtx3080:1. For the A100 nodes, you additionally have to specify --partition=a100 and use --gres=gpu:a100:1. If you do not request a specific GPU type, your job will run on any available node in the work partition. Jobs that do not request at least one GPU will be rejected by the scheduler. There is no “devel” queue available on TinyGPU. The maximum walltime any job can request is 24 hours.

We recommend always using srun instead of mpirun or mpiexec to start your parallel application, since it automatically uses the allocated resources (number of tasks, cores per task, …) and also binds the tasks to the allocated cores. If you have to use mpirun, make sure to check that the binding of your processes is correct (e.g. with --report-bindings for Open MPI and export I_MPI_DEBUG=4 for Intel MPI). OpenMP threads are not automatically pinned to specific cores; for the application to run efficiently, this has to be done manually.
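
A quick sanity check of the process binding might look like this (a sketch; the executable name is a placeholder):

# Open MPI: report the core binding of every rank
mpirun --report-bindings ./executable.exe
# Intel MPI: print pinning information at startup
export I_MPI_DEBUG=4
mpirun ./executable.exe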

The Slurm compute nodes are already running on a newer Ubuntu version than the other nodes and the woody frontend. This might cause problems since Slurm automatically exports the environment of the submit host (woody) to the job. Therefore, we recommend using the sbatch option --export=none to prevent this export. Additionally, unset SLURM_EXPORT_ENV has to be called before srun to ensure that it is executed correctly. Both options are already included in the example scripts below.

Example Slurm Batch Scripts

For the most common use cases, examples are provided below.

In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. The job allocates one GPU and the corresponding share of CPUs and main memory (8 cores including SMT and 48 GB RAM in the case of an RTX 3080 node; 32 cores without SMT and 128 GB RAM in the case of an A100 node).

#!/bin/bash -l
#
# start 2 MPI processes
#SBATCH --ntasks=2
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# allocate one GPU (type not specified)
#SBATCH --gres=gpu:1
# job name 
#SBATCH --job-name=Testjob
# do not export environment variables
#SBATCH --export=NONE

# do not export environment variables
unset SLURM_EXPORT_ENV

srun --mpi=pmi2 ./executable.exe

In this example, one A100 GPU is allocated. The executable will be run using 2 MPI processes with 16 OpenMP threads each for a total job walltime of 6 hours. 32 cores are allocated in total and each OpenMP thread is running on a physical core.

For a more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by setting the following environment variables: OMP_PLACES=cores and OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
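
In a job script, this can be done right before the srun call, for example:

export OMP_PLACES=cores
export OMP_PROC_BIND=true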

#!/bin/bash -l
#
# start 2 MPI processes
#SBATCH --ntasks=2
# request 16 OpenMP threads per MPI task
#SBATCH --cpus-per-task=16
# allocate one A100 GPU
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name 
#SBATCH --job-name=Testjob
# do not export environment variables
#SBATCH --export=NONE

# do not export environment variables
unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./executable_hybrid.exe

In this example, the executable will be run using 16 OpenMP threads for a total job walltime of 6 hours. One RTX 3080 GPU and the corresponding 16 cores (including SMT) are allocated in total.

For a more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by setting the following environment variables: OMP_PLACES=cores and OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#
# requests 16 OpenMP threads
#SBATCH --cpus-per-task=16
# allocate one RTX 3080 GPU
#SBATCH --gres=gpu:rtx3080:1
# allocate nodes for 6 hours 
#SBATCH --time=06:00:00 
# job name
#SBATCH --job-name=Testjob
# do not export environment variables 
#SBATCH --export=NONE 

# do not export environment variables 
unset SLURM_EXPORT_ENV 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 ./executable_hybrid.exe

Interactive Slurm Shell

To generate an interactive Slurm shell on one of the compute nodes, the following command has to be issued on the woody frontend:
salloc --cpus-per-task=10 --gres=gpu:1 --time=00:30:00
This will give you an interactive shell for 30 minutes on one of the nodes, allocating 10 physical cores and 80000 MB of memory. There, you can then, for example, start the execution of a shared-memory parallel binary:
srun ./my_shared_memory_program.exe
The binary is executed using the resources allocated above. Additional tuning and resource settings (e.g. OpenMP environment variables and pinning) have to be performed before issuing the srun command.
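
For example (a sketch matching the 10 cores allocated above):

export OMP_NUM_THREADS=10
export OMP_PLACES=cores
export OMP_PROC_BIND=true
srun ./my_shared_memory_program.exe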

 

Software

The Woody frontends only have a limited software installation with regard to GPGPU computing. It is recommended to compile code on one of the TinyGPU nodes, e.g. by requesting an interactive job on TinyGPU.
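
As a sketch, an interactive session for compiling could be requested as follows (the Slurm variant is described above; for the Torque nodes, the standard Torque interactive option -I is assumed):

# on the Slurm nodes (tg08x/tg09x), from the woody frontend:
salloc.tinygpu --gres=gpu:1 --time=01:00:00
# on the Torque nodes, from the woody frontend:
qsub.tinygpu -I -l nodes=1:ppn=4,walltime=01:00:00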

Additionally, the Slurm nodes were already upgraded to a newer Ubuntu version and therefore use different software versions and modules than the Torque nodes and the woody frontend. Applications should be recompiled on these nodes to ensure compatibility.

cuDNN is installed as a system package on the nodes – no module required.

The GTX 1080/GTX 1080 Ti GPUs can only be used with CUDA 9.0 (or higher). The V100 may require at least CUDA 9.0. The RTX 2080 Ti may require at least CUDA 10.0. The A100 and RTX 3080 require at least CUDA 11.x.

Host software using AVX512 instructions will only run on tg06x/tg07x/tg08x. Host software compiled specifically for Intel processors might not run on tg09x.