
TinyGPU cluster (Tier3)


The TinyGPU cluster is one of a group of small special-purpose clusters. TinyGPU is intended for running applications utilizing GPU accelerators.

On 19.09.2022, the name of the TinyGPU frontend node was changed to tinyx.nhr.fau.de.

There are a number of different machines with different types of GPUs (mostly of consumer type) in TinyGPU.

tg060–tg06b (not all nodes generally available)
  • 12 nodes (48 GPUs)
  • 2x Intel Xeon Gold 6134 (“Skylake”) @3.2 GHz = 16 cores per machine with optional SMT; 96 GB RAM
  • 4x NVIDIA RTX 2080 Ti (11 GB memory) per node
  • local SATA-SSD (1.8 TB) available under /scratchssd

tg071–tg074 (not always generally available)
  • 4 nodes (16 GPUs)
  • 2x Intel Xeon Gold 6134 (“Skylake”) @3.2 GHz = 16 cores per machine with optional SMT; 96 GB RAM
  • 4x NVIDIA Tesla V100 (32 GB memory) per node
  • local NVMe-SSD (2.9 TB) available under /scratchssd

tg080–tg086 (not always generally available)
  • 7 nodes (56 GPUs)
  • 2x Intel Xeon Gold 6226R @2.9 GHz = 32 cores per machine with optional SMT; 384 GB RAM
  • 8x NVIDIA GeForce RTX3080 (10 GB memory; in PCIe3 x16 slot) per node
  • local SATA-SSD (3.8 TB) available under /scratchssd

tg090–tg097 (not always generally available)
  • 8 nodes (32 GPUs)
  • 2x AMD Rome 7662 @2.0 GHz = 128 cores per machine, SMT off; 512 GB RAM
  • 4x NVIDIA A100 SXM4/NVLink (40 GB memory) per node
  • local NVMe-SSD (5.8 TB) available under /scratchssd

45 out of the 48 nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.

On the Slurm nodes, an “owner-claim” policy is implemented: the owners can queue high-priority jobs that displace regular jobs, i.e. normal jobs will be killed by the system without further notice to free resources for the “owner-claim”.

Access, User Environment, and File Systems

Access to the machines

TinyGPU shares a frontend node with TinyFat. To access the systems, please connect to tinyx.nhr.fau.de via ssh.


While it is possible to ssh directly to a compute node, a user is only allowed to do this when they have a batch job running there. When all batch jobs of a user on a node have ended, all of their shells will be killed automatically.

File Systems

The following table summarizes the available file systems and their features. Also, check the description of the HPC file systems.

File system overview for the TinyGPU cluster:

/home/hpc ($HOME)
  • Purpose: storage of source, input, important results
  • Technology, size: central servers
  • Backup: YES + snapshots
  • Data lifetime: account lifetime
  • Quota: YES (restrictive)

/home/vault ($HPCVAULT)
  • Purpose: mid- to long-term high-quality storage
  • Technology, size: central servers
  • Backup: YES + snapshots
  • Data lifetime: account lifetime
  • Quota: YES

/home/{woody, saturn, titan, janus, atuin} ($WORK)
  • Purpose: storage for small files
  • Technology, size: NFS
  • Backup: limited
  • Data lifetime: account lifetime
  • Quota: YES

/scratchssd ($TMPDIR)
  • Purpose: temporary job data directory
  • Technology, size: node-local SSD, between 880 GB and 5.8 TB
  • Backup: NO
  • Data lifetime: job runtime
  • Quota: NO

Node-local storage $TMPDIR

Each node has at least 880 GB of local SSD capacity for temporary files available under $TMPDIR (also accessible via /scratchssd). All files in these directories will be deleted at the end of a job without any notification. Important data to be kept can be copied to a cluster-wide volume at the end of the job, even if the job is canceled by a time limit.
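The staging pattern described above can be sketched as a minimal job script; the program and file names (simulation, input.dat, results.dat) are placeholders, not part of this page:

```shell
#!/bin/bash -l
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

# Copy input data to the fast node-local SSD and run there:
cp "$SLURM_SUBMIT_DIR/input.dat" "$TMPDIR/"
cd "$TMPDIR"
./simulation input.dat   # placeholder for your application

# Copy important results back to a cluster-wide volume before the
# job ends, since everything under $TMPDIR is deleted afterwards:
cp "$TMPDIR/results.dat" "$SLURM_SUBMIT_DIR/"
```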

Please use the node-local SSDs only if you can really profit from them: like all consumer SSDs, they support only a limited number of writes, so writing to them “uses them up”.

Batch Processing

All user jobs must be submitted to the cluster by means of the batch system. The submitted jobs are routed into a number of queues (depending on the needed resources, e.g. runtime) and sorted according to a priority scheme.

All nodes on TinyGPU currently use Slurm as a batch system. Please see the batch system description for general information about how to use Slurm. In the following, only the features specific to TinyGPU will be described.

Slurm (RTX2080Ti, RTX3080, V100 and A100 nodes)

To specify to which cluster jobs should be submitted, command wrappers are available for most Slurm commands. This means that jobs can be submitted from the woody frontend via sbatch.tinygpu. Other examples are srun.tinygpu, salloc.tinygpu, sinfo.tinygpu and squeue.tinygpu. These commands are equivalent to using the option --clusters=tinygpu. When resubmitting jobs from the compute nodes themselves, only use sbatch, i.e. without the .tinygpu suffix.

In contrast to other clusters, the compute nodes are not allocated exclusively but are shared among several jobs – the GPUs themselves are always granted exclusively. Resources are granted on a per-GPU basis. The corresponding share of the resources of the host system (CPU cores, RAM) is automatically allocated. This means that it is sufficient to request the number of GPUs you want to use, e.g. with --gres=gpu:1. To request a specific GPU type, use e.g. --gres=gpu:rtx3080:1. For the V100 and A100 nodes, you additionally have to specify --partition=v100 or --partition=a100 and use --gres=gpu:v100:1 or --gres=gpu:a100:1. If you do not request a specific GPU type, your job will run on any available node in the work partition (i.e. RTX2080Ti or RTX3080 GPUs). Jobs that do not request at least one GPU will be rejected by the scheduler. There is no “devel” queue available on TinyGPU. The maximum walltime any job can request is 24h. In contrast to the Torque GPU nodes, there is currently no way to access hardware performance counters on the Slurm GPU nodes.
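The request rules above can be sketched as a small helper that assembles the right submission options per GPU type; the function name and structure are illustrative and not part of the official tooling:

```shell
#!/bin/sh
# Assemble the TinyGPU Slurm options for N GPUs of a given type,
# following the rules described above:
#  - "any":              --gres=gpu:N                 (work partition)
#  - rtx2080ti/rtx3080:  --gres=gpu:TYPE:N            (work partition)
#  - v100/a100:          --gres=gpu:TYPE:N plus the matching --partition
tinygpu_gpu_opts() {
    gputype=$1
    ngpus=$2
    case $gputype in
        any)                echo "--gres=gpu:$ngpus" ;;
        rtx2080ti|rtx3080)  echo "--gres=gpu:$gputype:$ngpus" ;;
        v100|a100)          echo "--gres=gpu:$gputype:$ngpus --partition=$gputype" ;;
        *)                  echo "unsupported GPU type: $gputype" >&2; return 1 ;;
    esac
}

tinygpu_gpu_opts any 1     # prints: --gres=gpu:1
tinygpu_gpu_opts a100 2    # prints: --gres=gpu:a100:2 --partition=a100
```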

Partitions on the TinyGPU cluster
work (default)
  • Walltime: 0 – 24:00:00
  • GPU type (GPU memory): NVIDIA RTX 2080 Ti (11 GB) or NVIDIA GeForce RTX3080 (10 GB)
  • GPUs per job: 1–4 (RTX 2080 Ti) or 1–8 (RTX3080)
  • CPU cores per GPU: 8 (with SMT)
  • Host memory per GPU: 22 GB
  • Slurm options (# = number of requested GPUs): --gres=gpu:# or --gres=gpu:rtx3080:# or --gres=gpu:rtx2080ti:#

rtx3080
  • Walltime: 0 – 24:00:00
  • GPU type (GPU memory): NVIDIA GeForce RTX3080 (10 GB)
  • GPUs per job: 1–8
  • CPU cores per GPU: 8 (with SMT)
  • Host memory per GPU: 46 GB
  • Slurm options: --gres=gpu:# and --partition=rtx3080

a100
  • Walltime: 0 – 24:00:00
  • GPU type (GPU memory): NVIDIA A100 SXM4/NVLink (40 GB)
  • GPUs per job: 1–4
  • CPU cores per GPU: 32 (without SMT)
  • Host memory per GPU: 117 GB
  • Slurm options: --gres=gpu:a100:# and --partition=a100

v100
  • Walltime: 0 – 24:00:00
  • GPU type (GPU memory): NVIDIA Tesla V100 (32 GB)
  • GPUs per job: 1–4
  • CPU cores per GPU: 8 (with SMT)
  • Host memory per GPU: 22 GB
  • Slurm options: --gres=gpu:v100:# and --partition=v100

We recommend always using srun instead of mpirun or mpiexec to start your parallel application, since it automatically uses the allocated resources (number of tasks, cores per task, …) and also binds the tasks to the allocated cores. If you have to use mpirun, make sure to check that the binding of your processes is correct (e.g. with --report-bindings for OpenMPI and export I_MPI_DEBUG=4 for IntelMPI). OpenMP threads are not automatically pinned to specific cores. In order for the application to run efficiently, this has to be done manually.
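If mpirun cannot be avoided, the binding checks mentioned above look like this on the command line; the executable name is a placeholder:

```shell
# OpenMPI: print each rank's core binding when the job starts:
mpirun --report-bindings ./executable.exe

# Intel MPI: enable pinning diagnostics in the job output:
export I_MPI_DEBUG=4
mpirun ./executable.exe

# OpenMP threads are not pinned automatically; pin them explicitly:
export OMP_PLACES=cores
export OMP_PROC_BIND=true
```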

Since September 5th, 2022, the woody frontend woody3 runs the same Ubuntu version as the Slurm compute nodes of TinyGPU.

An interactive job is still recommended for compiling GPU software for the Slurm nodes.

We recommend using the sbatch option --export=none to prevent the environment of the submission shell from being exported into the job. Additionally, unset SLURM_EXPORT_ENV has to be called before srun to ensure that it executes correctly. Both options are already included in the example scripts below.

Example Slurm Batch Scripts (RTX2080Ti, RTX3080, V100 and A100 nodes only)

For the most common use cases, examples are provided below. Note that these scripts possibly have to be adapted to your specific application and use case!

MPI

In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. The job allocates one GPU and the corresponding share of CPUs and main memory (e.g. 8 cores including SMT in the case of an RTX3080).

#!/bin/bash -l
#
# start 2 MPI processes
#SBATCH --ntasks=2
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# allocate one GPU (type not specified)
#SBATCH --gres=gpu:1
# job name 
#SBATCH --job-name=Testjob
# do not export environment variables
#SBATCH --export=NONE

# do not export environment variables
unset SLURM_EXPORT_ENV

srun --mpi=pmi2 ./executable.exe

Hybrid MPI/OpenMP

In this example, one A100 GPU is allocated. The executable will be run using 2 MPI processes with 16 OpenMP threads each for a total job walltime of 6 hours. 32 cores are allocated in total and each OpenMP thread is running on a physical core.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#
# start 2 MPI processes
#SBATCH --ntasks=2
# request 16 OpenMP threads per MPI task
#SBATCH --cpus-per-task=16
# allocate one A100 GPU
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name 
#SBATCH --job-name=Testjob
# do not export environment variables
#SBATCH --export=NONE

# do not export environment variables
unset SLURM_EXPORT_ENV

# cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# set number of threads to requested cpus-per-task 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./executable_hybrid.exe

OpenMP Job

In this example, the executable will be run using 8 OpenMP threads for a total job walltime of 6 hours. One RTX3080 GPU and the corresponding 8 cores (including SMT) are allocated in total.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#
# request 8 OpenMP threads
#SBATCH --cpus-per-task=8
# allocate one RTX3080 GPU
#SBATCH --gres=gpu:rtx3080:1
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name
#SBATCH --job-name=Testjob
# do not export environment variables
#SBATCH --export=NONE

# do not export environment variables
unset SLURM_EXPORT_ENV

# cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./executable_hybrid.exe

 

Interactive Slurm Shell (RTX2080Ti, RTX3080, V100 and A100 nodes only)

To get an interactive Slurm shell on one of the compute nodes, issue the following command on the frontend:

salloc.tinygpu --gres=gpu:1 --time=00:30:00

This will give you an interactive shell for 30 minutes on one of the nodes, allocating 1 GPU and the respective number of CPU cores. There you can, for example, compile your code or do test runs of your binary. For MPI-parallel binaries, use srun instead of mpirun.

Please note that salloc automatically exports the environment of your shell on the login node to your interactive job. This can cause problems if you have loaded any modules, due to version differences between the frontend and the TinyGPU compute nodes. To mitigate this, purge all loaded modules via module purge before issuing the salloc command.
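Put together, a typical interactive compile session might look as follows; the module name and source file are assumptions for illustration, not taken from this page:

```shell
# On the frontend: clear all modules to avoid version mismatches,
# then request one GPU for 30 minutes:
module purge
salloc.tinygpu --gres=gpu:1 --time=00:30:00

# Inside the allocation (now on a compute node):
module load cuda          # assumed module name
nvcc -O2 -o app app.cu    # hypothetical CUDA source file
exit                      # leave the allocation when done
```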

Attach to a running job

On the frontend node, the following steps are necessary:

  1. Check on which node the job is running with squeue.tinygpu.
  2. If you have only one job running on a node, you can use ssh <nodename> to connect to it. You will be placed in the allocation of the job.
  3. If you have multiple jobs running on a node, you can attach to a running job with srun --pty --jobid YOUR-JOBID bash. This will give you a shell on the first node of your job, and you can run top, nvidia-smi, etc. to check on it.

Attaching to a running job can be used e.g. to check GPU utilization via nvidia-smi. For more information on nvidia-smi and GPU profiling, see Working with NVIDIA GPUs.

Software

The frontend only has a limited software installation with regard to GPGPU computing. It is recommended to compile code on one of the TinyGPU nodes, i.e. by requesting an interactive job on TinyGPU.

cuDNN is installed as a system package on the nodes – no module required.

The V100 requires at least CUDA 9.0, the RTX2080Ti at least CUDA 10.0, and the A100 and RTX3080 at least CUDA 11.x.

Host software using AVX512 instructions will only run on tg06x/tg07x/tg08x. Host software compiled specifically for Intel processors might not run on tg09x.

Erlangen National High Performance Computing Center (NHR@FAU)
Martensstraße 1
91058 Erlangen
Germany