Working with NVIDIA GPUs


NVIDIA compiler and libraries

The CUDA compilers are part of the cuda modules. Loading the appropriate module (e.g. cuda/11.2) not only sets the path to the NVIDIA CUDA compilers but also defines environment variables such as CUDA_HOME or CUDA_INSTALL_PATH, which can be used in Makefiles, etc.

The NVIDIA HPC SDK (formerly PGI) compilers are part of the nvhpc modules.
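
As a minimal sketch of how this fits together (the module version and the source file name are just examples), building a CUDA program could look like this:

# load a CUDA module; check `module avail cuda` for the versions installed
module load cuda/11.2
# the module sets CUDA_HOME, which a Makefile might reference
echo $CUDA_HOME
# compile an example CUDA source file with the NVIDIA CUDA compiler
nvcc -O3 -o saxpy saxpy.cu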

GPU statistics in job output

By default, Slurm saves the standard output stream to a file in the working directory whose name is composed automatically from the job name and the job ID. Statistics on GPU utilization are appended at the very end of this file. Each CUDA binary call prints one line with the GPU name, bus ID, process ID, GPU and memory utilization, maximum memory usage, and overall execution time.

The output will look like this:

=== GPU utilization ===
gpu_name, gpu_bus_id, pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 134883, 92 %, 11 %, 395 MiB, 244633 ms
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 135412, 92 %, 11 %, 395 MiB, 243797 ms

In this example, two CUDA binary calls happened; both were running on the same GPU (00000000:1A:00.0). The average GPU utilization was 92%, 11% of the GPU memory (395 MiB) was used, and each binary ran for about 244 seconds.

NVIDIA System Management Interface

The System Management Interface (nvidia-smi) is a command line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices. nvidia-smi provides monitoring and management capabilities for NVIDIA Tesla, Quadro, GRID and GeForce devices from the Fermi and newer architecture families.

Using nvidia-smi on our clusters

  1. ssh to the node where the job runs; if you have multiple jobs running on the same node:
    1. On Alex: use srun --jobid=<jobid> --overlap --pty /bin/bash on the frontend instead of ssh to attach to a specific job
    2. On TinyGPU: you will be placed in the allocation of the job that most recently started a new job step (either by starting the job or by calling srun)
  2. type nvidia-smi to see the GPU utilization

The output of nvidia-smi consists of two tables. The upper part contains information about the GPU and reports the percentage of GPU utilization in its bottom right cell; the lower part lists the processes that are running on the GPU and shows how much GPU memory they use. The device numbers for GPU jobs always start with 0, as can be seen in the bottom left cell of the upper table, because each job is treated on its own. Thus, when you contact us with bug reports or for general help, please include the job ID and the GPU bus ID from the middle cell of the upper table in your message.
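
As an illustration (the job ID 1234567 is a placeholder), checking the GPUs of a running job on Alex could look like this:

# on the Alex frontend: attach a shell to the allocation of the job
srun --jobid=1234567 --overlap --pty /bin/bash
# on the compute node: show GPU utilization and the processes using the GPU
nvidia-smi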

nvtop: GPU status viewer

Nvtop stands for Neat Videocard TOP, an (h)top-like task monitor for AMD and NVIDIA GPUs. It can handle multiple GPUs and prints information about them in a way familiar from htop. It provides information on the GPU state (GPU and memory utilization, temperature, etc.) as well as on the processes executing on the GPUs. nvtop is available as a module on Alex and TinyGPU.
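
A minimal usage sketch, assuming the module is simply named nvtop:

# on the node where your job runs: load the module and start the interactive monitor
module load nvtop
nvtop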

NVIDIA Multi-Process Service

The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.

Using MPS with single-GPU jobs

# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d

# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 & 
./a.out -param 3 & 
./a.out -param 4 & 
wait

# stop the MPS daemon
echo quit | nvidia-cuda-mps-control

Using MPS with multi-GPU jobs

# set necessary environment variables and start the MPS daemon
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
    echo "starting mps server for $GPU"
    export CUDA_VISIBLE_DEVICES=$GPU
    export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
    export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log-${GPU}.$SLURM_JOB_ID
    nvidia-cuda-mps-control -d
done
# do your work - you may need to set CUDA_MPS_PIPE_DIRECTORY correctly per process!!
...
# cleanup MPS
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
    echo "stopping mps server for $GPU"
    export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
    echo 'quit' | nvidia-cuda-mps-control
done
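
The following is only a sketch of the "do your work" step, assuming one process per GPU and using a.out as a placeholder; each process is pointed at the MPS pipe directory of its GPU:

# example work step: start one process per GPU, each talking to its own MPS daemon
i=1
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
    CUDA_VISIBLE_DEVICES=$GPU \
    CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID \
    ./a.out -param $i &
    i=$((i+1))
done
wait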

See also http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html and https://stackoverflow.com/questions/36015005/cuda-mps-servers-fail-to-start-on-workstation-with-multiple-gpus.

GPU-Profiling with NVIDIA tools

NVIDIA offers two prominent profiling tools: Nsight Systems, which targets profiling of whole applications, and Nsight Compute, which allows zeroing in on specific performance characteristics of single kernels.


An overview of application behavior can be obtained by running

nsys profile ./a.out

transferring the resulting report file to your local machine and opening it with a local installation of Nsight Systems. More command line options are available, as specified in the documentation. Some of the most relevant ones are

--stats=true --force-overwrite=true -o my-profile

--stats=true summarizes the obtained performance data after the application has finished and prints this summary to the command line. -o specifies the name of the generated report file (my-profile in this example). --force-overwrite=true advises the profiler to overwrite the report file should it already exist.

A full example could be

nsys profile --stats=true --force-overwrite=true -o my-profile ./a.out

Important: The resulting report files can grow quite large, depending on the application examined. Please make sure to use the appropriate file systems.


After getting an execution time overview, more in-depth analysis can be carried out by using Nsight Compute via

ncu ./a.out

which by default profiles all kernels in the application. This can be fine-tuned by providing options such as

--launch-skip 2 --launch-count 1

to skip the first two kernel launches and limit the number of profiled kernels to 1. Profiling can also be limited to specific kernels using

--kernel-name my_kernel

with an assumed kernel name of my_kernel. In most cases, specifying metrics to be measured is recommended as well, e.g. with

--metrics dram__bytes_read.sum,dram__bytes_write.sum

for the data volumes read from and written to the GPU’s main memory. Further information on available metrics, a list of key metrics, and other command line options can be found in the Nsight Compute documentation.

A full profiling call could be

ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 --metrics dram__bytes_read.sum,dram__bytes_write.sum ./a.out

LIKWID

LIKWID is a powerful performance tool and library suite for performance-oriented programmers and administrators using the GNU/Linux operating system. For example, likwid-topology can be used to display the thread and cache topology on multicore/multisocket computers, likwid-perfctr is a tool to measure hardware performance counters on recent Intel and AMD processors, and likwid-pin allows you to pin your threaded application without changing your code.

LIKWID 5.0 also supports NVIDIA GPUs. To simplify the transition from CPUs to GPUs for users, the LIKWID API for GPUs is basically a copy of the LIKWID API for CPUs, with a few differences. For the command line applications, new CLI options were introduced. A tutorial on how to use LIKWID with NVIDIA GPUs can be found on the LIKWID GitHub page.
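
A hedged sketch of such a measurement (the GPU list option -G, the GPU performance group option -W, the group name FLOPS_DP, and a.out are assumptions; check the LIKWID documentation for the groups available on your GPU):

# count double-precision floating-point operations on GPU 0 while running a.out
module load likwid
likwid-perfctr -G 0 -W FLOPS_DP ./a.out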
