Working with NVIDIA GPUs
NVIDIA compiler and libraries
The CUDA compilers are part of the cuda modules. Loading the appropriate module (e.g. cuda/11.2) not only sets the path to the NVIDIA CUDA compilers but also defines variables such as CUDA_INSTALL_PATH, which can be used in Makefiles, for example.
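As a rough illustration of how CUDA_INSTALL_PATH might be used outside a Makefile, the following sketch derives compiler and linker flags from it. The helper name cuda_flags is hypothetical, and the include and lib64 subdirectories are an assumption about the usual CUDA toolkit layout on Linux; check the actual module before relying on it.

```shell
# Hypothetical helper (not provided by the cuda module): print include and
# link flags based on CUDA_INSTALL_PATH.
# Assumption: the toolkit uses the common include/ and lib64/ layout.
cuda_flags() {
    if [ -z "${CUDA_INSTALL_PATH:-}" ]; then
        echo "CUDA_INSTALL_PATH is not set; load a cuda module first" >&2
        return 1
    fi
    printf '%s\n' "-I$CUDA_INSTALL_PATH/include -L$CUDA_INSTALL_PATH/lib64 -lcudart"
}
```

After loading a module such as cuda/11.2, a host compiler call could then look like g++ main.cpp $(cuda_flags) -o main, where main.cpp is a placeholder.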
The Nvidia (formerly PGI) compilers are part of the
GPU statistics in job output
By default, Slurm writes the standard output stream to a file in the working directory; the filename is composed automatically from the job name and the job ID. Statistics on GPU utilization are appended at the very end of this file. Each CUDA binary call adds a line with the GPU name, bus ID, process ID, GPU and memory utilization, maximum memory usage, and overall execution time.
The output will look like this:
=== GPU utilization ===
gpu_name, gpu_bus_id, pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 134883, 92 %, 11 %, 395 MiB, 244633 ms
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 135412, 92 %, 11 %, 395 MiB, 243797 ms
In this example, two CUDA binaries were called; both ran on the same GPU (00000000:1A:00.0). For each call, the average GPU utilization was 92%, the memory utilization was 11%, the maximum memory usage was 395 MiB, and the run took about 244 seconds.
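If the statistics need to be processed further, the block can be extracted with standard tools. The following is a minimal sketch, assuming the job output file contains the statistics block one record per line, with the comma-separated columns shown in the example output; the function name gpu_stats_summary is hypothetical.

```shell
# Hypothetical helper: print one summary line per CUDA process listed in the
# "=== GPU utilization ===" block of a Slurm output file.
# Assumption: one comma-separated record per line, 7 columns per record.
gpu_stats_summary() {
    awk -F', *' '
        /^=== GPU utilization ===/ { in_block = 1; next }   # start of the block
        in_block && /^gpu_name/    { next }                 # skip the column header
        in_block && NF == 7 {
            util = $4; sub(/ *%$/, "", util)                # "92 %"      -> "92"
            ms = $7;   sub(/ *ms$/, "", ms)                 # "244633 ms" -> "244633"
            printf "pid %s on %s: %s%% GPU, %s max, %.1f s\n", $3, $2, util, $6, ms / 1000
        }
    ' "$1"
}
```

It would be called with the path of the job output file as its only argument, e.g. gpu_stats_summary myjob.o123456, where the filename is a placeholder.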
NVIDIA System Management Interface
The System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
nvidia-smi provides monitoring and management capabilities for each of NVIDIA’s Tesla, Quadro, GRID and GeForce devices from Fermi and higher architecture families.
Using nvidia-smi on our clusters
Log in via ssh to the node where the job runs; if you have multiple jobs running on the same node, you will be placed in the allocation of the job that has most recently started a new job step (either by starting the job or by calling srun); currently, this cannot be changed
Run nvidia-smi to see GPU utilization
The output of nvidia-smi will look similar to the picture on the right. The upper part contains information about the GPU and shows the percentage of GPU utilization in the bottom right cell of the table; the lower part lists the processes that are running on the GPU and shows how much GPU memory is used. Because each job is treated on its own, the device numbers for GPU jobs always start with 0, as can be seen in the bottom left cell of the table. Therefore, if you contact us with a bug report or a general request for help, please include the job ID and the GPU bus ID from the middle cell of the table in your message.
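The values needed for a support request can also be queried in machine-readable form instead of being read off the table. The sketch below uses the --query-gpu and --format options of nvidia-smi; the helper name format_gpu_info and the wording of the summary line are assumptions made for illustration.

```shell
# Hypothetical helper: turn "bus_id, utilization, memory_used" CSV lines
# (as printed by nvidia-smi with csv,noheader,nounits) into a short summary.
format_gpu_info() {
    awk -F', *' -v job="$1" '
        NF == 3 { printf "job %s: GPU bus ID %s, %s%% utilization, %s MiB used\n", job, $1, $2, $3 }
    '
}

# Only query the GPUs if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=pci.bus_id,utilization.gpu,memory.used \
               --format=csv,noheader,nounits | format_gpu_info "${SLURM_JOB_ID:-unknown}"
fi
```

Run inside a job allocation, this prints one line per visible GPU that can be pasted into a message.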
nvtop: GPU status viewer
Nvtop stands for Neat Videocard TOP, an (h)top-like task monitor for AMD and NVIDIA GPUs. It can handle multiple GPUs and prints information about them in a way familiar from htop. It provides information on the GPU states (GPU and memory utilization, temperature, etc.) as well as information about the processes executing on the GPUs.
nvtop is available as a module on Alex and TinyGPU.
NVIDIA Multi-Process Service
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
Using MPS with single-GPU jobs
# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d

# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 &
./a.out -param 3 &
./a.out -param 4 &
wait

# stop the MPS daemon
echo quit | nvidia-cuda-mps-control
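The same start, work, stop sequence could be wrapped in a function so that the daemon is always stopped once the workload returns. This is only a sketch: the helper names mps_setup_env and with_mps are hypothetical, and the fallbacks to /tmp and nojob outside a Slurm job are assumptions added for illustration.

```shell
# Hypothetical helper: derive per-job MPS directories, mirroring the
# variables used in the script above.
# Assumption: fall back to /tmp and "nojob" when not running under Slurm.
mps_setup_env() {
    mps_base="${TMPDIR:-/tmp}"
    mps_job="${SLURM_JOB_ID:-nojob}"
    export CUDA_MPS_PIPE_DIRECTORY="$mps_base/nvidia-mps.$mps_job"
    export CUDA_MPS_LOG_DIRECTORY="$mps_base/nvidia-log.$mps_job"
}

# Hypothetical helper: start the MPS daemon, run the given command, then stop
# the daemon and return the exit status of the command.
with_mps() {
    mps_setup_env
    nvidia-cuda-mps-control -d || return 1
    "$@"
    mps_status=$?
    echo quit | nvidia-cuda-mps-control
    return "$mps_status"
}
```

A call could look like with_mps sh -c './a.out -param 1 & ./a.out -param 2 & wait', with a.out again being just a placeholder.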
GPU-Profiling with NVIDIA tools
NVIDIA offers two prominent profiling tools: Nsight Systems which targets profiling whole applications and Nsight Compute which allows zeroing in on specific performance characteristics of single kernels.
An overview of application behavior can be obtained by running
nsys profile ./a.out
and then transferring the resulting report file to your local machine and opening it with a local installation of Nsight Systems. More command line options are available, as specified in the documentation. Some of the most relevant ones are
--stats=true --force-overwrite=true -o my-profile
--stats=true summarizes the collected performance data after the application has finished and prints the summary to the command line. -o specifies the output file name for the generated report (my-profile in this example). --force-overwrite=true instructs the profiler to overwrite the report file if it already exists.
A full example could be
nsys profile --stats=true --force-overwrite=true -o my-profile ./a.out
Important: The resulting report files can grow quite large, depending on the application examined. Please make sure to use the appropriate file systems.
After getting an execution time overview, more in-depth analysis can be carried out by using Nsight Compute via
ncu ./a.out
which by default profiles all kernels in the application. This can be fine-tuned by providing options such as
--launch-skip 2 --launch-count 1
to skip the first two kernel launches and limit the number of profiled kernels to 1. Profiling can also be limited to specific kernels using
--kernel-name my_kernel
with an assumed kernel name of my_kernel. In most cases, specifying metrics to be measured is recommended as well, e.g. with
--metrics dram__bytes_read.sum,dram__bytes_write.sum
for the data volumes read from and written to the GPU’s main memory. Further information on available metrics can be found here and some key metrics are listed here.
Other command line options can be reviewed in the documentation.
A full profiling call could be
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 --metrics dram__bytes_read.sum,dram__bytes_write.sum ./a.out
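When several kernels are profiled in turn, the call above could be wrapped in a small function. The following sketch assumes the same options and metrics as the full example; the function name profile_kernel and the DRY_RUN switch are hypothetical and only serve to show the assembled command without executing it.

```shell
# Hypothetical wrapper around the ncu call shown above.
# Arguments: kernel name, launches to skip, launches to profile, then the
# application and its arguments.
profile_kernel() {
    pk_kernel=$1
    pk_skip=$2
    pk_count=$3
    shift 3
    # Assemble the full command line as the function's positional parameters.
    set -- ncu --kernel-name "$pk_kernel" \
        --launch-skip "$pk_skip" --launch-count "$pk_count" \
        --metrics dram__bytes_read.sum,dram__bytes_write.sum "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # only print the command (assumed debugging aid)
    else
        "$@"         # run the profiler
    fi
}
```

With DRY_RUN=1 set, profile_kernel my_kernel 2 1 ./a.out prints the command from the full example instead of starting the profiler.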
LIKWID
LIKWID is a powerful performance tool and library suite for performance-oriented programmers and administrators using the GNU/Linux operating system. For example, likwid-topology can be used to display the thread and cache topology on multicore/multisocket computers, likwid-perfctr is a tool to measure hardware performance counters on recent Intel and AMD processors, and likwid-pin allows you to pin your threaded application without changing your code.
LIKWID 5.0 also supports NVIDIA GPUs. To simplify the transition from CPUs to GPUs for users, the LIKWID API for GPUs is basically a copy of the LIKWID API for CPUs with a few differences. For the command line applications, new CLI options have been introduced. A tutorial on how to use LIKWID with NVIDIA GPUs can be found on the LIKWID GitHub page.