Scaling CUDA-Accelerated Applications
Course Description
This advanced course explores techniques for extending single-GPU applications to use multiple GPUs, both within a single compute node and across multiple GPU-equipped nodes. It discusses work-partitioning approaches, data sharing and exchange strategies, and the overlapping of computation and data transfers. Key technologies covered are CUDA streams, CUDA device management, GPUDirect P2P, (CUDA-aware) MPI, and NVSHMEM. All modules are supported by hands-on exercises at multiple levels of difficulty, as well as by profiling with Nsight Systems to explore the parallel behavior of the implemented applications.
Learning Objectives
At the conclusion of the workshop, you will be able to:
- Use concurrent CUDA streams to overlap (see the sketch after this list):
  - memory transfers with GPU computation,
  - GPU-to-GPU communication with GPU computation, and
  - MPI communication with GPU computation.
- Use several methods for writing multi-GPU CUDA C++ applications.
- Scale workloads across available GPUs on a single node.
- Scale workloads across available GPUs on multiple nodes.
- Utilize the NVIDIA Nsight Systems timeline to identify improvement opportunities and assess the impact of the techniques covered in the workshop.
- Write portable, scalable CUDA code with the single-program multiple-data (SPMD) paradigm using CUDA-aware MPI and NVSHMEM.
- Apply common multi-GPU coding paradigms like domain decomposition and halo exchanges.
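To make the first objective concrete, here is a minimal sketch of overlapping host-device transfers with kernel execution using concurrent CUDA streams. The kernel, array size, and chunk count are illustrative assumptions, not course material:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    // Grid-stride loop so any launch configuration covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24, nChunks = 4, chunk = n / nChunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float)); // pinned host memory enables true async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[nChunks];
    for (int i = 0; i < nChunks; ++i) cudaStreamCreate(&s[i]);

    // Issue copy/compute/copy per chunk, each in its own stream, so the
    // transfers of one chunk can overlap with the kernels of another.
    for (int i = 0; i < nChunks; ++i) {
        float *hp = h + i * chunk, *dp = d + i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        scale<<<256, 256, 0, s[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nChunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

Viewed on the Nsight Systems timeline, the copies in one stream should overlap with kernels running in other streams, provided the host buffer is pinned.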
Course Structure
- CPU baseline and a first GPU port using CUDA managed memory.
- Work partitioning on an algorithmic level.
- CUDA streams for parallel operations.
- Multi-GPU execution and work distribution.
- Halo exchanges for improved data locality.
- P2P communication for optimized transfers between GPUs.
- Overlapping communication and computation.
- MPI ‘hello world’ and MPI through the lens of a CUDA programmer.
- MPI parallelization for multi-node scaling (a halo-exchange sketch follows this list).
- Overlapping MPI exchanges and computation.
- NVSHMEM as an alternative.
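To give a flavor of the multi-node modules, the following is a sketch of a one-dimensional halo exchange using CUDA-aware MPI, where device pointers are passed directly to MPI calls. The domain size, periodic neighbor layout, and one-rank-per-GPU mapping are assumptions for illustration:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assumes ranks are placed one per GPU on each node.
    int nDevices;
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);

    const int nx = 1024;                       // interior width of this rank's slice
    float *d;
    cudaMalloc(&d, (nx + 2) * sizeof(float));  // +2 for one halo cell on each side
    cudaMemset(d, 0, (nx + 2) * sizeof(float));

    int left  = (rank + size - 1) % size;      // periodic neighbors
    int right = (rank + 1) % size;

    // CUDA-aware MPI: device pointers go straight into MPI_Sendrecv,
    // letting the library use GPUDirect paths where available.
    MPI_Sendrecv(d + nx,     1, MPI_FLOAT, right, 0,   // send rightmost interior cell
                 d,          1, MPI_FLOAT, left,  0,   // receive into left halo
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(d + 1,      1, MPI_FLOAT, left,  1,   // send leftmost interior cell
                 d + nx + 1, 1, MPI_FLOAT, right, 1,   // receive into right halo
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}
```

This relies on an MPI build with CUDA awareness (for example, Open MPI configured with CUDA support); without it, the halo cells would have to be staged through host buffers before and after each exchange.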
Certification
A digital certificate of attendance will be awarded to all participants who attend the majority of the course.
Prerequisites
Participants should meet the following requirements:
- Successful completion of Fundamentals of Accelerated Computing with CUDA C/C++, or equivalent experience implementing CUDA C/C++ applications, including the following (a short self-check sketch follows this list):
- Memory allocation
- Host-to-device and device-to-host memory transfers
- Kernel launches
- Grid-stride loops
- CUDA error handling
- Familiarity with the Linux command line
- Experience compiling and executing CUDA applications
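As a rough self-check for these prerequisites, the following sketch combines the listed skills in one place: allocation, host-to-device and device-to-host transfers, a kernel launch with a grid-stride loop, and error handling via a checking macro. The macro name and kernel are illustrative, not course material:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple macro wrapping CUDA calls with error checking.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            return 1;                                             \
        }                                                         \
    } while (0)

__global__ void add_one(float *x, int n) {
    // Grid-stride loop: each thread strides over the whole array.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n](), *d;                     // host buffer, zero-initialized
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));
    add_one<<<128, 256>>>(d, n);
    CUDA_CHECK(cudaGetLastError());                    // check the kernel launch
    CUDA_CHECK(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("h[0] = %f\n", h[0]);                       // expect 1.0
    CUDA_CHECK(cudaFree(d));
    delete[] h;
    return 0;
}
```

If every line here is familiar, the prerequisite level is likely met.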
Upcoming Iterations and Additional Courses
You can find dates and registration links for this and other upcoming NHR@FAU courses at https://go-nhr.de/trainings.
