Scaling CUDA-Accelerated Applications

Scaling a GPU application beyond a single accelerator requires both intra-node and inter-node parallelism. This course provides a comprehensive treatment of both: part one covers CUDA streams, multi-GPU execution within a node, and direct peer-to-peer GPU memory access; part two extends that foundation across compute nodes using CUDA-aware MPI and NVSHMEM, including 1D domain decomposition and halo-exchange patterns, with copy/compute overlap as a recurring optimization. A single 2D heat-diffusion stencil serves as the running example, refined step by step from a CPU baseline through managed memory and algorithmic partitioning to distributed multi-GPU execution. Each hands-on step is provided at multiple difficulty levels, from guided starting points to full solutions.
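As a rough illustration of the running example, the CPU baseline can be pictured as a Jacobi-style sweep over a 2D grid. The following is a minimal C++ sketch, not the course material itself: the function names `heat_step` and `run_heat`, the row-major layout, the fixed boundary values, and the diffusion number `alpha` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One explicit time step of 2D heat diffusion on an nx-by-ny grid (row-major).
// Boundary cells are left untouched (assumed fixed-temperature boundary).
// alpha is a hypothetical diffusion number; the explicit scheme is stable for alpha <= 0.25.
void heat_step(const std::vector<double>& in, std::vector<double>& out,
               std::size_t nx, std::size_t ny, double alpha) {
    assert(in.size() == nx * ny && out.size() == nx * ny);
    for (std::size_t y = 1; y + 1 < ny; ++y) {
        for (std::size_t x = 1; x + 1 < nx; ++x) {
            const std::size_t i = y * nx + x;
            // 5-point stencil: left, right, upper, and lower neighbors.
            out[i] = in[i] + alpha * (in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx]
                                      - 4.0 * in[i]);
        }
    }
}

// Runs `steps` time steps with two buffers that are swapped after each sweep.
std::vector<double> run_heat(std::vector<double> grid, std::size_t nx, std::size_t ny,
                             double alpha, int steps) {
    std::vector<double> next = grid;  // copy so both buffers hold the boundary values
    for (int s = 0; s < steps; ++s) {
        heat_step(grid, next, nx, ny, alpha);
        std::swap(grid, next);
    }
    return grid;
}
```

Each interior cell depends only on values from the previous time step, so all cells of a sweep can be updated independently. That independence is what makes a stencil of this kind a natural candidate for mapping one grid cell to one GPU thread.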

This course was developed to replace two formerly separate NVIDIA DLI courses, Accelerating CUDA C++ Applications with Multiple GPUs and Scaling CUDA C++ Applications to Multiple Nodes, which were first put on hold and then discontinued in 2025 and 2026.

Course Details

Level: Intermediate to advanced

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Prerequisites

Knowledge

Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
Familiarity with the Linux command line as well as compiling and running CUDA applications

Technical

A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
A local installation of NVIDIA Nsight Systems

Learning Outcomes

After completing this course, you will be able to:

Port a CPU application to a single GPU using CUDA managed memory and prefetching
Use concurrent CUDA streams to overlap memory transfers with GPU computation
Scale CUDA C++ workloads across multiple GPUs within a single compute node
Enable and exploit direct peer-to-peer GPU memory access for efficient intra-node communication
Write portable, scalable SPMD code using CUDA-aware MPI with inter-node GPU communication
Apply NVSHMEM for GPU-initiated data transfers using the symmetric memory model
Implement domain decomposition and halo exchange patterns for distributed GPU workloads
Profile multi-GPU execution and identify performance bottlenecks with NVIDIA Nsight Systems

Course Outline

Motivation and the running example: a 2D heat-diffusion stencil scaled throughout the course
CPU baseline and single-GPU port: managed memory, 2D execution configuration, and prefetching
Algorithmic work partitioning: decomposing the domain into patches for multi-GPU execution
CUDA streams: concurrent per-patch execution and Nsight Systems timeline analysis
Multi-GPU within a node: device management, per-patch allocations, and halo exchange
Direct inter-GPU communication: unified virtual addressing and peer-to-peer transfers
Overlapping communication and computation with multiple streams
Multi-node parallelism with MPI: rank-to-GPU mapping, CUDA-aware MPI, and GPUDirect RDMA
NVSHMEM: the symmetric-memory model and GPU-initiated one-sided communication
Outlook: hierarchical reductions (CUB, NCCL), multi-dimensional domain decomposition, and parallel I/O
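The domain decomposition and halo exchange named in the outline can be emulated on the host before any GPU is involved. The sketch below is an assumption-laden illustration rather than the course's code: the `Patch` struct and `run_decomposed` function are hypothetical names, the grid is split into horizontal strips (a 1D decomposition), and plain `std::copy` calls stand in for what would be peer-to-peer copies, CUDA-aware MPI messages, or NVSHMEM puts in a real multi-GPU setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A horizontal strip of the global grid: `rows` owned rows plus one halo row
// above (local row 0) and one below (local row rows + 1).
struct Patch {
    std::size_t first_row = 0;  // global index of the first owned row
    std::size_t rows = 0;       // number of owned rows
    std::vector<double> cur, next;
};

// Runs `steps` heat-diffusion steps on an nx-by-ny grid split into `parts` strips.
std::vector<double> run_decomposed(const std::vector<double>& grid, std::size_t nx,
                                   std::size_t ny, double alpha, int steps,
                                   std::size_t parts) {
    assert(grid.size() == nx * ny && parts >= 1 && parts <= ny);
    std::vector<Patch> patches(parts);
    std::size_t row = 0;
    for (std::size_t p = 0; p < parts; ++p) {  // distribute rows as evenly as possible
        Patch& pa = patches[p];
        pa.first_row = row;
        pa.rows = ny / parts + (p < ny % parts ? 1 : 0);
        pa.cur.assign((pa.rows + 2) * nx, 0.0);
        std::copy(grid.begin() + row * nx, grid.begin() + (row + pa.rows) * nx,
                  pa.cur.begin() + nx);
        pa.next = pa.cur;  // both buffers start with the fixed boundary values
        row += pa.rows;
    }
    for (int s = 0; s < steps; ++s) {
        // Halo exchange: neighboring strips swap their outermost owned rows.
        for (std::size_t p = 0; p + 1 < parts; ++p) {
            Patch& up = patches[p];
            Patch& down = patches[p + 1];
            std::copy(down.cur.begin() + nx, down.cur.begin() + 2 * nx,
                      up.cur.begin() + (up.rows + 1) * nx);
            std::copy(up.cur.begin() + up.rows * nx,
                      up.cur.begin() + (up.rows + 1) * nx, down.cur.begin());
        }
        // Stencil update per strip; rows on the global boundary stay fixed.
        for (Patch& pa : patches) {
            for (std::size_t l = 1; l <= pa.rows; ++l) {
                const std::size_t g = pa.first_row + l - 1;
                if (g == 0 || g + 1 == ny) continue;
                for (std::size_t x = 1; x + 1 < nx; ++x) {
                    const std::size_t i = l * nx + x;
                    pa.next[i] = pa.cur[i] + alpha * (pa.cur[i - 1] + pa.cur[i + 1]
                                 + pa.cur[i - nx] + pa.cur[i + nx] - 4.0 * pa.cur[i]);
                }
            }
            std::swap(pa.cur, pa.next);
        }
    }
    std::vector<double> out(nx * ny);  // gather the owned rows back into one grid
    for (const Patch& pa : patches)
        std::copy(pa.cur.begin() + nx, pa.cur.begin() + (pa.rows + 1) * nx,
                  out.begin() + pa.first_row * nx);
    return out;
}
```

Calling `run_decomposed` with `parts` set to 1 gives a single-domain reference, and a correct halo exchange should reproduce that result for any other strip count. In this sketch, exchange and update run strictly one after the other; overlapping the two is the optimization the stream-based outline items address.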

Upcoming Events

2026, Sep 7-9: three-day online course in collaboration with NHR@TUD (Register); part 2 of From Zero to Multi-Node GPU Programming

Past Events

2026, Mar 10-11: two-day online course in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming

For an overview of all NHR@FAU courses, visit the course overview page.
