Scaling CUDA C++ Applications to Multiple Nodes

GPU clusters expose the full power of distributed computing, but scaling a CUDA application beyond a single node requires explicit inter-node communication. This NVIDIA DLI course covers multi-node programming techniques for GPU-accelerated applications, with a strong emphasis on the SPMD programming model, CUDA-aware MPI for inter-node data exchange, and NVSHMEM for fine-grained GPU-initiated communication. Canonical patterns such as domain decomposition and halo exchanges are discussed and implemented.

Prior attendance of Accelerating CUDA C++ Applications with Multiple GPUs is recommended.

Further information about this tutorial can be found on the NVIDIA DLI course page.

This course was first put on hold and then discontinued over the course of 2025 and 2026. NHR@FAU offers Scaling CUDA-Accelerated Applications as an alternative that covers both multi-GPU and multi-node content in a single, streamlined tutorial.

Course Details

Level: Advanced

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Prerequisites

Knowledge

Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
Prior attendance of Accelerating CUDA C++ Applications with Multiple GPUs is recommended
Familiarity with the Linux command line and compilation using Makefiles
Prior knowledge of distributed-memory programming with MPI is helpful but not strictly required

Technical

An up-to-date browser for accessing the course materials and online environment
A free NVIDIA developer account
A local installation of NVIDIA Nsight Systems is recommended

Learning Outcomes

After completing this course, you will be able to:

Apply several multi-GPU communication patterns and reason about their trade-offs
Write portable, scalable SPMD code using CUDA-aware MPI for inter-node GPU communication
Use NVSHMEM’s symmetric memory model to enable GPU-initiated data transfers
Implement domain decomposition and halo exchange patterns for distributed GPU workloads
Scale a CUDA C++ application from a single GPU to multiple nodes
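
To give a flavor of the domain decomposition mentioned above, the following is a minimal host-only sketch of how an SPMD program might split a 1D index range across ranks. It is not taken from the course materials. The names `Range` and `block_range` are hypothetical; in a real application `rank` and `size` would typically come from calls such as `MPI_Comm_rank` and `MPI_Comm_size`, or `nvshmem_my_pe` and `nvshmem_n_pes`.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end) owned by one rank (hypothetical helper type).
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n elements across `size` ranks as evenly as possible.
// The first (n % size) ranks each receive one extra element.
Range block_range(std::size_t n, int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t p = static_cast<std::size_t>(size);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t begin = r * base + (r < extra ? r : extra);
    const std::size_t end = begin + base + (r < extra ? 1 : 0);
    return {begin, end};
}
```

Every rank runs the same code and derives its own subrange from its rank alone, which is the essence of the SPMD model: for 10 elements on 4 ranks, the ranks would own 3, 3, 2, and 2 elements respectively.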

Course Outline

Multi-GPU programming paradigms: peer-to-peer communication, SPMD with CUDA-aware MPI
Introduction to NVSHMEM: symmetric memory, GPU-initiated transfers, multi-GPU SPMD code
Halo exchanges with NVSHMEM: domain decomposition, Jacobi solver, and 1D wave equation
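
The halo exchange pattern named in the outline can be illustrated with a small host-only sketch of a 1D Jacobi sweep over several subdomains. This is an assumption-laden illustration, not the course's implementation: all subdomains live in one process, and the plain copies in the hypothetical `exchange_halos` stand in for what would be CUDA-aware MPI messages or NVSHMEM puts between GPUs. The names `Subdomain`, `exchange_halos`, and `jacobi_sweep` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One subdomain: interior cells at indices 1..n, halo (ghost) cells at 0 and n+1.
using Subdomain = std::vector<double>;

// Copy each subdomain's edge interior cells into its neighbours' halo cells.
// In a multi-node code this copy would be a message or a remote put.
// The outermost halos are never written here, so they act as fixed boundary values.
void exchange_halos(std::vector<Subdomain>& parts) {
    for (std::size_t r = 0; r + 1 < parts.size(); ++r) {
        Subdomain& left = parts[r];
        Subdomain& right = parts[r + 1];
        assert(left.size() >= 3 && right.size() >= 3);
        right.front() = left[left.size() - 2];  // last interior cell of left
        left.back() = right[1];                 // first interior cell of right
    }
}

// One Jacobi sweep: refresh halos, then set every interior cell to the
// mean of its two neighbours, reading only values from the previous iteration.
void jacobi_sweep(std::vector<Subdomain>& parts) {
    exchange_halos(parts);
    for (Subdomain& u : parts) {
        Subdomain next = u;  // halo values are carried over unchanged
        for (std::size_t i = 1; i + 1 < u.size(); ++i) {
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }
        u = next;
    }
}
```

Because the halos are refreshed before every sweep, the decomposed computation produces the same interior values as a sweep over the undivided domain; only the cells at subdomain edges ever need to be communicated.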

Past Events

2025, Sep 17-18: online course over two half days in collaboration with NHR@TUD; part 3 of From Zero to Multi-Node GPU Programming
2025, Mar 26: full-day online course in collaboration with NHR@TUD and EUMaster4HPC; part 3 of From Zero to Multi-Node GPU Programming
2024, Oct 2: full-day online course in collaboration with NHR@TUD; part 3 of From Zero to Multi-Node GPU Programming
2024, Apr 10: full-day online course; part 2 of Multi-GPU Programming with CUDA C++
2024, Feb 9: full-day online course

For an overview of all NHR@FAU courses, visit the course overview page.
