Scaling CUDA C++ Applications to Multiple Nodes

GPU clusters expose the full power of distributed computing, but scaling a CUDA application beyond a single node requires explicit inter-node communication. This NVIDIA DLI course covers multi-node programming techniques for GPU-accelerated applications, with a strong emphasis on the SPMD programming model, CUDA-aware MPI for inter-node data exchange, and NVSHMEM for fine-grained GPU-initiated communication. Canonical patterns such as domain decomposition and halo exchanges are discussed and implemented.
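
As a rough illustration of the domain decomposition mentioned above, the following host-only C++ sketch splits a 1D grid of n points across the ranks of an SPMD program. The names Range and local_range are hypothetical and not taken from the course material. In an actual MPI program, rank and size would typically be obtained from MPI_Comm_rank and MPI_Comm_size.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end) owned by one rank.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n grid points across `size` ranks as evenly as possible.
// The first (n % size) ranks receive one extra point.
// Hypothetical helper for illustration; not part of MPI or NVSHMEM.
inline Range local_range(std::size_t n, int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    const std::size_t base = n / static_cast<std::size_t>(size);
    const std::size_t rem = n % static_cast<std::size_t>(size);
    const std::size_t r = static_cast<std::size_t>(rank);
    // Ranks below `rem` are shifted by one point per preceding rank.
    const std::size_t begin = r * base + (r < rem ? r : rem);
    const std::size_t count = base + (r < rem ? 1 : 0);
    return {begin, begin + count};
}
```

With 10 points and 4 ranks, this scheme assigns the ranges [0, 3), [3, 6), [6, 8), and [8, 10). Every rank runs the same code and differs only in its rank value, which is the essence of the SPMD model.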

Prior attendance of Accelerating CUDA C++ Applications with Multiple GPUs is recommended.

Further information about this tutorial can be found on the NVIDIA DLI course page.

This course was first put on hold and then discontinued during 2025 and 2026. As an alternative, NHR@FAU offers Scaling CUDA-Accelerated Applications, which covers both multi-GPU and multi-node content in a single, streamlined tutorial.

Level: Advanced

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Knowledge

  • Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
  • Prior attendance of Accelerating CUDA C++ Applications with Multiple GPUs is recommended
  • Familiarity with the Linux command line and compilation using Makefiles
  • Prior knowledge about distributed memory programming with MPI is helpful but not strictly required

Technical

  • An up-to-date browser for accessing the course materials and online environment
  • A free NVIDIA developer account
  • A local installation of NVIDIA Nsight Systems is recommended

After completing this course, you will be able to:

  • Apply several multi-GPU communication patterns and reason about their trade-offs
  • Write portable, scalable SPMD code using CUDA-aware MPI for inter-node GPU communication
  • Use NVSHMEM’s symmetric memory model to enable GPU-initiated data transfers
  • Implement domain decomposition and halo exchange patterns for distributed GPU workloads
  • Scale a CUDA C++ application from a single GPU to multiple nodes

The course covers the following topics:

  • Multi-GPU programming paradigms: peer-to-peer communication, SPMD with CUDA-aware MPI
  • Introduction to NVSHMEM: symmetric memory, GPU-initiated transfers, multi-GPU SPMD code
  • Halo exchanges with NVSHMEM: domain decomposition, Jacobi solver, and 1D wave equation
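
To make the halo exchange idea concrete, here is a minimal host-only C++ sketch of a 1D Jacobi iteration on a decomposed domain. It is a stand-in under simplifying assumptions, not the course's implementation: all subdomains live in one process and the exchange is a plain copy, whereas a multi-node code would move the boundary values between GPUs with CUDA-aware MPI or NVSHMEM. The names Subdomain, exchange_halos, and jacobi_step are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One subdomain per simulated rank, with one halo cell on each side.
// Layout: [left halo | interior cells ... | right halo]
using Subdomain = std::vector<double>;

// Copy each subdomain's outermost interior value into the neighbour's halo.
// Assumption: in a distributed code this copy would be a message exchange
// between processes; here it is an ordinary host-side assignment.
void exchange_halos(std::vector<Subdomain>& parts) {
    for (std::size_t r = 0; r + 1 < parts.size(); ++r) {
        Subdomain& left = parts[r];
        Subdomain& right = parts[r + 1];
        assert(left.size() >= 3 && right.size() >= 3);
        left.back() = right[1];                 // neighbour's first interior cell
        right.front() = left[left.size() - 2];  // neighbour's last interior cell
    }
}

// One Jacobi sweep for the 1D Laplace equation on every subdomain.
// The outer halos of the first and last subdomain are never overwritten,
// so they act as fixed boundary values.
void jacobi_step(std::vector<Subdomain>& parts) {
    exchange_halos(parts);
    for (Subdomain& u : parts) {
        Subdomain next = u;  // halos are carried over unchanged
        for (std::size_t i = 1; i + 1 < u.size(); ++i) {
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }
        u = next;
    }
}
```

Starting from two subdomains with two interior cells each, all zeros, and a right boundary value of 1, repeated calls to jacobi_step let the boundary value diffuse across the subdomain border, which only works because the halos are refreshed before every sweep.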

Past course dates:

  • 2025, Sep 17-18: online course over two half days in collaboration with NHR@TUD; part 3 of From Zero to Multi-Node GPU Programming
  • 2025, Mar 26: full-day online course in collaboration with NHR@TUD and EUMaster4HPC; part 3 of From Zero to Multi-Node GPU Programming
  • 2024, Oct 2: full-day online course in collaboration with NHR@TUD; part 3 of From Zero to Multi-Node GPU Programming
  • 2024, Apr 10: full-day online course; part 2 of Multi-GPU Programming with CUDA C++
  • 2024, Feb 9: full-day online course

For an overview of all NHR@FAU courses, visit the course overview page.