Scaling a GPU application beyond a single accelerator requires both intra-node and inter-node parallelism. This course provides a comprehensive treatment of both: part one covers CUDA streams, multi-GPU execution within a node, and direct peer-to-peer GPU memory access; part two extends that foundation across compute nodes using CUDA-aware MPI and NVSHMEM, including 1D domain decomposition and halo-exchange patterns, with copy/compute overlap as a recurring optimization. A single 2D heat-diffusion stencil serves as the running example, refined step by step from a CPU baseline through managed memory and algorithmic partitioning to distributed multi-GPU execution. Each hands-on step is provided at multiple difficulty levels, from guided starting points to full solutions.
This course was developed to replace two formerly separate NVIDIA DLI courses, Accelerating CUDA C++ Applications with Multiple GPUs and Scaling CUDA C++ Applications to Multiple Nodes, which were first put on hold and then discontinued in 2025 and 2026.
Level: Intermediate to advanced
Language: English (German upon request for bespoke courses)
Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).
Prerequisite knowledge
- Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
- Familiarity with the Linux command line as well as compiling and running CUDA applications
Technical requirements
- A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
- A local installation of NVIDIA Nsight Systems
After completing this course, you will be able to:
- Port a CPU application to a single GPU using CUDA managed memory and prefetching
- Use concurrent CUDA streams to overlap memory transfers with GPU computation
- Scale CUDA C++ workloads across multiple GPUs within a single compute node
- Enable and exploit direct peer-to-peer GPU memory access for efficient intra-node communication
- Write portable, scalable SPMD code using CUDA-aware MPI with inter-node GPU communication
- Apply NVSHMEM for GPU-initiated data transfers using the symmetric memory model
- Implement domain decomposition and halo exchange patterns for distributed GPU workloads
- Profile multi-GPU execution and identify performance bottlenecks with NVIDIA Nsight Systems
Course content
- Motivation and the running example: a 2D heat-diffusion stencil scaled throughout the course
- CPU baseline and single-GPU port: managed memory, 2D execution configuration, and prefetching
- Algorithmic work partitioning: decomposing the domain into patches for multi-GPU execution
- CUDA streams: concurrent per-patch execution and Nsight Systems timeline analysis
- Multi-GPU within a node: device management, per-patch allocations, and halo exchange
- Direct inter-GPU communication: unified virtual addressing and peer-to-peer transfers
- Overlapping communication and computation with multiple streams
- Multi-node parallelism with MPI: rank-to-GPU mapping, CUDA-aware MPI, and GPUDirect RDMA
- NVSHMEM: the symmetric-memory model and GPU-initiated one-sided communication
- Outlook: hierarchical reductions (CUB, NCCL), multi-dimensional domain decomposition, and parallel I/O
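To make the partitioning and halo-exchange topics above more concrete, here is a minimal CPU-only C++ sketch. It is not taken from the course material. It assumes a simple four-neighbour averaging update with fixed boundary values, which may differ from the stencil used in the course, and all names (`Patch`, `jacobi_rows`, `run_partitioned`, and so on) are hypothetical. The interior rows are split into contiguous patches, each with one halo row above and below. Before every sweep, neighbouring patches copy their edge rows into each other's halos. In a multi-GPU version these copies would become peer-to-peer, MPI, or NVSHMEM transfers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Row-major grid with nx columns. The outermost rows and columns hold fixed
// boundary values and are never updated.
using Grid = std::vector<double>;

// Assumed update rule: each interior cell becomes the average of its four
// neighbours. Updates rows [row_begin, row_end) of `out` from `in`.
void jacobi_rows(const Grid& in, Grid& out, int nx, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 1; x < nx - 1; ++x)
            out[y * nx + x] = 0.25 * (in[(y - 1) * nx + x] + in[(y + 1) * nx + x] +
                                      in[y * nx + x - 1] + in[y * nx + x + 1]);
}

// Initial condition for the sketch: hot top edge (1.0), everything else 0.0.
Grid make_initial(int nx, int ny) {
    Grid g(static_cast<std::size_t>(nx) * ny, 0.0);
    std::fill_n(g.begin(), nx, 1.0);
    return g;
}

// Reference: sweep the whole domain as a single patch.
Grid run_reference(Grid cur, int nx, int ny, int steps) {
    Grid next = cur;
    for (int s = 0; s < steps; ++s) {
        jacobi_rows(cur, next, nx, 1, ny - 1);
        std::swap(cur, next);
    }
    return cur;
}

// One patch of the 1D (row-wise) decomposition. Local row 0 and local row
// rows + 1 are halo rows; local rows 1..rows are owned by the patch.
struct Patch {
    int first_row = 0;  // global index of the first owned row
    int rows = 0;       // number of owned rows
    Grid cur, next;     // (rows + 2) * nx values each
};

Grid run_partitioned(const Grid& initial, int nx, int ny, int parts, int steps) {
    const int interior = ny - 2;
    assert(parts > 0 && parts <= interior);
    std::vector<Patch> patches(parts);
    int row = 1;
    for (int p = 0; p < parts; ++p) {
        Patch& pt = patches[p];
        pt.first_row = row;
        pt.rows = interior / parts + (p < interior % parts ? 1 : 0);
        // Copy the owned rows plus one row above and one row below.
        pt.cur.assign(initial.begin() + (row - 1) * nx,
                      initial.begin() + (row + pt.rows + 1) * nx);
        pt.next = pt.cur;
        row += pt.rows;
    }
    for (int s = 0; s < steps; ++s) {
        // Halo exchange between vertically adjacent patches.
        for (int p = 0; p + 1 < parts; ++p) {
            Patch& up = patches[p];
            Patch& down = patches[p + 1];
            // Last owned row of `up` -> top halo of `down`.
            std::copy_n(up.cur.begin() + up.rows * nx, nx, down.cur.begin());
            // First owned row of `down` -> bottom halo of `up`.
            std::copy_n(down.cur.begin() + nx, nx,
                        up.cur.begin() + (up.rows + 1) * nx);
        }
        // Each patch sweeps its owned rows independently.
        for (Patch& pt : patches) {
            jacobi_rows(pt.cur, pt.next, nx, 1, pt.rows + 1);
            std::swap(pt.cur, pt.next);
        }
    }
    // Gather the owned rows back into one global grid.
    Grid result = initial;
    for (const Patch& pt : patches)
        std::copy_n(pt.cur.begin() + nx, pt.rows * nx,
                    result.begin() + pt.first_row * nx);
    return result;
}

// Largest element-wise difference between two grids of equal size.
double max_abs_diff(const Grid& a, const Grid& b) {
    assert(a.size() == b.size());
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d = std::max(d, std::fabs(a[i] - b[i]));
    return d;
}
```

Calling `run_reference` and `run_partitioned` on the same grid from `make_initial` and comparing the results with `max_abs_diff` is a quick way to check that the halo exchange keeps the partitioned run consistent with the single-domain run. The patch sweeps are independent within a step, which is what makes them candidates for separate streams, GPUs, or ranks.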
Dates
- 2026, Sep 7-9: three-day online course in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
- 2026, Mar 10-11: two-day online course in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
For an overview of all NHR@FAU courses, visit the course overview page.