Scaling CUDA-Accelerated Applications

Scaling a GPU application beyond a single accelerator requires both intra-node and inter-node parallelism. This course provides a comprehensive treatment of both: part one covers CUDA streams, multi-GPU execution within a node, and direct peer-to-peer GPU memory access; part two extends that foundation across compute nodes using CUDA-aware MPI and NVSHMEM, including 1D domain decomposition and halo-exchange patterns, with copy/compute overlap as a recurring optimization. A single 2D heat-diffusion stencil serves as the running example, refined step by step from a CPU baseline through managed memory and algorithmic partitioning to distributed multi-GPU execution. Each hands-on step is provided at multiple difficulty levels, from guided starting points to full solutions.

This course was developed to replace two formerly separate NVIDIA DLI courses, Accelerating CUDA C++ Applications with Multiple GPUs and Scaling CUDA C++ Applications to Multiple Nodes, which were first put on hold and then discontinued in 2025 and 2026.

Level: Intermediate to advanced

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Prerequisite knowledge

  • Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C/C++ course)
  • Familiarity with the Linux command line as well as compiling and running CUDA applications

Technical requirements

  • A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
  • A local installation of NVIDIA Nsight Systems

After completing this course, you will be able to:

  • Port a CPU application to a single GPU using CUDA managed memory and prefetching
  • Use concurrent CUDA streams to overlap memory transfers with GPU computation
  • Scale CUDA C++ workloads across multiple GPUs within a single compute node
  • Enable and exploit direct peer-to-peer GPU memory access for efficient intra-node communication
  • Write portable, scalable SPMD code using CUDA-aware MPI with inter-node GPU communication
  • Apply NVSHMEM for GPU-initiated data transfers using the symmetric memory model
  • Implement domain decomposition and halo exchange patterns for distributed GPU workloads
  • Profile multi-GPU execution and identify performance bottlenecks with NVIDIA Nsight Systems

Course content

  • Motivation and the running example: a 2D heat-diffusion stencil scaled throughout the course
  • CPU baseline and single-GPU port: managed memory, 2D execution configuration, and prefetching
  • Algorithmic work partitioning: decomposing the domain into patches for multi-GPU execution
  • CUDA streams: concurrent per-patch execution and Nsight Systems timeline analysis
  • Multi-GPU within a node: device management, per-patch allocations, and halo exchange
  • Direct inter-GPU communication: unified virtual addressing and peer-to-peer transfers
  • Overlapping communication and computation with multiple streams
  • Multi-node parallelism with MPI: rank-to-GPU mapping, CUDA-aware MPI, and GPUDirect RDMA
  • NVSHMEM: the symmetric-memory model and GPU-initiated one-sided communication
  • Outlook: hierarchical reductions (CUB, NCCL), multi-dimensional domain decomposition, and parallel I/O
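To give a flavor of the running example, the following is a minimal host-only C++ sketch of a 2D heat-diffusion stencil with a 1D (row-wise) domain decomposition into two patches and a halo exchange after every time step. It is not the course material itself: the names (Grid, heat_step, run_decomposed, and so on), the explicit update formula with a coefficient alpha, and the fixed boundary values are assumptions made for illustration. In a multi-GPU setting, each patch would live on its own device and the two copy_row calls would become peer-to-peer copies, MPI messages, or NVSHMEM puts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Row-major 2D grid (hypothetical helper type for this sketch).
struct Grid {
    std::size_t rows, cols;
    std::vector<double> cells;
    Grid(std::size_t r, std::size_t c, double v = 0.0) : rows(r), cols(c), cells(r * c, v) {}
    double& at(std::size_t i, std::size_t j) { return cells[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return cells[i * cols + j]; }
};

// One explicit time step on all interior cells (assumed 5-point stencil).
// The first/last row and column are never written: they hold either fixed
// boundary values or halo data owned by a neighbouring patch.
void heat_step(const Grid& in, Grid& out, double alpha) {
    for (std::size_t i = 1; i + 1 < in.rows; ++i)
        for (std::size_t j = 1; j + 1 < in.cols; ++j)
            out.at(i, j) = in.at(i, j) + alpha * (in.at(i - 1, j) + in.at(i + 1, j) +
                                                  in.at(i, j - 1) + in.at(i, j + 1) -
                                                  4.0 * in.at(i, j));
}

// Copy one full row from src into dst (stands in for a halo transfer).
void copy_row(const Grid& src, std::size_t src_row, Grid& dst, std::size_t dst_row) {
    std::copy_n(src.cells.data() + src_row * src.cols, src.cols,
                dst.cells.data() + dst_row * dst.cols);
}

// Initial condition: cold plate with a hot top edge held at 1.0.
Grid make_initial(std::size_t rows, std::size_t cols) {
    Grid g(rows, cols, 0.0);
    for (std::size_t j = 0; j < cols; ++j) g.at(0, j) = 1.0;
    return g;
}

// Single-domain reference with double buffering.
Grid run_reference(Grid u, int steps, double alpha) {
    Grid next = u;
    for (int s = 0; s < steps; ++s) {
        heat_step(u, next, alpha);
        std::swap(u, next);
    }
    return u;
}

// Same computation on two row patches with one halo row each.
Grid run_decomposed(const Grid& init, int steps, double alpha) {
    assert(init.rows >= 4);
    const std::size_t h = init.rows / 2;          // top patch owns global rows [0, h)
    Grid top(h + 1, init.cols);                   // local row h is the halo (global row h)
    Grid bot(init.rows - h + 1, init.cols);       // local row 0 is the halo (global row h-1)
    for (std::size_t i = 0; i < top.rows; ++i) copy_row(init, i, top, i);
    for (std::size_t i = 0; i < bot.rows; ++i) copy_row(init, h - 1 + i, bot, i);
    Grid top_next = top, bot_next = bot;

    for (int s = 0; s < steps; ++s) {
        heat_step(top, top_next, alpha);          // independent per-patch work
        heat_step(bot, bot_next, alpha);
        std::swap(top, top_next);
        std::swap(bot, bot_next);
        copy_row(bot, 1, top, h);                 // halo exchange: bottom -> top
        copy_row(top, h - 1, bot, 0);             // halo exchange: top -> bottom
    }

    Grid result(init.rows, init.cols);            // gather owned rows only
    for (std::size_t i = 0; i < h; ++i) copy_row(top, i, result, i);
    for (std::size_t i = h; i < init.rows; ++i) copy_row(bot, i - h + 1, result, i);
    return result;
}
```

Calling run_reference and run_decomposed on the same grid from make_initial, with the same step count and alpha (for example 0.25), should give matching results, which is a useful correctness check before moving the patches onto separate GPUs.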

For an overview of all NHR@FAU courses, visit the course overview page.