Accelerating CUDA C++ Applications with Multiple GPUs

Modern compute nodes are typically equipped with multiple GPUs, and making full use of that hardware requires distributing work explicitly across accelerators. This NVIDIA DLI course teaches developers how to extend single-GPU CUDA C++ applications to utilize all GPUs within a node. It covers CUDA streams for concurrent execution, peer-to-peer GPU memory access, copy/compute overlap, and performance analysis with NVIDIA Nsight Systems.

Further information about this tutorial can be found on the NVIDIA DLI course page.

This course was put on hold in 2025 and discontinued in 2026. As an alternative, NHR@FAU offers Scaling CUDA-Accelerated Applications, which covers both multi-GPU and multi-node content in a single, streamlined tutorial.

Level: Intermediate

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Knowledge

Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C++ course)
Familiarity with the Linux command line and Makefiles

Technical

An up-to-date browser for accessing the course materials and online environment
A free NVIDIA developer account
A local installation of NVIDIA Nsight Systems is recommended

After completing this course, you will be able to:

Extend single-GPU CUDA C++ applications to utilize multiple GPUs within a node
Distribute and index workloads across multiple accelerators
Use concurrent CUDA streams to overlap memory transfers with GPU computation
Combine copy/compute overlap with multi-GPU execution for maximum throughput
Analyze multi-GPU execution timelines and identify bottlenecks with NVIDIA Nsight Systems

CUDA streams: concurrency rules, stream-based execution, and Nsight Systems profiling
Copy/compute overlap: overlapping data transfers with kernel execution on a single GPU
Multi-GPU programming: device management, workload indexing, and application refactoring
Copy/compute overlap with multiple GPUs: combining both techniques and visualizing performance
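
The workload indexing mentioned in the outline can be illustrated with a small host-side sketch. It is not taken from the course material: the names `Chunk`, `partition`, and `partition_nested` are hypothetical, and the ceiling-division scheme is just one common way to split a one-dimensional array first across GPUs and then across the streams of each GPU. In a real CUDA C++ application, the GPU count would typically come from `cudaGetDeviceCount`, and each chunk would be processed after selecting its device with `cudaSetDevice`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open element range [offset, offset + size) assigned to one worker.
struct Chunk {
    std::size_t offset;
    std::size_t size;
};

// Split `total` elements into `parts` contiguous chunks using ceiling division.
// Trailing chunks are clamped so that none runs past `total`; they may be
// smaller than the others, or empty if `parts` does not divide the work well.
std::vector<Chunk> partition(std::size_t total, std::size_t parts) {
    std::vector<Chunk> chunks;
    if (parts == 0) return chunks;
    const std::size_t chunk_size = (total + parts - 1) / parts;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t lower = std::min(i * chunk_size, total);
        const std::size_t upper = std::min(lower + chunk_size, total);
        chunks.push_back({lower, upper - lower});
    }
    return chunks;
}

// Two-level split: first across GPUs, then across the streams of each GPU.
// result[g][s] holds the global offset and size of the chunk that stream s
// on GPU g would copy and process (assumption: one flat 1D array).
std::vector<std::vector<Chunk>> partition_nested(std::size_t total,
                                                 std::size_t num_gpus,
                                                 std::size_t num_streams) {
    std::vector<std::vector<Chunk>> result;
    for (const Chunk& gpu_chunk : partition(total, num_gpus)) {
        std::vector<Chunk> per_stream = partition(gpu_chunk.size, num_streams);
        // Convert offsets relative to the GPU chunk into global offsets.
        for (Chunk& c : per_stream) c.offset += gpu_chunk.offset;
        result.push_back(per_stream);
    }
    return result;
}
```

With such a layout, each `result[g][s]` entry describes an independent slice of the data, so its host-to-device copy, kernel launch, and device-to-host copy could be issued in their own stream and overlap with the work on other slices. The empty-chunk case has to be handled by the caller, for example by skipping kernel launches for chunks of size zero.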

2025, Sep 15-16: online course over two half days in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
2025, Mar 19: full-day online course in collaboration with NHR@TUD and EUMaster4HPC; part 2 of From Zero to Multi-Node GPU Programming
2025, Feb 6: full-day online course in collaboration with LRZ; part 4 of GPU Programming Workshop
2024, Sep 25: full-day online course in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
2024, Apr 5: full-day online course; part 1 of Multi-GPU Programming with CUDA C++
2024, Feb 8: full-day online course

For an overview of all NHR@FAU courses, visit the course overview page.
