Modern compute nodes are typically equipped with multiple GPUs, and making full use of that hardware requires distributing work explicitly across accelerators. This NVIDIA DLI course teaches developers how to extend single-GPU CUDA C++ applications to utilize all GPUs within a node. It covers CUDA streams for concurrent execution, peer-to-peer GPU memory access, copy/compute overlap, and performance analysis with NVIDIA Nsight Systems.
Further information about this tutorial can be found on the NVIDIA DLI course page.
This course was put on hold in 2025 and discontinued in 2026. As an alternative, NHR@FAU offers Scaling CUDA-Accelerated Applications, which covers both multi-GPU and multi-node content in a single, streamlined tutorial.
Level: Intermediate
Language: English (German upon request for bespoke courses)
Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).
Knowledge
- Experience with CUDA C++ GPU programming, including memory allocation, kernel launches, grid-stride loops, and error handling (equivalent to the Introduction to CUDA C++ course)
- Familiarity with the Linux command line and Makefiles
Technical
- An up-to-date browser for accessing the course materials and online environment
- A free NVIDIA developer account
- A local installation of NVIDIA Nsight Systems is recommended
After completing this course, you will be able to:
- Extend single-GPU CUDA C++ applications to utilize multiple GPUs within a node
- Distribute and index workloads across multiple accelerators
- Use concurrent CUDA streams to overlap memory transfers with GPU computation
- Combine copy/compute overlap with multi-GPU execution for maximum throughput
- Analyze multi-GPU execution timelines and identify bottlenecks with NVIDIA Nsight Systems
- CUDA streams: concurrency rules, stream-based execution, and Nsight Systems profiling
- Copy/compute overlap: overlapping data transfers with kernel execution on a single GPU
- Multi-GPU programming: device management, workload indexing, and application refactoring
- Copy/compute overlap with multiple GPUs: combining both techniques and visualizing performance
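To give a flavor of the workload indexing topic listed above, the following is a minimal host-side sketch, not taken from the course materials. It assumes a 1D array that is split into contiguous, near-equal chunks, one per GPU; the names `Chunk` and `partition_work` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: the slice of a 1D array assigned to one GPU.
struct Chunk {
    std::size_t offset;  // index of the first element owned by this device
    std::size_t size;    // number of elements owned by this device
};

// Split n elements into num_devices contiguous chunks of near-equal size.
// The first (n % num_devices) devices each receive one extra element.
std::vector<Chunk> partition_work(std::size_t n, int num_devices) {
    assert(num_devices > 0);
    const std::size_t d = static_cast<std::size_t>(num_devices);
    const std::size_t base = n / d;
    const std::size_t rem = n % d;

    std::vector<Chunk> chunks;
    chunks.reserve(d);
    std::size_t offset = 0;
    for (std::size_t i = 0; i < d; ++i) {
        const std::size_t size = base + (i < rem ? 1 : 0);
        chunks.push_back({offset, size});
        offset += size;
    }
    return chunks;
}
```

In a CUDA application, `num_devices` would typically come from `cudaGetDeviceCount`, and chunk `i` would be processed after selecting the device with `cudaSetDevice(i)`, for example by copying `size` elements starting at `offset` to that device and launching the kernel on them.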
- 2025, Sep 15-16: online course over two half days in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
- 2025, Mar 19: full-day online course in collaboration with NHR@TUD and EUMaster4HPC; part 2 of From Zero to Multi-Node GPU Programming
- 2025, Feb 6: full-day online course in collaboration with LRZ; part 4 of GPU Programming Workshop
- 2024, Sep 25: full-day online course in collaboration with NHR@TUD; part 2 of From Zero to Multi-Node GPU Programming
- 2024, Apr 5: full-day online course; part 1 of Multi-GPU Programming with CUDA C++
- 2024, Feb 8: full-day online course
For an overview of all NHR@FAU courses, visit the course overview page.