
Accelerating CUDA C++ Applications with Multiple GPUs

Course Description

This advanced course explores techniques for extending single-GPU applications to utilize multiple GPUs within a single compute node. It focuses on distributing workloads across multiple accelerators, optimizing performance through overlapping computation and data transfers, and using NVIDIA Nsight Systems to analyze execution behavior and identify performance bottlenecks.

Additional information is available on the NVIDIA DLI course homepage.

Learning Objectives

At the conclusion of the workshop, you will be able to:

  • Use concurrent CUDA streams to overlap memory transfers with GPU computation
  • Scale workloads across available GPUs on a single node
  • Combine memory copy/compute overlap with multiple GPUs
  • Utilize the NVIDIA Nsight Systems timeline to identify improvement opportunities and assess the impact of the techniques covered in the workshop

Course Structure

Introduction to CUDA Streams

  • Get familiar with your GPU-accelerated interactive JupyterLab environment.
  • Orient yourself with a single-GPU CUDA C++ application that will be the starting point for the course.
  • Observe the current performance of the single-GPU CUDA C++ application using Nsight Systems.
  • Learn the rules that govern concurrent CUDA stream behavior.
  • Use multiple CUDA streams to perform concurrent host-to-device and device-to-host memory transfers.
  • Utilize multiple CUDA streams for launching GPU kernels.
  • Observe multiple streams in the Nsight Systems Visual Profiler timeline view.
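
The stream concepts listed above can be illustrated with a minimal sketch (not the course's actual exercise code; kernel and sizes are placeholders): kernels launched into different non-default streams may execute concurrently, while work within a single stream runs in issue order.

```cuda
#include <cuda_runtime.h>

__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 20, NUM_STREAMS = 4;
    cudaStream_t streams[NUM_STREAMS];
    float *d[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d[i], N * sizeof(float));
    }

    // Kernels in different non-default streams may overlap on the device;
    // kernels issued into the same stream are serialized.
    for (int i = 0; i < NUM_STREAMS; ++i)
        busy<<<(N + 255) / 256, 256, 0, streams[i]>>>(d[i], N);

    cudaDeviceSynchronize();   // wait for all streams to drain

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d[i]);
    }
    return 0;
}
```

Running this under Nsight Systems shows the four kernel launches as potentially overlapping rows, one per stream, in the timeline view.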

Multiple GPUs with CUDA C++

  • Learn the key concepts for effectively using multiple GPUs on a single node with CUDA C++.
  • Explore robust indexing strategies for the flexible use of multiple GPUs in applications.
  • Refactor the single-GPU CUDA C++ application to utilize multiple GPUs.
  • See multiple-GPU utilization in the Nsight Systems Visual Profiler timeline.
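
A minimal sketch of the multi-GPU pattern discussed above (the kernel and problem size are illustrative, not the course's code): `cudaGetDeviceCount` discovers the available GPUs, and `cudaSetDevice` routes subsequent allocations and launches to a specific device, with the data range partitioned by a robust ceiling-division indexing scheme.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

__global__ void init(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 1.0f;
}

int main() {
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);

    const int N = 1 << 24;
    const int chunk = (N + numGpus - 1) / numGpus;  // ceiling division
    std::vector<float*> d(numGpus);

    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);                        // subsequent calls target GPU g
        int lower = g * chunk;
        int width = std::min(chunk, N - lower);  // last GPU may get a short chunk
        cudaMalloc(&d[g], width * sizeof(float));
        init<<<(width + 255) / 256, 256>>>(d[g], width);  // async w.r.t. the host
    }

    // Synchronize and clean up each device.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(d[g]);
    }
    return 0;
}
```

Because kernel launches are asynchronous with respect to the host, the loop issues work to every GPU before any of it completes, so all devices compute concurrently.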

Copy/Compute Overlap with CUDA Streams

  • Learn the key concepts for effectively performing copy/compute overlap.
  • Explore robust indexing strategies for the flexible use of copy/compute overlap in applications.
  • Refactor the single-GPU CUDA C++ application to perform copy/compute overlap.
  • See copy/compute overlap in the Nsight Systems Visual Profiler timeline.
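
The overlap technique can be sketched as follows (a hypothetical example, not the course's exercise code): the data is split into chunks, and each chunk's host-to-device copy, kernel, and device-to-host copy are issued into its own stream, so the copy engines work on one chunk while the GPU computes another. Pinned host memory (`cudaMallocHost`) is required for the copies to be truly asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    // Grid-stride loop: correct for any grid size relative to n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, STREAMS = 4, CHUNK = N / STREAMS;  // N divisible by STREAMS
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned host memory for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[STREAMS];
    for (int i = 0; i < STREAMS; ++i) cudaStreamCreate(&s[i]);

    // Pipeline: while one chunk computes, another chunk's H2D copy and a
    // third chunk's D2H copy can run concurrently on the copy engines.
    for (int i = 0; i < STREAMS; ++i) {
        float *hp = h + i * CHUNK, *dp = d + i * CHUNK;
        cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        scale<<<64, 256, 0, s[i]>>>(dp, CHUNK);
        cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < STREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```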

Copy/Compute Overlap with Multiple GPUs

  • Learn the key concepts for effectively performing copy/compute overlap on multiple GPUs.
  • Explore robust indexing strategies for the flexible use of copy/compute overlap on multiple GPUs.
  • Refactor the single-GPU CUDA C++ application to perform copy/compute overlap on multiple GPUs.
  • Observe performance benefits for copy/compute overlap on multiple GPUs.
  • See copy/compute overlap on multiple GPUs in the Nsight Systems Visual Profiler timeline.
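
Combining the two techniques nests the chunking: each GPU receives a slice of the data, and that slice is further divided among several streams per GPU. A minimal sketch under simplifying assumptions (sizes divide evenly; the kernel is a placeholder, not the course's code):

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main() {
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);
    const int STREAMS = 4;                   // streams (chunks) per GPU
    const int N = 1 << 24;
    const int perGpu = N / numGpus;          // assume evenly divisible for brevity
    const int perChunk = perGpu / STREAMS;

    float *h;
    cudaMallocHost(&h, N * sizeof(float));   // pinned memory for async copies

    std::vector<float*> d(numGpus);
    std::vector<std::vector<cudaStream_t>> st(numGpus,
        std::vector<cudaStream_t>(STREAMS));

    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], perGpu * sizeof(float));
        for (int s = 0; s < STREAMS; ++s) cudaStreamCreate(&st[g][s]);
    }

    // Each chunk flows through its own stream: H2D copy, kernel, D2H copy.
    // Chunks in different streams, and on different GPUs, overlap.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        for (int s = 0; s < STREAMS; ++s) {
            float *hp = h + g * perGpu + s * perChunk;
            float *dp = d[g] + s * perChunk;
            size_t bytes = perChunk * sizeof(float);
            cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, st[g][s]);
            scale<<<64, 256, 0, st[g][s]>>>(dp, perChunk);
            cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, st[g][s]);
        }
    }

    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        for (int s = 0; s < STREAMS; ++s) cudaStreamDestroy(st[g][s]);
        cudaFree(d[g]);
    }
    cudaFreeHost(h);
    return 0;
}
```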

Certification

Upon successfully completing the course assessments, participants will receive an NVIDIA DLI Certificate, recognizing their subject matter expertise and supporting their professional career growth.

Prerequisites

A free NVIDIA developer account is required to access the course material. Please register before the training at https://learn.nvidia.com/join.

Participants should additionally meet the following requirements:

  • Successful completion of Fundamentals of Accelerated Computing with CUDA C/C++, or equivalent experience in implementing CUDA C/C++ applications, including:
    • Memory allocation
    • Host-to-device and device-to-host memory transfers
    • Kernel launches
    • Grid-stride loops
    • CUDA error handling
  • Familiarity with the Linux command line
  • Experience using Makefiles
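
For orientation, the prerequisite techniques (grid-stride loops and CUDA error handling) look roughly like this minimal sketch; the `CUDA_CHECK` macro name and the kernel are illustrative, not taken from the prerequisite course's material:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Error-checking macro in the common CUDA C/C++ style.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            return 1;                                               \
        }                                                           \
    } while (0)

__global__ void addOne(int *x, int n) {
    // Grid-stride loop: covers all n elements for any launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] += 1;
}

int main() {
    const int N = 1000;
    int *d;
    CUDA_CHECK(cudaMalloc(&d, N * sizeof(int)));
    CUDA_CHECK(cudaMemset(d, 0, N * sizeof(int)));
    addOne<<<8, 128>>>(d, N);
    CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous runtime errors
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```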

Upcoming Iterations and Additional Courses

You can find dates and registration links for this and other upcoming NHR@FAU courses at https://hpc.fau.de/teaching/tutorials-and-courses/.

Erlangen National High Performance Computing Center (NHR@FAU)
Martensstraße 1
91058 Erlangen
Germany