Scaling CUDA-Accelerated Applications


Course Description

This advanced course explores techniques for extending single-GPU applications to use multiple GPUs, both within a single compute node and across multiple GPU-equipped nodes. It covers work partitioning approaches, strategies for sharing and exchanging data, and the overlapping of computation with data transfers. Key technologies covered are CUDA streams, CUDA device management, GPUDirect P2P, (CUDA-aware) MPI, and NVSHMEM. All modules are supported by hands-on exercises at multiple levels of difficulty, as well as by profiling with Nsight Systems to explore the parallel behavior of the implemented applications.
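
As an illustration of the stream techniques named above, here is a minimal sketch (not course material) of overlapping host-device transfers with kernel execution using concurrent CUDA streams. The kernel name, array size, and chunk count are illustrative assumptions.

  // Chunked copy/compute pipeline: each chunk's H2D copy, kernel, and
  // D2H copy are ordered within its own stream but may overlap with
  // other chunks' operations.
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale(float *x, int n, float a) {
      // grid-stride loop over one chunk
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
          x[i] *= a;
  }

  int main() {
      const int N = 1 << 24, CHUNKS = 4, C = N / CHUNKS;
      float *h, *d;
      cudaMallocHost(&h, N * sizeof(float)); // pinned memory enables async copies
      cudaMalloc(&d, N * sizeof(float));
      for (int i = 0; i < N; ++i) h[i] = 1.0f;

      cudaStream_t s[CHUNKS];
      for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);

      for (int c = 0; c < CHUNKS; ++c) {
          float *hc = h + c * C, *dc = d + c * C;
          cudaMemcpyAsync(dc, hc, C * sizeof(float), cudaMemcpyHostToDevice, s[c]);
          scale<<<(C + 255) / 256, 256, 0, s[c]>>>(dc, C, 2.0f);
          cudaMemcpyAsync(hc, dc, C * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
      }
      cudaDeviceSynchronize();

      printf("h[0] = %.1f (expect 2.0)\n", h[0]);
      for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
      cudaFreeHost(h); cudaFree(d);
      return 0;
  }

On an Nsight Systems timeline, the streams' copies and kernels appear staggered rather than strictly serialized, which is the kind of parallel behavior the profiling exercises examine.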

Learning Objectives

At the conclusion of the workshop, you will be able to:

  • Use concurrent CUDA streams to overlap
    • memory transfers with GPU computation,
    • GPU-to-GPU communication with GPU computation, and
    • MPI communication with GPU computation.
  • Use several methods for writing multi-GPU CUDA C++ applications.
  • Scale workloads across available GPUs on a single node (see the sketch after this list).
  • Scale workloads across available GPUs on multiple nodes.
  • Utilize the NVIDIA Nsight Systems timeline to identify improvement opportunities and assess the impact of the techniques covered in the workshop.
  • Write portable, scalable CUDA code with the single-program multiple-data (SPMD) paradigm using CUDA-aware MPI and NVSHMEM.
  • Apply common multi-GPU coding paradigms like domain decomposition and halo exchanges.
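
The sketch referenced in the single-node objective above distributes one array across all GPUs visible on a node, one slice per device, using cudaSetDevice. The kernel name, array size, and the 16-device cap are illustrative assumptions.

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void initKernel(float *x, int n, float v) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
          x[i] = v;
  }

  int main() {
      int ngpus = 0;
      cudaGetDeviceCount(&ngpus);
      if (ngpus == 0) { printf("no CUDA devices found\n"); return 0; }

      const int N = 1 << 24;
      const int slice = (N + ngpus - 1) / ngpus; // elements per GPU (last slice padded)

      float *d[16]; // assumes at most 16 GPUs per node
      for (int g = 0; g < ngpus; ++g) {
          cudaSetDevice(g); // subsequent runtime calls target GPU g
          cudaMalloc(&d[g], slice * sizeof(float));
          // kernel launches are asynchronous, so all GPUs compute concurrently
          initKernel<<<(slice + 255) / 256, 256>>>(d[g], slice, (float)g);
      }
      for (int g = 0; g < ngpus; ++g) { // wait for every device to finish
          cudaSetDevice(g);
          cudaDeviceSynchronize();
      }
      printf("initialized %d elements across %d GPU(s)\n", N, ngpus);
      for (int g = 0; g < ngpus; ++g) { cudaSetDevice(g); cudaFree(d[g]); }
      return 0;
  }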

Course Structure

  • CPU Baseline and first GPU port using CUDA managed memory.
  • Work partitioning on an algorithmic level.
  • CUDA streams for parallel operations.
  • Multi-GPU execution and work distribution.
  • Halo exchanges for improved data locality.
  • P2P communication for optimized transfers between GPUs.
  • Overlapping communication and computation.
  • MPI ‘hello world’ and MPI through the lens of a CUDA programmer.
  • MPI parallelization for multi-node parallelism (see the halo-exchange sketch after this list).
  • Overlapping MPI exchanges and computation.
  • NVSHMEM as an alternative.
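
As a pointer to what the multi-node modules build toward, here is a minimal sketch, assuming a CUDA-aware MPI installation, of the halo-exchange step referenced above: each rank owns a 1-D slab with one halo cell on each side and passes device pointers directly to MPI. The domain size and rank-to-GPU mapping are illustrative.

  #include <cstdio>
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int ngpus;
      cudaGetDeviceCount(&ngpus);
      cudaSetDevice(rank % ngpus); // typical one-GPU-per-rank mapping

      const int NX = 1024; // interior cells; layout: [halo_lo | interior | halo_hi]
      float *d;
      cudaMalloc(&d, (NX + 2) * sizeof(float));
      cudaMemset(d, 0, (NX + 2) * sizeof(float));

      int lo = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
      int hi = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

      // With CUDA-aware MPI, device pointers go straight into MPI calls.
      MPI_Sendrecv(d + 1,      1, MPI_FLOAT, lo, 0,  // send first interior cell down
                   d + NX + 1, 1, MPI_FLOAT, hi, 0,  // receive upper halo
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Sendrecv(d + NX,     1, MPI_FLOAT, hi, 1,  // send last interior cell up
                   d,          1, MPI_FLOAT, lo, 1,  // receive lower halo
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      if (rank == 0) printf("halo exchange complete on %d rank(s)\n", size);
      cudaFree(d);
      MPI_Finalize();
      return 0;
  }

Without a CUDA-aware MPI, the same exchange would have to stage the halo cells through host buffers with explicit cudaMemcpy calls.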

Certification

A digital certificate of attendance will be awarded to all participants who attend the majority of the course.

Prerequisites

Participants should meet the following requirements:

  • Successful completion of Fundamentals of Accelerated Computing with CUDA C/C++, or equivalent experience in implementing CUDA C/C++ applications (a short sketch after this list illustrates the expected level), including:
    • Memory allocation
    • Host-to-device and device-to-host memory transfers
    • Kernel launches
    • Grid-stride loops
    • CUDA error handling
  • Familiarity with the Linux command line
  • Experience compiling and executing CUDA applications
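
For orientation, the sketch below (with illustrative names and sizes) exercises exactly the prerequisite skills listed above: memory allocation, host-device transfers, a grid-stride kernel launch, and CUDA error handling.

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  #define CUDA_CHECK(call)                                          \
      do {                                                          \
          cudaError_t err = (call);                                 \
          if (err != cudaSuccess) {                                 \
              fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                      cudaGetErrorString(err), __FILE__, __LINE__); \
              exit(1);                                              \
          }                                                         \
      } while (0)

  __global__ void addOne(float *x, int n) {
      // grid-stride loop: correct for any grid/problem size combination
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
          x[i] += 1.0f;
  }

  int main() {
      const int N = 1 << 20;
      float *h = (float *)malloc(N * sizeof(float));
      for (int i = 0; i < N; ++i) h[i] = 0.0f;

      float *d;
      CUDA_CHECK(cudaMalloc(&d, N * sizeof(float)));                           // allocation
      CUDA_CHECK(cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice)); // H2D
      addOne<<<256, 256>>>(d, N);                                              // launch
      CUDA_CHECK(cudaGetLastError());                                          // launch errors
      CUDA_CHECK(cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost)); // D2H
      printf("h[0] = %.1f (expect 1.0)\n", h[0]);
      CUDA_CHECK(cudaFree(d));
      free(h);
      return 0;
  }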

Upcoming Iterations and Additional Courses

You can find dates and registration links for this and other upcoming NHR@FAU courses at https://go-nhr.de/trainings.
