• Skip navigation
  • Skip to navigation
  • Skip to the bottom
Simulate organization breadcrumb open Simulate organization breadcrumb close
NHR@FAU
  • FAUTo the central FAU website
Suche öffnen
  • RRZE
  • NHR-Verein e.V.
  • Gauß-Allianz

NHR@FAU

Navigation Navigation close
  • News
  • About us
    • People
    • Funding
    • BayernKI
    • NHR Compute Time Projects
    • Tier3 User Project Reports
    • Support Success Stories
    • Annual Reports
    • NHR@FAU Newsletters
    • Previous Events
    • Jobs
    Portal About us
  • Research
    • Research Focus
    • Publications, Posters & Talks
    • Performance Tools and Libraries
    • NHR PerfLab Seminar
    • Projects
    • Workshops
    • Awards
    Portal Research
  • Teaching & Training
    • Lectures & Seminars
    • Tutorials & Courses
    • Monthly HPC Café and Beginner’s Introduction
    • Theses
    • Student Cluster Competition
    Portal Teaching & Training
  • Systems & Services
    • Systems, Documentation & Instructions
    • Support & Contact
    • HPC User Training
    • HPC System Utilization
    Portal Systems & Services
  • FAQ

NHR@FAU

  1. Home
  2. Teaching & Training
  3. Tutorials & Courses
  4. Scaling CUDA C++ Applications to Multiple Nodes

Scaling CUDA C++ Applications to Multiple Nodes

In page navigation: Teaching & Training
  • Lectures & Seminars
  • Tutorials & Courses
    • Accelerating CUDA C++ Applications with Multiple GPUs
    • C++ for Beginners
    • Core-Level Performance Engineering
    • Fundamentals of Accelerated Computing with CUDA C/C++
    • Fundamentals of Accelerated Computing with CUDA Python
    • Fundamentals of Accelerated Computing with Modern CUDA C++
    • Fundamentals of Accelerated Computing with OpenACC
    • GPU Performance Engineering
    • Hybrid Programming in HPC - MPI+X
    • Introduction to OpenMP
    • Introduction to the LIKWID Tool Suite
    • Modern C++ Software Design
    • Node-Level Performance Engineering
    • Parallel Programming of High-Performance Systems (PPHPS)
    • Performance Engineering for Linear Solvers
    • Scaling CUDA C++ Applications to Multiple Nodes
  • Monthly HPC Café and Beginner's Introduction
  • Theses
  • Student Cluster Competition

Scaling CUDA C++ Applications to Multiple Nodes

Course Description

This advanced course covers multi-node programming techniques for GPU-accelerated applications and examines advanced examples, with a special emphasis on using MPI and NVSHMEM to distribute workloads efficiently.

Additional information is available on the Nvidia DLI course homepage.

Learning Objectives

At the conclusion of the workshop, you will be able to:

  • Use several methods for writing multi-GPU CUDA C++ applications,
  • Use a variety of multi-GPU communication patterns and understand their tradeoffs,
  • Write portable, scalable CUDA code with the single-program multiple-data (SPMD) paradigm using CUDA-aware MPI and NVSHMEM,
  • Improve multi-GPU SPMD code with NVSHMEM’s symmetric memory model and its ability to perform GPU-initiated data transfers, and
  • Get practice with common multi-GPU coding paradigms like domain decomposition and halo exchanges.

Course Structure

Multi-GPU Programming Paradigms

  • Survey multiple techniques for programming CUDA C++ applications for multiple GPUs using a Monte-Carlo approximation of Pi CUDA C++ program.
  • Use CUDA to utilize multiple GPUs.
  • Learn how to enable and use direct peer-to-peer memory communication.
  • Write an SPMD version with CUDA-aware MPI.

Introduction to NVSHMEM

  • Learn how to write code with NVSHMEM and understand its symmetric memory model.
  • Use NVSHMEM to write SPMD code for multiple GPUs.
  • Utilize symmetric memory to let all GPUs access data on other GPUs.
  • Make GPU-initiated memory transfers.

Halo Exchanges with NVSHMEM

  • Practice common coding motifs like halo exchanges and domain decomposition using NVSHMEM, and work on the assessment.
  • Write an NVSHMEM implementation of a Laplace equation Jacobi solver.
  • Refactor a single GPU 1D wave equation solver with NVSHMEM.
  • Complete the assessment and earn a certificate.

Certification

Upon successfully completing the course assessments, participants will receive an NVIDIA DLI Certificate, recognizing their subject matter expertise and supporting their professional career growth.

Prerequisites

A free NVIDIA developer account is required to access the course material. Please register before the training at https://learn.nvidia.com/join.

Participants should additionally meet the following requirements:

  • Successful completion of Fundamentals of Accelerated Computing with CUDA C/C++, or equivalent experience in implementing CUDA C/C++ applications, including:
    • Memory allocation
    • Host-to-device and device-to-host memory transfers
    • Kernel launches
    • Grid-stride loops
    • CUDA error handling
  • [Optional but recommended]: Prior attendance to Accelerating CUDA C++ Applications with Multiple GPUs
  • Familiarity with the Linux command line
  • Experience using Makefiles

Upcoming Iterations and Additional Courses

You can find dates and registration links for this and other upcoming NHR@FAU courses at https://hpc.fau.de/teaching/tutorials-and-courses/.

Erlangen National High Performance Computing Center (NHR@FAU)
Martensstraße 1
91058 Erlangen
Germany
  • Imprint
  • Privacy
  • Accessibility
  • How to find us
  • RSS Feed
Up