GPU Performance Engineering

Porting code to the GPU can yield significant speedups, but achieving good GPU utilization requires understanding where and why performance falls short. This advanced course addresses exactly that challenge: it introduces NVIDIA’s profiling ecosystem – Nsight Systems for application-level timeline analysis and Nsight Compute for individual kernel assessment – and pairs them with resource-based performance models that let developers judge how close their code comes to the hardware’s theoretical limits. Instrumentation with NVTX markers is also covered to improve profiler output legibility. Beyond diagnosis, the course derives realistic performance limits from micro-benchmarks, applies the roofline model to a 2D-stencil running example, and works through concrete optimizations – reducing host-device transfers, raising occupancy and parallelism, and improving memory access patterns and cache reuse – before concluding with a hands-on conjugate-gradient optimization challenge. Code examples are provided in CUDA, OpenMP target offloading, and OpenACC.

This course is the significantly extended successor to the earlier course "Performance Analysis on GPUs with NVIDIA Tools", which was offered as a standalone half-day course until 2025. NHR@FAU also offers a condensed two-hour GPU Performance Analysis module for integration into summer schools and other larger events.

Course Details

Level: Intermediate to advanced

Language: English (German upon request for bespoke courses)

Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).

Prerequisites

Knowledge

Experience with GPU programming in CUDA or OpenMP target offloading, using C/C++

Technical

A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
A local installation of NVIDIA Nsight Systems and Nsight Compute (no local GPU required)

Learning Outcomes

After completing this course, you will be able to:

Instrument GPU applications with NVTX markers to produce interpretable profiler output
Use the Nsight Systems CLI and GUI to capture and analyze application-level timelines
Use the Nsight Compute CLI and GUI to assess individual CUDA kernel performance
Apply resource-based performance models to determine theoretical performance limits
Build and interpret roofline models (arithmetic intensity and machine balance) for GPU kernels
Derive realistic bandwidth and compute limits from micro-benchmarks
Identify the dominant bottleneck of a GPU kernel and quantify the gap to peak performance
Prioritize optimization effort based on profiling data and performance model predictions
Apply targeted optimizations: reduce host-device transfers, raise occupancy and parallelism, and improve memory coalescing and cache reuse
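
As an illustration of the roofline reasoning named in the outcomes above, the following minimal Python sketch estimates the performance limit of a double-precision 5-point 2D stencil. All numbers are placeholder assumptions and do not describe any particular GPU or the course's own example: 4 floating-point operations and 16 bytes of memory traffic per grid-point update (one 8-byte load and one 8-byte store, assuming perfect cache reuse of neighboring values), a peak of 10,000 GFLOP/s, and a memory bandwidth of 1,000 GB/s. The function names are hypothetical as well.

```python
def arithmetic_intensity(flops_per_update: float, bytes_per_update: float) -> float:
    """Floating-point operations performed per byte moved to or from memory."""
    return flops_per_update / bytes_per_update


def roofline_limit(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound in GFLOP/s: the lower of the compute and the memory ceiling."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Placeholder assumptions (hypothetical values, not measurements of real hardware).
FLOPS_PER_UPDATE = 4     # assumed operation count for one 5-point stencil update
BYTES_PER_UPDATE = 16    # assumed traffic: one 8-byte load plus one 8-byte store
PEAK_GFLOPS = 10_000.0   # assumed peak compute throughput
BANDWIDTH_GBS = 1_000.0  # assumed attainable memory bandwidth

intensity = arithmetic_intensity(FLOPS_PER_UPDATE, BYTES_PER_UPDATE)
# Ridge point in FLOP/byte: kernels with lower intensity are memory bound.
# Some texts define machine balance as the reciprocal of this value (byte/FLOP).
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS
limit = roofline_limit(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
bound = "memory" if intensity < ridge_point else "compute"

print(f"arithmetic intensity: {intensity} FLOP/byte")
print(f"ridge point:          {ridge_point} FLOP/byte")
print(f"roofline limit:       {limit} GFLOP/s ({bound} bound)")
```

Under these assumptions the kernel is memory bound, so comparing its measured throughput with the bandwidth ceiling is more informative than comparing it with peak compute. In practice, the bandwidth figure would come from a micro-benchmark rather than a data sheet.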

Course Outline

GPU architecture fundamentals and the roofline model for GPUs
Application instrumentation with NVTX and timeline analysis with Nsight Systems
Kernel-level profiling with Nsight Compute: metrics, roofline, and memory analysis
Micro-benchmarking memory bandwidth and compute throughput
Occupancy and parallelism optimization
Interpreting bottleneck indicators and guiding optimization decisions
Challenge: profiling and optimizing a conjugate-gradient solver

Upcoming Events

2026, Sep 28-30: online course, three half days (Register)

Past Events

2026, Apr 22-24: online course, three half days
2025, Oct 8-10: online course, three half days
2025, Apr 11: full-day online course

For an overview of all NHR@FAU courses, visit the course overview page.
