Porting code to the GPU can yield significant speedups, but achieving good GPU utilization requires understanding where and why performance falls short. This advanced course addresses exactly that challenge: it introduces NVIDIA’s profiling ecosystem – Nsight Systems for application-level timeline analysis and Nsight Compute for individual kernel assessment – and pairs them with resource-based performance models that let developers judge how close their code comes to the hardware’s theoretical limits. Instrumentation with NVTX markers is also covered to improve profiler output legibility. Beyond diagnosis, the course derives realistic performance limits from micro-benchmarks, applies the roofline model to a 2D-stencil running example, and works through concrete optimizations – reducing host-device transfers, raising occupancy and parallelism, and improving memory access patterns and cache reuse – before concluding with a hands-on conjugate-gradient optimization challenge. Code examples are provided in CUDA, OpenMP target offloading, and OpenACC.
This course is the significantly extended successor to Performance Analysis on GPUs with NVIDIA Tools, which was offered as a standalone half-day course until 2025. NHR@FAU also offers a condensed two-hour GPU Performance Analysis module that can be integrated into summer schools and other larger events.
Level: Intermediate to advanced
Language: English (German upon request for bespoke courses)
Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).
Knowledge
- Experience with GPU programming in CUDA or OpenMP offloading using C/C++
Technical
- A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
- A local installation of NVIDIA Nsight Systems and Nsight Compute (no local GPU required)
After completing this course, you will be able to:
- Instrument GPU applications with NVTX markers to produce interpretable profiler output
- Use the Nsight Systems CLI and GUI to capture and analyze application-level timelines
- Use the Nsight Compute CLI and GUI to assess individual CUDA kernel performance
- Apply resource-based performance models to determine theoretical performance limits
- Build and interpret roofline models (arithmetic intensity and machine balance) for GPU kernels
- Derive realistic bandwidth and compute limits from micro-benchmarks
- Identify the dominant bottleneck of a GPU kernel and quantify the gap to peak performance
- Prioritize optimization effort based on profiling data and performance model predictions
- Apply targeted optimizations: reduce host-device transfers, raise occupancy and parallelism, and improve memory coalescing and cache reuse
Topics
- GPU architecture fundamentals and the roofline model for GPUs
- Application instrumentation with NVTX and timeline analysis with Nsight Systems
- Kernel-level profiling with Nsight Compute: metrics, roofline, and memory analysis
- Micro-benchmarking memory bandwidth and compute throughput
- Occupancy and parallelism optimization
- Interpreting bottleneck indicators and guiding optimization decisions
- Challenge: profiling and optimizing a conjugate-gradient solver
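As a rough illustration of the roofline reasoning named in the topics above, the following minimal Python sketch estimates an upper performance limit for a memory-bound kernel. All numbers are hypothetical placeholders rather than figures from the course: the peak compute rate, the peak memory bandwidth, and the per-update FLOP and byte counts of an idealized 5-point 2D stencil in double precision are assumptions chosen only to keep the arithmetic simple. Machine balance is expressed here in FLOP/byte; the inverse convention (byte/FLOP) is also common.

```python
def roofline_limit(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Upper performance limit in FLOP/s under the basic roofline model.

    peak_flops:            peak compute throughput in FLOP/s
    peak_bandwidth:        peak memory bandwidth in byte/s
    arithmetic_intensity:  FLOP executed per byte moved to or from memory
    """
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


# Hypothetical device limits (assumed values, not measurements).
peak_flops = 10.0e12      # 10 TFLOP/s
peak_bandwidth = 1.0e12   # 1 TB/s

# Idealized 5-point stencil update in double precision (assumed model):
# 4 FLOP per grid point, one 8-byte load and one 8-byte store reaching
# main memory, with all neighbor accesses served from cache.
flops_per_update = 4
bytes_per_update = 16
arithmetic_intensity = flops_per_update / bytes_per_update  # FLOP/byte

# Machine balance in FLOP/byte: the intensity at which the memory roof
# meets the compute roof.
machine_balance = peak_flops / peak_bandwidth

limit = roofline_limit(peak_flops, peak_bandwidth, arithmetic_intensity)
bound = "memory" if arithmetic_intensity < machine_balance else "compute"

print(f"arithmetic intensity: {arithmetic_intensity} FLOP/byte")
print(f"machine balance:      {machine_balance} FLOP/byte")
print(f"roofline limit:       {limit / 1e12} TFLOP/s ({bound}-bound)")
```

With these assumed values the intensity lies well below the machine balance, so the kernel is classified as memory-bound and its limit is set by bandwidth alone. In practice, the peak values would be replaced by limits derived from micro-benchmarks, and the byte count by data volumes measured with a profiler.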
- 2026, Sep 28-30: online course over three half-days (Register)
- 2026, Apr 22-24: online course over three half-days
- 2025, Oct 8: full-day online course
- 2025, Apr 11: full-day online course
For an overview of all NHR@FAU courses, visit the course overview page.