CUDA is NVIDIA’s parallel computing platform for GPU-accelerated C/C++ applications. This course covers the full arc from writing and running a first GPU kernel, through systematically porting existing CPU code to the GPU, to understanding the execution model and optimizing with shared memory, atomics, and reductions. Short teaching segments alternate with exercises in interactive Jupyter notebooks; optional modules extend the course to debugging techniques and CUDA streams.
This course was developed as an updated replacement for the discontinued NVIDIA DLI course Fundamentals of Accelerated Computing with CUDA C/C++. NVIDIA DLI’s own successor in this space, Fundamentals of Accelerated Computing with Modern CUDA C++, is also offered by NHR@FAU and covers similar fundamentals, but with a strong focus on modern C++ and Thrust.
Level: Beginner
Language: English (German upon request for bespoke courses)
Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).
Knowledge
- C or C++ programming experience, including variables, loops, conditionals, functions, and arrays
Technical
- A modern web browser (for JupyterHub access to NHR@FAU’s HPC clusters)
- A local installation of NVIDIA Nsight Systems (no local GPU required)
After completing this course, you will be able to:
- Write, compile, and run GPU-accelerated C/C++ code using CUDA
- Map computations to CUDA’s thread hierarchy and launch kernels with suitable execution configurations
- Systematically port existing CPU applications to the GPU following a structured porting workflow
- Understand the GPU execution model and its implications for choosing thread block sizes and grid dimensions
- Use shared memory and atomic operations, and optimize reduction patterns
- Profile GPU applications with NVIDIA Nsight Systems to identify and address performance bottlenecks
- [Optional] Use CUDA streams for concurrent kernel execution and copy/compute overlap
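To give a flavor of the first two objectives, the following is a minimal sketch of a vector addition in CUDA. It is not taken from the course material: the kernel name `add_vectors`, the host helper `add_on_gpu`, the block size of 256, and the file name in the compile command are all illustrative assumptions, and error checking is omitted for brevity.

```cuda
// Compile with, for example: nvcc -o vector_add vector_add.cu  (file name is hypothetical)
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread computes one output element. The global index is built
// from the thread hierarchy (block index, block size, thread index in block).
__global__ void add_vectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain surplus threads
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: copies the inputs to the GPU, launches the kernel,
// and copies the result back. Assumes a and b have the same length.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;

    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Execution configuration: enough blocks of 256 threads to cover n elements.
    int threads = 256;  // assumed block size, chosen for illustration only
    int blocks = (n + threads - 1) / threads;
    add_vectors<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // This blocking copy waits for the kernel to finish before reading d_c.
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    std::vector<float> c = add_on_gpu(a, b);
    std::printf("c[0] = %.1f, c[999] = %.1f\n", c[0], c[999]);
    return 0;
}
```

The rounding-up division in the block count together with the `i < n` guard is a common way to handle problem sizes that are not a multiple of the block size.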
- Introduction: GPU computing overview, course tooling, and interactive Jupyter notebook workflow
- First GPU application: CUDA kernel syntax, compilation, and the thread hierarchy
- Porting applications to GPU: systematic porting approach and GPU memory management
- GPU architecture: the thread execution model and selecting execution configurations
- Reductions, atomics, and shared memory: avoiding race conditions and optimizing reduction patterns on GPUs
- [Optional] Debugging CUDA applications: techniques for debugging GPU applications and writing robust code
- [Optional] CUDA streams: concurrent kernel execution and copy/compute overlap
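As an illustration of the reductions, atomics, and shared memory topic listed above, here is a minimal sketch of a sum reduction. It is one possible pattern and not necessarily the one taught in the course: each block reduces its elements in shared memory, then a single thread per block adds the partial sum to the global result with `atomicAdd`, which avoids a race condition on the output. The names `sum_kernel`, `sum_on_gpu`, and `kBlockSize` are hypothetical, integers are used so the result does not depend on summation order, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; must be a power of two for the tree reduction below.
constexpr int kBlockSize = 256;

__global__ void sum_kernel(const int* in, int* out, int n) {
    __shared__ int partial[kBlockSize];  // one slot per thread in the block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0;  // pad out-of-range threads with 0
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // all threads reach this, inside or outside the if
    }

    // One atomic update per block combines the per-block partial sums safely.
    if (tid == 0) {
        atomicAdd(out, partial[0]);
    }
}

// Hypothetical host helper that returns the sum of all elements of `data`.
int sum_on_gpu(const std::vector<int>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0) return 0;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));  // the accumulator must start at zero

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    sum_kernel<<<blocks, kBlockSize>>>(d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    std::vector<int> data(1000, 1);
    std::printf("sum = %d\n", sum_on_gpu(data));
    return 0;
}
```

Reducing within a block first means only one atomic operation per block reaches global memory, rather than one per element.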
- 2026, Sep 3-4: online course over two half days in collaboration with NHR@TUD (Register); part 1 of From Zero to Multi-Node GPU Programming
- 2026, Jun 10-11: one-and-a-half-day online course in collaboration with LRZ; part 3 of GPU Programming Workshop
- 2026, Mar 9: full-day online course in collaboration with NHR@TUD; part 1 of From Zero to Multi-Node GPU Programming
For an overview of all NHR@FAU courses, visit the course overview page.