Tutorials & Courses
Beyond the curricular teaching activities of the Professorship of High Performance Computing at FAU, we offer a wide spectrum of courses on parallel programming, GPU programming, code optimization, modern C++, and more, from introductory to advanced levels. We are particularly known for our “Node-Level Performance Engineering” tutorials and courses, which we provide regularly at the IEEE/ACM Supercomputing conference series, at the German Gauss Centre for Supercomputing (GCS) sites at Garching (LRZ) and Stuttgart (HLRS), and at the Vienna Scientific Cluster (VSC) at TU Wien. At these sites, we are also actively involved in “MPI+X” hybrid programming tutorials in close collaboration with lecturers from HLRS and VSC.
Upon request, we also offer our course program to other interested computing centers, research institutions, and industry.
To see upcoming dates for our courses, please click on the name of the course you are interested in.
If you want to participate in one of our courses, please find the link to the registration in the respective accordion section.
Overview of the entire course program
HPC Introduction
This long-standing course is a collaboration of Erlangen National High Performance Computing Center (NHR@FAU) and Leibniz Supercomputing Center (LRZ). It is targeted at students and scientists with interest in programming modern HPC hardware, specifically the large scale parallel computing systems available in Jülich, Stuttgart and Munich, but also smaller clusters in Tier-2/3 centers and departments.
Upcoming
- Three-day on-site course at LRZ Garching (PPHPS25), February 18-20, 2025 (Alireza Ghasemi, Georg Hager together with LRZ staff)
Past
- Three-day on-site course at NHR@FAU (PPHPS24), February 20-22, 2024 (Ayesha Afzal, Markus Wittmann, Georg Hager together with LRZ staff)
- Three-day online course (PPHPS23), March 7-9, 2023 (Ayesha Afzal, Markus Wittmann, Georg Hager together with LRZ staff)
- Three-day online course (PPHPS22), March 8–10, 2022 (Ayesha Afzal, Markus Wittmann, Georg Hager together with LRZ staff)
- Three-day online course (PPHPS21), April 13–15, 2021 (together with LRZ staff)
- Annual course at RRZE, March 9–13, 2020 (together with LRZ staff)
This course gives an introduction to the Message Passing Interface (MPI), the dominating distributed-memory programming paradigm in High Performance Computing.
Upcoming
- Two-day online course at NHR@FAU, April 9-10, 2025.
Past
- Two-day online course at NHR@FAU, April 11-12, 2024.
OpenMP is a standard for parallelizing shared-memory C/C++ and Fortran applications. It is supported by major compilers and offers a simple way into thread-based parallelization with a low barrier to entry. This course gives an introduction to the basic workings and constructs used for parallelizing applications with OpenMP, as well as advanced topics such as tasking and accelerator offloading.
Upcoming
- Three-day online course, Feb. 26-28, 2025
Past
- Three half-day online course, September 4-6, 2024
- Introduction to OpenMP: part 2 (online), March 12, 2024
- Introduction to OpenMP: part 1 (online), March 5, 2024
- Introduction to OpenMP: part 2 (online), September 27, 2023
- Introduction to OpenMP: part 1 (online), September 20, 2023
- Introduction to OpenMP: part 2 (online), March 28, 2023
- Introduction to OpenMP: part 1 (online), March 21, 2023
- Full-day online course, October 4, 2022.
At the conclusion of this workshop, participants will possess a robust understanding of the essential tools and techniques required for GPU-accelerating C/C++ applications with CUDA. Key takeaways include the ability to write GPU-executable code, harness data parallelism, optimize memory migration with asynchronous prefetching, employ command-line and visual profilers for guidance, utilize concurrent streams for enhanced parallelism, and apply a profile-driven approach to develop or refactor CUDA C/C++ applications for optimal performance.
Upcoming
- Full-day online course, part 1 of From Zero to Multi-Node GPU Programming, March 12, 2025 (in collaboration with NHR@TUD)
- Full-day online course, part 2 of GPU Programming Workshop, February 4, 2025 (in collaboration with LRZ)
Past
- Full-day online course, part 1 of From Zero to Multi-Node GPU Programming, September 18, 2024 (in collaboration with NHR@TUD)
- Two half-day online course, March 4-5, 2024 (in collaboration with EUMaster4HPC)
- Full-day online course, February 29, 2024
- Full-day in-person course, July 28, 2023
- Full-day in-person course, March 23, 2023
- Two half-day online course, March 8-9, 2023 (in collaboration with EUMaster4HPC)
- Two half-day online course, December 9 & 16, 2022 (in collaboration with EUMaster4HPC)
- Full-day online course, November 28, 2022 (in collaboration with LRZ Garching).
- Two half-day online course, April 21–22, 2022.
By the end of this workshop, participants will gain proficiency in fundamental tools and techniques for GPU-accelerated Python applications using CUDA and Numba. Highlights include the ability to GPU-accelerate NumPy ufuncs, configure code parallelization via the CUDA thread hierarchy, implement custom device kernels for optimal performance and flexibility, and employ memory coalescing and on-device shared memory to enhance the performance of CUDA kernels.
Upcoming
- Full-day online course, April 2, 2025
- Full-day online course, part 3 of GPU Programming Workshop, February 5, 2025 (in collaboration with LRZ)
Past
- Full-day online course, NVIDIA DLI Virtual Workshops for Higher Education, October 24, 2024
- Full-day online course, October 7, 2024
- Full-day online course, March 14, 2024
- Two half-day online course, March 6-7, 2024 (in collaboration with EUMaster4HPC)
- Full-day on-site course, September 18, 2023
- Full-day in-person course, March 16, 2023
- Two half-day online course, September 22–23, 2022
- Two half-day online course, August 02–03, 2022
By the end of this workshop, participants will have a basic understanding of OpenACC, a high-level, directive-based programming model for GPUs. The workshop focuses on profiling and optimizing CPU-only applications to identify hot spots for acceleration, using OpenACC directives to GPU-accelerate codebases, and optimizing data movement between the CPU and the GPU accelerator.
Upcoming
- Full-day online course, April 16, 2025
- Full-day online course, part 1 of GPU Programming Workshop, February 3, 2025 (in collaboration with LRZ)
The Python programming language has become very popular in scientific computing for various reasons. Users not only implement prototypes for numerical experiments on small scales, but also develop parallel production codes, thereby partly replacing compiled languages such as C, C++, and Fortran. However, when following this approach it is crucial to pay special attention to performance. This course teaches approaches to using Python efficiently and reasonably in an HPC environment. The first lecture gives a whirlwind tour through the Python programming language and the standard library. The subsequent lectures focus strongly on performance-related topics such as NumPy, Cython, Numba, compiled C and Fortran extensions, profiling of Python and compiled code, parallelism using multiprocessing and mpi4py, parallel frameworks such as Dask, and efficient I/O with HDF5. In addition, we will cover topics more closely related to software engineering, such as packaging, publishing, testing, and the semi-automated generation of documentation. Finally, basic visualization tasks using matplotlib and similar packages are discussed.
Past
- Three-day online course (conducted by MPCDF), July 25-27, 2023
Advanced HPC
This course covers performance engineering approaches on the CPU core level.
While many developers put a lot of effort into optimizing parallelism, they often lose sight of the importance of making the serial code efficient first. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys a thorough understanding of the interactions between software and hardware on the level of a single CPU core and the lowest memory hierarchy level, the L1 cache. It covers general computer architecture for x86 and ARM processors, an introduction to (AT&T and AArch64) assembly code, and performance analysis and engineering using the Open Source Architecture Code Analyzer (OSACA) tool in combination with the Compiler Explorer.
Past
- Half-day on-site tutorial at SC24, Atlanta, Georgia, November 17-22, 2024.
- Full-day online tutorial at NHR@FAU, October 8, 2024.
- Full-day on-site tutorial at PPAM 2024, the 15th International Conference on Parallel Processing & Applied Mathematics, Ostrava, Czech Republic, September 8-11, 2024
- Full-day on-site tutorial at PACT 2023, the 32nd International Conference on Parallel Architectures and Compilation Techniques, Vienna, Austria, October 21-25, 2023.
- Full-day online tutorial at NHR@FAU, October 12, 2023
- Full-day tutorial at ICPE 2023, the 14th ACM/SPEC International Conference on Performance Engineering, April 15-19, 2023, Coimbra, Portugal.
This course covers performance engineering approaches on the compute node level.
Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance could at best be achieved by their code. This is because parallelism takes us only half the way to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted. This course conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code gets executed that does the actual computational work. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes. Pipelining, SIMD, superscalarity, caches, memory interfaces, ccNUMA, etc., are covered. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in due detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption. Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations by a scientific process.
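To make the idea behind the Roofline model concrete, here is a minimal sketch of its basic form: the attainable performance is the smaller of the peak performance and the product of computational intensity and memory bandwidth. The function name and the machine numbers are assumptions chosen for illustration only; they are not taken from the course material or from any real system.

```python
def roofline_bound(peak_flops, mem_bandwidth, intensity):
    """Basic Roofline limit in flop/s: min(P_peak, I * b_S).

    peak_flops    -- peak floating-point performance in flop/s
    mem_bandwidth -- attainable memory bandwidth in byte/s
    intensity     -- computational intensity of the loop in flop/byte
    """
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical compute node (illustrative numbers, not a real machine).
PEAK = 2.0e12       # 2 Tflop/s
BANDWIDTH = 2.0e11  # 200 GB/s

# Double-precision vector triad a[i] = b[i] + s * c[i]:
# 2 flops per iteration and 3 * 8 bytes of memory traffic
# (write-allocate transfers are ignored in this sketch).
triad_intensity = 2 / 24

bound = roofline_bound(PEAK, BANDWIDTH, triad_intensity)
print(f"Predicted upper limit: {bound / 1e9:.1f} Gflop/s")
```

With these assumed numbers the loop is limited by memory bandwidth, far below the peak performance, which is the kind of prediction that measurements can then confirm or refute.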
Past
- Three-day online tutorial at the Leibniz Supercomputing Center (LRZ), December 3–5, 2024.
- Four-day online tutorial at the High Performance Computing Center Stuttgart (HLRS), June 18–21, 2024 (with ZIH staff).
- Three-day online tutorial at the Leibniz Supercomputing Center (LRZ), December 4–6, 2023.
- Full-day tutorial at Supercomputing 2023 (SC23), Nov 12–17, 2023, Denver, CO (with Gerhard Wellein and Thomas Gruber).
- Three-day on-site tutorial at NHR@FAU, October 4-6, 2023.
- Half-day online tutorial at ISC High Performance 2023, May 11, 2023.
- Four-day online tutorial at the High Performance Computing Center Stuttgart (HLRS), June 27–30, 2023 (with ZIH staff).
- Three-day online PRACE tutorial at the Leibniz Supercomputing Center (LRZ), December 5–7, 2022.
- Full-day tutorial at Supercomputing 2022 (SC22), Nov 13–18, 2022, Dallas, TX.
- Four-day online PRACE tutorial at the High Performance Computing Center Stuttgart (HLRS), June 28–July 1, 2022 (with ZIH staff).
This tutorial covers code analysis, performance modeling, and optimization for linear solvers on CPU and GPU nodes. Performance Engineering is often taught using simple loops as instructive examples for performance models and how they can guide optimization; however, full, preconditioned linear solvers comprise multiple back-to-back loops enclosed in an iteration scheme that is executed until convergence is achieved. Consequently, the concept of “optimal performance” has to account for both hardware resource efficiency and iterative solver convergence. We convey a performance engineering process that is geared towards linear iterative solvers. After introducing basic notions of hardware organization and storage for dense and sparse data structures, we show how the Roofline performance model can be applied to such solvers in predictive and diagnostic ways and how it can be used to assess the hardware efficiency of a solver, covering important corner cases such as pure memory boundedness. Then we advance to the structure of preconditioned solvers, using the Conjugate Gradient Method (CG) algorithm as a leading example. Hotspots and bottlenecks of the complete solver are identified followed by the introduction of advanced performance optimization techniques like preconditioning and cache blocking.
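As an illustration of the "multiple back-to-back loops enclosed in an iteration scheme" mentioned above, the following is a minimal sketch of a Conjugate Gradient solver with a simple Jacobi (diagonal) preconditioner. It is written in plain Python on dense lists for readability; the function names, the choice of preconditioner, and the small test matrix are assumptions for illustration and do not reflect the tutorial's actual code.

```python
def matvec(A, x):
    """Dense matrix-vector product (one of the loops in each iteration)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A.

    Uses CG with a Jacobi preconditioner (z = D^-1 r).
    Returns the solution and the number of iterations performed.
    """
    inv_diag = [1.0 / A[i][i] for i in range(len(b))]
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:  # converged
            return x, iteration
        Ap = matvec(A, p)                                    # loop 1: matrix-vector product
        alpha = rz / dot(p, Ap)                              # loop 2: scalar product
        x = [xi + alpha * pi for xi, pi in zip(x, p)]        # loop 3: vector update
        r = [ri - alpha * api for ri, api in zip(r, Ap)]     # loop 4: vector update
        z = [d * ri for d, ri in zip(inv_diag, r)]           # loop 5: apply preconditioner
        rz_new = dot(r, z)                                   # loop 6: scalar product
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]  # loop 7: vector update
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite test system (illustrative only).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iters = jacobi_pcg(A, b)
print(iters, x)
```

Each pass through the iteration body streams the same vectors several times in separate loops, which is why the time to solution depends both on how efficiently every loop uses the hardware and on how many iterations are needed to converge.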
Past
- Half-day on-site tutorial at SC24, Atlanta, Georgia, November 17-22, 2024 (Christie L. Alappat and Georg Hager, with Jonas Thies [TU Delft] and Hartwig Anzt [TU München])
- Half-day tutorial at ISC High Performance 2024, Hamburg, Germany, May 12-16, 2024 (Christie L. Alappat and Georg Hager, with Jonas Thies [TU Delft] and Hartwig Anzt [TU München])
Most HPC systems are clusters of shared memory nodes. To use such systems efficiently, both memory consumption and communication time have to be optimized. Therefore, hybrid programming may combine the distributed memory parallelization on the node interconnect (e.g., with MPI) with the shared memory parallelization inside of each node (e.g., with OpenMP or MPI-3.0 shared memory).
This course analyzes the strengths and weaknesses of several parallel programming models on clusters of shared-memory nodes. Multi-socket-multi-core systems in highly parallel environments are given special consideration. MPI-3.0 has introduced a new shared memory programming interface, which can be combined with inter-node MPI communication. It can be used for direct neighbor accesses similar to OpenMP or for direct halo copies, and enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. MPI+OpenMP offloading with GPUs is also covered.
Numerous case studies and micro-benchmarks demonstrate the performance-related aspects of hybrid programming. Hands-on sessions are included on all days. Tools for hybrid programming such as thread/process placement support and performance analysis are presented in a “how-to” section.
This course is a joint training event of EuroCC@GCS and EuroCC-Austria, the German and Austrian National Competence Centres for High-Performance Computing. It is organized by the HLRS in cooperation with the VSC Research Center at TU Wien and NHR@FAU.
Upcoming
- Three-day hybrid tutorial at High Performance Computing Center Stuttgart (HLRS), Stuttgart, Germany, January 21-23, 2025 (Georg Hager, with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
Past
- Three-day hybrid tutorial at High Performance Computing Center Stuttgart (HLRS), Stuttgart, Germany, January 23-25, 2024 (Georg Hager, with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
- Three-day online PRACE tutorial at Vienna Scientific Cluster (VSC), TU Wien, Austria, December 12-14, 2022 (Georg Hager, with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
- Three-day online PRACE tutorial at LRZ Garching, Germany, June 22-24, 2022 (Georg Hager, with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
- Three-day online PRACE tutorial at Vienna Scientific Cluster (VSC), TU Wien, Austria, April 5–7, 2022 (Georg Hager, with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
- Three-day online tutorial at Vienna Scientific Cluster (VSC), TU Wien, Austria, June 15–17, 2021 (with Rolf Rabenseifner [HLRS] and Claudia Blaas-Schenner [TU Wien]).
- Three-day online tutorial at Vienna Scientific Cluster (VSC), TU Wien, Austria, June 17–19, 2020 (with Rolf Rabenseifner [HLRS], Irene Reichl, and Claudia Blaas-Schenner [TU Wien]).
- Two-day tutorial at High Performance Computing Center Stuttgart (HLRS), Stuttgart, Germany, January 27–28, 2020 (with Rolf Rabenseifner [HLRS], Irene Reichl, and Claudia Blaas-Schenner [TU Wien]).
This full-day tutorial covers different approaches to extending single-GPU programs to utilize multiple GPUs within a single compute node. It focuses on distributing work onto multiple accelerators, optimization techniques such as overlapping computation and CPU-GPU data transfers, and using Nsight Systems to analyze execution behavior and performance.
Upcoming
- Full-day online course, part 2 of From Zero to Multi-Node GPU Programming, March 19, 2025 (in collaboration with NHR@TUD)
- Full-day online course, part 4 of GPU Programming Workshop, February 6, 2025 (in collaboration with LRZ)
Past
- Full-day online course, part 2 of From Zero to Multi-Node GPU Programming, September 25, 2024 (in collaboration with NHR@TUD)
- Full-day online course, part 1 of Multi-GPU Programming with CUDA C++, April 5, 2024
- Full-day online course, February 8, 2024
This full-day tutorial extends the methodology of the course ‘Accelerating CUDA C++ Applications with Multiple GPUs’ by introducing techniques for multiple nodes as well as more advanced application examples. Special focus is put on using MPI and NVSHMEM for distributing workloads.
Upcoming
- Full-day online course, part 3 of From Zero to Multi-Node GPU Programming, March 26, 2025 (in collaboration with NHR@TUD)
Past
- Full-day online course, part 3 of From Zero to Multi-Node GPU Programming, October 2, 2024 (in collaboration with NHR@TUD)
- Full-day online course, part 2 of Multi-GPU Programming with CUDA C++, April 10, 2024
- Full-day online course, February 9, 2024
HPC Tools
Porting code to the GPU can offer significant speedups, but it often comes with challenges. This course introduces NVIDIA’s profilers as tools to identify common performance issues that arise during the porting process. Performance analysis will be guided by simple, resource-based models, helping developers assess how far the performance is from the “target.”
Upcoming
- Full-day online course, April 11, 2025
Past
- Performance Analysis on GPUs with NVIDIA Tools, half-day online course, October 9, 2024
- GPU Performance Analysis. Lecture at the International HPC Summer School (IHPCSS), Kobe, Japan, July 7–12, 2024.
- Performance Analysis on GPUs with NVIDIA Tools, half-day online course, March 19, 2024.
- Performance Analysis on GPUs with NVIDIA Tools, half-day online course, October 10, 2023.
- GPU Performance Analysis. Lecture at the International HPC Summer School (IHPCSS), Atlanta, GA, July 9–14, 2023.
- Performance Analysis on GPUs with NVIDIA Tools, half-day online course, April 4, 2023.
- Performance Analysis on GPUs with NVIDIA Tools, half-day online course, September 29, 2022.
- GPU Performance Analysis. Lecture at the International HPC Summer School (IHPCSS), online, June 19–24, 2022.
- GPU Performance Analysis. Lecture at the International HPC Summer School (IHPCSS), online, July 18–30, 2021.
This workshop, organized by VI-HPS and the Erlangen National High Performance Computing Center, gives an overview of the VI-HPS programming tools suite, explains the functionality of the individual tools and how to use them effectively, and offers hands-on experience and expert assistance in using the tools.
On completion, participants should be familiar with common performance analysis and diagnosis techniques and how they can be employed in practice (on a range of HPC systems). Those who prepared their own application test cases will have been coached in the tuning of their measurement and analysis, and provided with optimization suggestions.
Past
- Three-day online workshop at NHR@FAU, March 1–3, 2021.
- Three-day online workshop at CSC Frankfurt, December 7–11, 2020.
- Three-day online workshop at CINECA, Italy, September 30–October 2, 2020.
LIKWID stands for “Like I Knew What I’m Doing.” It is an easy-to-use yet powerful command-line performance tool suite for the GNU/Linux operating system. While the focus of LIKWID is on x86 processors, some of the tools are portable and not limited to any specific architecture. For the upcoming release, LIKWID has been ported to ARMv7/v8 and POWER8/9 architectures as well as to NVIDIA GPU coprocessors.
Past
- Introduction to the LIKWID Tool Suite. Full-day online tutorial, July 23, 2024.
- Introduction to the LIKWID Tool Suite. Full-day online tutorial, July 24, 2023.
- LIKWID, OSACA, and Sparse MVM on A64FX. Webinar for Stony Brook University, July 27, 2021. Video recording
- Webinar: Using the LIKWID and OSACA tools on A64FX. June 2, 2021. Video
Past
- 2021 Code Performance Series: From analysis to insight. Online session on “Single-Node optimization,” July 15, 2021 Video recording
- EXA2PRO-EoCoE joint workshop, afternoon online session “Performance Engineering and code generation techniques”, February 23, 2021 Slides
Programming
The focus of this course is on the introduction of the essential language features and the syntax of C++. Additionally, it introduces many C++ software development principles, concepts, idioms, and best practices, which enable programmers to create professional, high-quality code from the very beginning.
The course aims at understanding the core of the C++ programming language, teaches guidelines to develop mature, robust, maintainable, and efficient C++ software, and helps to avoid the most common pitfalls. Attendees should have a grasp of general programming (in any language).
Past
- Six-day online course at NHR@FAU, September 12/13, 19/20, and 26/27, 2024.
- Six-day online course at NHR@FAU, September 14/15, 21/22, and 28/29, 2023.
- Five-day online course at NHR@FAU, October 10–14, 2022.
This advanced C++ training is a course on software development with the C++ programming language. The training focuses on the essential C++ software development principles, concepts, idioms, and best practices, which enable programmers to create professional, high-quality code.
The course will give insight into the different aspects of C++ (object-oriented programming, functional programming, generic programming) and will teach guidelines to develop mature, robust, maintainable, and efficient C++ code.
Past
- Three-day online course at NHR@FAU, September 30-October 2, 2024.
- Three-day online course at NHR@FAU, October 11-13, 2023.
- Three-day online course at NHR@FAU, October 5–7, 2022.
Tutorials on Molecular Dynamics Simulations
This course gives an introduction to the molecular dynamics engine GROMACS, including fundamental commands and applications. Over five days, the participants will learn how to prepare and run simulations of biomolecular systems (e.g., membranes and proteins) at an atomistic and coarse-grained level of resolution. Post-processing and analysis of simulation trajectories are a large part of the tutorial.
The course is usually embedded in the Bachelor programs of Biology and Integrated Life Sciences. There are five places available for people from NHR. The course will be held in person and takes place in the CIP of the Biology Department.
Interested candidates should send a short note about their background and motivation to rainer.boeckmann@fau.de.
Past
- Three-day online course at the Biology Department of FAU, October 10–12, 2023.
- Three-day online course at the Biology Department of FAU, December 12–16, 2022.