Modern HPC clusters are hierarchical: nodes are connected by a network and programmed with distributed-memory parallelism, while each node exposes multiple sockets and cores for shared-memory parallelism. Exploiting this hierarchy efficiently requires combining programming models, typically MPI for inter-node communication and OpenMP (or MPI-3.0 shared memory) within each node. This course examines the motivations, design choices, and performance trade-offs of such hybrid approaches, covering MPI+OpenMP, MPI-3.0 shared memory, and MPI+OpenMP GPU offloading side by side.
Through case studies and targeted micro-benchmarks, participants explore the performance implications of process and thread placement, intra-node communication strategies, and halo exchange patterns on multi-socket, multi-core systems.
This is an advanced course intended for participants who are already proficient in both MPI and OpenMP. It pairs naturally with the NHR@FAU courses "Introduction to Parallel Programming with MPI" and "Introduction to Parallel Programming with OpenMP".
Level: Advanced
Language: English (German upon request for bespoke courses)
Price and Eligibility: Refer to the registration page for each event (generally free of charge for members of academia from Europe).
Prerequisite knowledge
- Solid experience with MPI programming (point-to-point and collective communication)
- Solid experience with OpenMP programming (parallel regions, loop parallelism, synchronization)
Technical requirements
- A modern web browser or SSH client for accessing the HLRS cluster environment provided for the course
After completing this course, you will be able to:
- Explain the performance motivation for hybrid MPI+OpenMP programming on hierarchical HPC systems
- Implement hybrid parallel programs that combine MPI for inter-node communication with OpenMP for intra-node threading
- Exploit MPI-3.0 shared memory windows for direct neighbor access and efficient halo copies within a node
- Compare hybrid MPI+OpenMP, MPI-3.0 shared memory, and pure MPI implementations in terms of performance and programmability
- Optimize process and thread placement for multi-socket, multi-core architectures
- Apply hybrid MPI+OpenMP offloading strategies for GPU-accelerated workloads
- Use performance analysis tools to diagnose bottlenecks in hybrid parallel programs
Course content:
- Motivation for hybrid programming: memory consumption, communication overhead, and the node hierarchy
- Hybrid MPI+OpenMP: programming model, synchronization strategies, and thread-safety levels
- MPI-3.0 shared memory: windows, direct neighbor access, and halo exchange without message passing
- Process and thread placement on multi-socket, multi-core systems
- Performance comparison: hybrid vs. pure MPI on representative benchmarks and case studies
- MPI+OpenMP offloading to GPU accelerators
- Performance analysis tools for hybrid programs
Course dates:
- 2026, Feb 10: full-day on-site course in Hybrid @ HLRS in collaboration with HLRS, ASC
- 2025, Jan 21: full-day on-site course in Hybrid @ HLRS in collaboration with HLRS, VSC
- 2024, Jan 23: full-day on-site course in Hybrid @ HLRS in collaboration with HLRS, VSC
- 2022, Dec 12: full-day online course in collaboration with VSC, PRACE, HLRS
- 2022, Jun 22: full-day online course in collaboration with LRZ, PRACE, HLRS, VSC
- 2022, Apr 5: full-day online course in collaboration with VSC, PRACE, HLRS
- 2021, Jun 15: full-day online course in collaboration with VSC, HLRS
- 2020, Jun 17: full-day online course in collaboration with VSC, HLRS
- 2020, Jan 27: full-day on-site course at HLRS in collaboration with HLRS, VSC
For an overview of all NHR@FAU courses, visit the course overview page.