Performance Tools and Libraries
A number of performance tools and libraries are being developed in our group.
LIKWID tool suite
LIKWID is a node-level tool suite and library for performance-aware developers. It features a collection of useful command-line tools for topology exploration, affinity control, hardware performance monitoring, hardware configuration, and microbenchmarking.
Main developer: Thomas Gruber
Kerncraft is a loop kernel analysis and performance modeling toolkit. It automatically analyzes loop kernels using the Execution-Cache-Memory (ECM) model and the Roofline model, and validates the predictions against actual benchmarks. Kerncraft provides a framework for investigating data reuse and cache requirements by static code analysis. In combination with the Intel IACA tool or our own OSACA tool (see below), Kerncraft can give a good overview of both in-core and memory bottlenecks and use that data to construct predictive, white-box performance models. For stencil codes, it can use its built-in layer condition analyzer to automatically generate tuning advice, i.e., determine favorable loop blocking factors that reduce the code balance. Kerncraft contains a Python-based cache hierarchy simulator that is also available as a standalone tool.
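As a rough illustration of two of the ideas mentioned above, the Roofline limit and the layer condition, the following Python sketch evaluates both for a 2D five-point Jacobi stencil in double precision. It does not use Kerncraft's interface. The function names, the machine numbers (100 Gflop/s peak, 60 GB/s memory bandwidth, 1 MiB cache), and the safety factor of 0.5 are assumptions chosen for the example.

```python
def roofline(peak_flops, mem_bandwidth, code_balance):
    """Roofline upper bound in flop/s.

    peak_flops    -- peak in-core performance [flop/s]
    mem_bandwidth -- attainable memory bandwidth [byte/s]
    code_balance  -- data traffic per unit of work [byte/flop]
    """
    return min(peak_flops, mem_bandwidth / code_balance)


def max_block_length(cache_bytes, layers=3, elem_bytes=8, safety=0.5):
    """Largest inner-loop length for which `layers` grid rows fit into
    the usable share (`safety`) of the cache, i.e. the layer condition
    layers * length * elem_bytes <= safety * cache_bytes holds."""
    return int(cache_bytes * safety) // (layers * elem_bytes)


# Hypothetical machine, not a measured system.
PEAK = 100e9        # flop/s
BANDWIDTH = 60e9    # byte/s
CACHE = 1024 ** 2   # 1 MiB

# 2D five-point Jacobi, double precision, 4 flops per update.
# With write-allocate, memory traffic per update is 24 bytes if the
# layer condition holds and 40 bytes if it is violated.
FLOPS_PER_UPDATE = 4
balance_lc_ok = 24 / FLOPS_PER_UPDATE      # 6 byte/flop
balance_lc_broken = 40 / FLOPS_PER_UPDATE  # 10 byte/flop

print(roofline(PEAK, BANDWIDTH, balance_lc_ok))      # memory-bound limit
print(roofline(PEAK, BANDWIDTH, balance_lc_broken))  # lower limit
print(max_block_length(CACHE))                       # blocking factor
```

Blocking the inner loop to at most the returned length keeps three rows in cache, which lowers the code balance from 10 to 6 byte/flop in this example and raises the Roofline limit accordingly.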
Main developer: Julian Hammer
GHOST is the “General, Hybrid, and Optimized Sparse Toolkit.” It provides basic building blocks for computations with very large sparse or dense matrices. GHOST is being developed as part of the ESSEX project under the umbrella of the Priority Programme 1648: Software for Exascale Computing (SPPEXA) of the German Research Foundation (DFG). The library is able to deal with systems containing standard multicore CPUs, Nvidia GPGPUs, and Intel Xeon Phis, and supports heterogeneous parallelism across all three architectures in the same program. GHOST is running successfully on current post-petascale systems such as Oakforest-PACS at the University of Tokyo (Top500 #7 in June 2017) or Piz Daint at the Swiss National Supercomputing Center (CSCS) in Lugano (Top500 #3 in June 2017).
Main developer: Dominik Ernst (formerly Dr. Moritz Kreutzer)
The CRAFT library (Checkpoint/Restart and Automatic Fault Tolerance) provides an easy-to-use interface to checkpoint/restart and dynamic process recovery capabilities. Both of these features can be used independently as well as combined. CRAFT is being developed as part of the ESSEX project under the umbrella of the Priority Programme 1648: Software for Exascale Computing (SPPEXA) of the German Research Foundation (DFG).
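The checkpoint/restart idea can be sketched in a few lines of Python. This is a generic illustration of the technique, not CRAFT's interface. The function names, the JSON file format, and the simulated failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file first, then rename it, so a
    crash during writing never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the stored state, or `default` if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default


def run(path, n_iters, every=10, fail_at=None):
    """Sum 0..n_iters-1, checkpointing every `every` iterations.
    `fail_at` simulates a crash at the given iteration (hypothetical)."""
    state = load_checkpoint(path, {"iter": 0, "total": 0})
    while state["iter"] < n_iters:
        if fail_at is not None and state["iter"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["iter"]
        state["iter"] += 1
        if state["iter"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(ckpt, 100, fail_at=57)   # crashes after the checkpoint at 50
    except RuntimeError:
        pass
    print(run(ckpt, 100))            # resumes from iteration 50
```

The restarted run repeats only the iterations since the last checkpoint, so the checkpoint interval trades I/O overhead against the amount of lost work.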
Main developer: Faisal Shahzad
The Open Source Architecture Code Analyzer (OSACA) is a tool that analyzes assembly code and produces best-case (instruction throughput) and worst-case (critical path) runtime predictions, assuming that the data is in the L1 cache. Such a tool is sorely needed for analytic performance modeling. Intel provides the Intel Architecture Code Analyzer (IACA) for free, but it is not open source and its development has been discontinued. OSACA can do some things that IACA cannot, such as analyzing non-compiled assembly code or extending its own database with new instructions. OSACA can also handle non-Intel architectures, such as AMD x86 and Marvell ThunderX2 (ARMv8).
Why such a tool? Analytic performance models, such as the ECM model, depend on an accurate assessment of in-core execution performance. You can either do that manually by code (source or assembly) inspection, or you can use a tool that knows the instruction set and the limitations of a particular microarchitecture. The data flow analysis must be done by someone else – again, it’s either your brain or, e.g., our Kerncraft tool.
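The two bounds mentioned above can be illustrated with a small Python sketch. It is a simplified model and not OSACA's implementation or interface: the instruction list, the port names, and all latency and port-pressure numbers are hypothetical, and loop-carried dependencies are ignored.

```python
from collections import defaultdict


def throughput_bound(instructions):
    """Best case: cycles per iteration if only port pressure limits
    execution, i.e. the load on the busiest execution port."""
    pressure = defaultdict(float)
    for ins in instructions:
        for port, cycles in ins["ports"].items():
            pressure[port] += cycles
    return max(pressure.values())


def critical_path(instructions):
    """Worst case: longest latency chain through the dependency graph.
    Instructions are in program order; `deps` holds indices of earlier
    instructions whose results are needed."""
    finish = []
    for ins in instructions:
        start = max((finish[d] for d in ins["deps"]), default=0.0)
        finish.append(start + ins["latency"])
    return max(finish)


# Hypothetical kernel body: two loads, multiply, add, store.
kernel = [
    {"name": "load a",  "ports": {"P2": 0.5, "P3": 0.5}, "latency": 4, "deps": []},
    {"name": "load b",  "ports": {"P2": 0.5, "P3": 0.5}, "latency": 4, "deps": []},
    {"name": "mul",     "ports": {"P0": 0.5, "P1": 0.5}, "latency": 4, "deps": [0, 1]},
    {"name": "add",     "ports": {"P0": 0.5, "P1": 0.5}, "latency": 4, "deps": [2]},
    {"name": "store c", "ports": {"P4": 1.0},            "latency": 1, "deps": [3]},
]

print(throughput_bound(kernel))  # optimistic cycles per iteration
print(critical_path(kernel))     # pessimistic cycles per iteration
```

The measured runtime of a real loop typically lies between the two numbers, depending on how well out-of-order execution overlaps successive iterations.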
Main developer: Jan Laukemann
The MachineState Python 3 module and CLI application is a tool to document and compare hardware and software settings known to affect performance. When benchmarking a system and comparing the results with others, it is essential to describe the test system and its configuration. Compute systems today provide various knobs that control runtime behavior: at the hardware level, CPU/Uncore frequencies, hardware prefetchers, or simply memory capacity; at the software/operating system level, settings like NUMA balancing, writeback workqueues, and task scheduling. The versions of commonly used software (compilers and MPI libraries) are also important for reproducing benchmarking results. The MachineState tool gathers all (known) knobs and presents the settings in a JSON document. For reproducibility, a previously recorded state can be compared to the current state to check that the system still matches the original test system.
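Comparing a recorded state with the current one amounts to a recursive comparison of two JSON documents. The following Python sketch shows the idea with the standard library only. It does not use MachineState's own interface, and the key names in the two example documents are hypothetical and do not reflect MachineState's actual JSON layout.

```python
import json

MISSING = "<missing>"  # marker for a key present on one side only


def diff_states(recorded, current, path=""):
    """Return a list of (key_path, recorded_value, current_value)
    for every setting that differs between the two documents."""
    diffs = []
    if isinstance(recorded, dict) and isinstance(current, dict):
        for key in sorted(set(recorded) | set(current)):
            sub = f"{path}/{key}"
            if key not in recorded:
                diffs.append((sub, MISSING, current[key]))
            elif key not in current:
                diffs.append((sub, recorded[key], MISSING))
            else:
                diffs.extend(diff_states(recorded[key], current[key], sub))
    elif recorded != current:
        diffs.append((path, recorded, current))
    return diffs


# Hypothetical example documents.
recorded = json.loads("""
{"compiler": "gcc-12",
 "cpu": {"freq_mhz": 2400, "prefetcher": "on"},
 "os": {"numa_balancing": 0}}
""")
current = json.loads("""
{"cpu": {"freq_mhz": 2200, "prefetcher": "on"},
 "os": {"numa_balancing": 0}}
""")

for key_path, old, new in diff_states(recorded, current):
    print(f"{key_path}: recorded={old!r} current={new!r}")
```

An empty result means that every recorded setting matches, which is the condition one would want before comparing new benchmark numbers with old ones.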
Main developer: Thomas Gruber