Best Short Paper Award at the SC20 PMBS Workshop
We are happy to announce that our paper “Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX” has been awarded the “Best Short Paper Award” at PMBS20, the “11th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems” at SC20. This paper has emerged from a collaboration with colleagues from the University of Regensburg, who run a Fujitsu FX700 machine.
The first step towards a good understanding of the performance features (and quirks) of a new CPU is to get a good grasp of its instruction execution resources and its memory hierarchy; connoisseurs know that these are the ingredients for ECM performance models of steady-state loops. We were able to show that the cache hierarchy of the A64FX is partially overlapping, mainly with respect to data writes. That’s a good thing. What’s not so good is that many instructions in the A64FX core have rather long latencies. For instance, the 512-bit Scalable Vector Extensions (SVE) floating-point ADD and FMA instructions take 9 cycles to complete, and horizontal ADDs across a SIMD register take even more, which means that sum reductions, scalar products, etc. can be very slow if the compiler doesn’t have a clue about modulo variable expansion. To add insult to injury, the core seems to have very limited out-of-order (OoO) capabilities, putting even more burden on the compiler.
As a consequence, sparse matrix-vector multiplication (SpMV) needs special care to get good performance (i.e, to saturate the memory bandwidth). In particular, you need a proper data format: Compressed Row Storage (CRS) just doesn’t cut it unless the number of nonzeros per row is ridiculously large. Our SELL-C-σ format is just the right fit as it supports SIMD vectorization and deep unrolling without much hassle. As a result, SpMV can easily exceed the 100 Gflop/s barrier for reasonably benign matrices on the A64FX, but you need almost all the twelve cores on each of the four ccNUMA domains – which means that any load imbalance will immediately by punished with a performance loss. Your run-of-the-mill x86 server chips are much more forgiving in this respect since load imbalance can be partially hidden by the strong memory saturation.
The SVE intrinsics code for all experiments can be found in our artifacts description at https://github.com/RRZE-HPC/pmbs2020-paper-artifact. Watch Christie’s and Jan’s presentation on YouTube:
C. L. Alappat, J. Laukemann, T. Gruber, G. Hager, G. Wellein, N. Meyer, and T. Wettig: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX. Accepted for the 11th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS20). Preprint: arXiv:2009.13903