ECM Performance Model

Introduction

The ECM (Execution-Cache-Memory) performance model is a resource-based analytic performance model. It can predict the runtime of straight-line code (usually an inner-most loop body) on a specific processor chip. Runtime predictions are based on a maximum-throughput assumption for instruction execution as well as for data transfers. In its simplest form the model can be set up with pen and paper.

Description

The model decomposes the overall runtime into a sum of contributions. A processor cycle is the only time unit used. All runtime contributions are given for the number of instructions required to process one cacheline worth of data. This makes sense because the smallest unit of data that can be transferred between memory hierarchy levels is a cacheline. For simple calculations, bandwidths and performance are therefore consistently specified in cycles per cacheline.

Figure 1: Example Vector Triad (left column = bandwidth, black arrow = cacheline transfer, red arrow = write allocate cacheline transfer)

The primary runtime contribution is instruction execution. The model assumes maximum instruction throughput for the code under consideration on a specific micro-architecture, with all data referenced by load/store instructions already residing in the inner-most cache level.

All influences preventing maximum instruction throughput, such as:

  • Out-of-order limitations
  • Instruction fetch issues
  • Instruction decoding issues
  • Complex register dependency chains

are neglected. The model does, however, account for simple pipelining hazards and long-latency instructions.

It has to be known (e.g., from static code analysis) what total data volume is transferred between memory hierarchy levels and along which path cachelines travel to get from their initial location into the L1 cache and back. All transfers are assumed to be throughput limited, i.e., perfect latency hiding has to be in place. Additional data transfers caused by cache conflicts are neglected (although the model can be extended to account for them). This data transfer analysis yields an additional runtime contribution for every data path.
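
In its simplest form, discussed below, these contributions are assumed not to overlap at all, so they simply add up. In ad-hoc notation, with T_core denoting the in-core execution time and one transfer term per level of the memory hierarchy that the data actually traverses (all in cycles per cacheline):

T_ECM(no overlap) = T_core + T_L1L2 + T_L2L3 + T_L3Mem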

Example double precision vector triad (Figure 1)

for (int i=0; i<size; i++) {
    A[i] = B[i] + C[i] * D[i];
}

Machine specs

For this example we assume a processor running at 2.7 GHz with the following specs: It can execute one 32-byte SIMD vectorized add and one 32-byte SIMD vectorized multiply instruction per cycle. The processor has superscalar, out-of-order execution with a maximum instruction throughput of four instructions per cycle. Furthermore, it can execute one 32-byte wide load and one 16-byte wide store instruction per cycle. The bandwidth between the L1 and L2 cache is 32 bytes/cycle and the cacheline size is 64 bytes. This results in an L1-L2 bandwidth of 2 cycles per cacheline transfer. The sustained memory bandwidth is 40 GB/s, or 4.3 cycles per cacheline transfer. The cache hierarchy is inclusive and uses a write-back policy for stores. Therefore, for every cacheline store miss an additional write-allocate transfer has to be added.
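
Spelled out, the two cycles-per-cacheline figures follow directly from these specs:

64 bytes / (32 bytes/cycle) = 2 cycles per cacheline (L1-L2)
64 bytes / (40 GB/s / 2.7 GHz) = 64 bytes / 14.8 bytes/cycle ≈ 4.3 cycles per cacheline (memory)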

Code specs

The full vector triad performs one multiply and one add per scalar iteration and requires 3 DP loads and 1 DP store. We assume the 32-byte SIMD instruction set is used. This results in 2 multiply and 2 add instructions to process one cacheline (8 DP elements). On the load/store side, 6 loads and 2 stores are required.
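
To make the per-cacheline instruction counts concrete, here is a sketch of one cacheline worth of work written with AVX intrinsics. It only illustrates the instruction mix assumed by the model; it is not necessarily the code a compiler would emit, and the function name is made up:

#include <immintrin.h>

/* One cacheline (8 doubles) of the triad A = B + C * D:        */
/* 2 multiplies, 2 adds, 6 x 32-byte loads, 2 x 32-byte stores. */
void triad_cacheline(double *A, const double *B, const double *C, const double *D)
{
    for (int i = 0; i < 8; i += 4) {        /* two 32-byte SIMD iterations */
        __m256d b = _mm256_loadu_pd(&B[i]); /* 32-byte load  */
        __m256d c = _mm256_loadu_pd(&C[i]); /* 32-byte load  */
        __m256d d = _mm256_loadu_pd(&D[i]); /* 32-byte load  */
        __m256d r = _mm256_add_pd(b, _mm256_mul_pd(c, d)); /* multiply + add */
        _mm256_storeu_pd(&A[i], r);         /* 32-byte store */
    }
}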

ECM model

First we map the code onto the machine specs. For the instruction execution one has to find the unit that takes the longest. The 2 multiply and 2 add instructions take 2 cycles, as the processor can execute a multiply and an add in the same cycle. The six 32-byte load instructions take 6 cycles, and the two 32-byte stores take 4 cycles because the store port is only 16 bytes wide. As a result of the core execution analysis we conclude that this code is load throughput bound on this specific processor type. The core execution time to update one cacheline worth of work (equivalent to two 32-byte SIMD iterations or 8 DP scalar iterations) is 6 cycles.

To get the runtime contribution of the L2-L1 transfers one first needs to determine the number of transferred cachelines. To update one cacheline worth of work we need to load 3 cachelines and store 1 cacheline. Because the architecture triggers a write-allocate transfer on a store miss, one more cacheline transfer has to be added, resulting in 5 cacheline transfers. The bandwidth is 2 cycles per cacheline transfer, resulting in a 10 cycle runtime contribution. The same data volume applies to the L2-MEM transfers, but at 4.3 cycles per cacheline this results in a contribution of roughly 21.5 cycles.

Summing up all contributions gives a total runtime of roughly 37.5 cycles to update one cacheline worth of data. One can convert this into more common performance metrics, e.g. the flop rate: (16 flops / 37.5 cycles) x 2.7 GHz ≈ 1.15 GFlops/s. The resulting memory bandwidth is (5 x 64 bytes / 37.5 cycles) x 2.7 GHz ≈ 23 GB/s. It already becomes apparent why a single core achieves only a bit more than half of the saturated memory bandwidth: because the contributions do not overlap, the memory bus is busy during only about 21.5 of the 37.5 cycles.
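
For readers who want to check the arithmetic, here is a minimal sketch that reproduces the non-overlapping prediction from the numbers above (the variable names are made up for this illustration):

#include <stdio.h>

int main(void)
{
    const double clock_ghz = 2.7;                          /* core clock */
    /* contributions in cycles per cacheline of work (8 scalar iterations) */
    const double t_core  = 6.0;                            /* load-bound in-core time */
    const double t_l1l2  = 5 * 2.0;                        /* 5 cachelines x 2 cy/CL */
    const double t_l2mem = 5 * (64.0 * clock_ghz / 40.0);  /* 5 cachelines x ~4.3 cy/CL */

    double t_total = t_core + t_l1l2 + t_l2mem;            /* non-overlapping sum */
    double gflops  = 16.0 / t_total * clock_ghz;           /* 16 flops per cacheline */
    double gbps    = 5 * 64.0 / t_total * clock_ghz;       /* 320 bytes per cacheline */

    printf("T = %.1f cy/CL -> %.2f GFlops/s, %.1f GB/s\n", t_total, gflops, gbps);
    return 0;
}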

In the model's initial form the total runtime is computed by summing up all contributions, i.e., no overlap between the different runtime contributions is assumed. Every single runtime contribution is a light-speed estimate: it cannot get better than optimal instruction throughput and bandwidth-limited data transfers. While this simplest form of the model gives correct results in some cases, it is often too pessimistic and underestimates performance, because most micro-architectures do overlap some of the runtime contributions.

In the beginning two performance limits were provided, one assuming no overlap and one assuming full overlap. In many cases these two limits are too far apart to make the model useful, though. In a next step we introduced heuristics that provide a more accurate overlap prediction. A useful heuristic turned out to be the assumption that loads and/or stores do not overlap with any other data transfer in the memory hierarchy.
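
Written out, this heuristic splits the in-core time into a part T_nOL that cannot overlap with data transfers (the load/store cycles) and an overlapping remainder T_OL, and predicts (again in ad-hoc notation, with one transfer term per traversed hierarchy level):

T_ECM = max( T_OL , T_nOL + T_L1L2 + T_L2L3 + T_L3Mem )

For the triad example above this changes nothing, since the overlapping in-core part (2 cycles of arithmetic) is far smaller than the sum of load/store and transfer cycles.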

Extension to multi-core chips

The extension to multi-core chips is straightforward: the single-core performance is scaled linearly until a shared bottleneck of the processor design is hit. This may be a shared cache or the memory interface. The ECM model can predict whether a code is able to saturate the memory bandwidth and how many cores are required to do so.
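
A worked consequence for the triad example, using the single-core numbers derived above: one core keeps the memory interface busy for about 21.5 of its 37.5 cycles, so the saturation point can be estimated as

n_s = ceil( T_ECM / T_Mem ) = ceil( 37.5 / 21.5 ) = 2 cores

i.e. two cores are already enough to hit the 40 GB/s memory bandwidth limit in this example.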

Conclusion

The main advantage of the ECM model is that, apart from providing a performance estimate, it also generates insight into the different runtime contributions and their relation to each other. It has proved to be a very useful tool for understanding observed performance and for deciding on optimisations for a wide range of codes, from simple streaming kernels to various stencil codes, medical image processing kernels, and the Lattice Boltzmann method.

Current and future research focus

While the predictive power of the model is significantly improved by the overlap heuristics, it still happens that the model underestimates performance. Introducing a better heuristic for overlap prediction is ongoing research. Another focus is to develop a similar analytic model for codes whose data transfers are partially latency bound.

Setting up the model can be tedious, especially for large loop bodies. We are developing tools that help obtain the required input and formulate the model more easily: Kerncraft and OSACA.

Selected publications about the theory and applications of the ECM model

  • J. Hofmann and D. Fey: An ECM-based Energy-efficiency Optimization Approach for Bandwidth-limited Streaming Kernels on Recent Intel Xeon Processors. In: IEEE Press (Ed.): Proceedings of the 4th International Workshop on Energy Efficient Supercomputing (E2SC), Salt Lake City, UT, USA, November 14, 2016, pp. 31-38. DOI: 10.1109/E2SC.2016.010
  • J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency & Computation: Practice & Experience 29(9), e3921 (2017). Available online, DOI: 10.1002/cpe.3921.
  • H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Proc. ICS15, the 29th International Conference on Supercomputing, June 8-11, 2015, Newport Beach, CA. DOI: 10.1145/2751205.2751240. Recommended as a good starting point to learn about the ECM model.
  • M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Concurrency and Computation: Practice and Experience 28(7), 2295-2315 (2015). DOI: 10.1002/cpe.3489.
  • G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience 28(2), 189-210 (2016). First published online December 2013, DOI: 10.1002/cpe.3180.
  • J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop “Memory issues on Multi- and Manycore Platforms” at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science, Volume 6067, 2010, pp 615-624. DOI: 10.1007/978-3-642-14390-8_64.