Success story: Simulation using Pardiso
Background
A collaborator from a large HPC center contacted us about an electrical network simulation code. The application consumes a lot of compute resources and normally takes between 4 and 12 months to run one simulation. The code depends heavily on the PARDISO solver library and a few BLAS-3 calls, and is implemented in Fortran 77.
The simulation code had not previously been analysed for performance, and its users did not know where the hotspots are. The initial target was to investigate better compiler switches. Even a small improvement in single-core performance would have a significant impact on both the cost and the duration of the entire simulation.
The baseline performance at the beginning of the analysis was 1.13 GFlop/s.
Overview
Test system
- CPU type: Intel Sandy Bridge E5-2680 @ 2.7 GHz
Software Environment
Compiler:
- Vendor: Intel
- Version: ifort (IFORT) 19.0.2
Difficulties
As the hotspots were unknown, we started with profiling. We first tried gprof, but it was not able to profile the code properly. We then used the Intel compiler's function profiling (compile-time flag -profile-functions), which gave us reliable profiling results.
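For illustration, a compile line of the kind used for this function profiling could look as follows; the optimization level, output name and file list are placeholders, not taken from the project.

  ifort -O2 -profile-functions -o simulation *.f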
Analysis
Profiling results:
Function                    Time (%)   Self (%)   Call count
Main                        100        2.72       1
blklu_unsym_risc_pardiso    93.52      70.98      100
mmpyi_pardiso               9.02       9.02       174771400
dgetc2_pardiso              5.70       5.70       2785050
scatt_pardiso               3.89       3.89       4779600
From the profiling results it was clear that the function blklu_unsym_risc_pardiso consumed most of the time: about 70% was self time and the remaining ~23% was mainly incurred in the child functions mmpyi_pardiso, dgetc2_pardiso and scatt_pardiso. A closer look at blklu_unsym_risc_pardiso showed that it is just a high-level function containing calls to the PARDISO functions listed above and two BLAS-3 operations. However, since MKL was used for the BLAS-3 calls, the profiler could not resolve these calls and attributed their time to the self time (70%) of blklu_unsym_risc_pardiso.
To get a clearer picture of the BLAS calls, we wrapped each BLAS call in a new function and called the wrapper functions from blklu_unsym_risc_pardiso.
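A minimal sketch of such a wrapper is shown below for dtrsm; the argument list simply mirrors the standard BLAS interface, and the wrapper's only purpose is to give the profiler a symbol it can attribute the MKL time to.

      SUBROUTINE WDTRSM( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA,
     $                   A, LDA, B, LDB )
*     Thin wrapper around the BLAS-3 routine DTRSM.  It adds nothing
*     but a resolvable symbol, so the profiler no longer lumps the
*     MKL time into the caller's self time.
      CHARACTER SIDE, UPLO, TRANSA, DIAG
      INTEGER M, N, LDA, LDB
      DOUBLE PRECISION ALPHA, A(LDA,*), B(LDB,*)
      CALL DTRSM( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA,
     $            B, LDB )
      RETURN
      END

An analogous wrapper wdgemm was used for the dgemm calls.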
The profile changed to:
Function                    Time (%)   Self (%)   Call count
Main                        100        2.72       1
wdtrsm                      39.35      39.35      55700800
blklu_unsym_risc_pardiso    93.52      17.59      100
wdgemm                      12.66      12.66      6370300
mmpyi_pardiso               9.02       9.02       174771400
dgetc2_pardiso              5.70       5.70       2785050
scatt_pardiso               3.89       3.89       4779600
We now see the contribution of the BLAS calls dgemm and dtrsm through the wrappers wdgemm and wdtrsm. The first two hotspots are thus pinpointed: dtrsm (self time of about 40%) and dgemm (self time of about 12%). These routines solve a dense triangular matrix equation and perform a dense matrix-matrix multiplication, respectively.
Optimization
Optimization 1: LAPACK BLAS
A look at the dimensions with which these BLAS operations are called revealed the first optimization opportunity. It turned out that more than 95% of the time these operations were called with very small dimensions. We knew from previous experience that for such dimensions MKL calls are not optimal because of the large number of initial checks they perform. We therefore substituted these two calls with the plain LAPACK/reference BLAS implementations available from www.netlib.org. This gave us an 8% performance improvement.
Current code performance: 1.22 GFlop/s
Optimization 2: Removing unwanted branches
We further stripped the reference BLAS-3 routines down to the minimum number of branches required by the code. This gave us a performance boost of 17% compared to the previous version.
Current code performance: 1.43 GFlop/s
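To illustrate the kind of specialization applied, a stripped triangular solve for one fixed parameter combination might look as follows. This is only a sketch assuming the case SIDE='L', UPLO='L', TRANSA='N', DIAG='N'; the combinations actually used in the code, and the exact routines produced, are not reproduced here.

      SUBROUTINE DTRSM_LLNN( M, N, ALPHA, A, LDA, B, LDB )
*     Stripped reference-style triangular solve for the fixed case
*     left / lower / no transpose / non-unit diagonal:
*     solves A*X = alpha*B and overwrites B with X.  The argument
*     checking and the special cases for alpha = 0 or 1 present in
*     the reference DTRSM have been removed.
      INTEGER M, N, LDA, LDB, I, J, K
      DOUBLE PRECISION ALPHA, A(LDA,*), B(LDB,*)
      DO 40 J = 1, N
         DO 10 I = 1, M
            B(I,J) = ALPHA*B(I,J)
   10    CONTINUE
*        Forward substitution down the lower triangle of A.
         DO 30 K = 1, M
            B(K,J) = B(K,J)/A(K,K)
            DO 20 I = K + 1, M
               B(I,J) = B(I,J) - B(K,J)*A(I,K)
   20       CONTINUE
   30    CONTINUE
   40 CONTINUE
      RETURN
      END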
Optimization 3: Mixture of LAPACK BLAS and MKL
MKL calls are, however, better for larger dimensions. In this step we therefore did a naive scan to find a cutoff above which the MKL calls perform better, and called the MKL versions for sizes larger than this cutoff. This again gave us a slight speedup of 7%.
Current code performance: 1.54 GFlop/s
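A sketch of such a size-based dispatch inside the dgemm wrapper is shown below. The cutoff value of 64 and the name DGEMM_REF (standing for a renamed copy of the Netlib source, so that it does not clash with MKL's dgemm symbol) are illustrative assumptions, not the values used in the project.

      SUBROUTINE WDGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA,
     $                   B, LDB, BETA, C, LDC )
*     Dispatch between the stripped reference kernel (small problems)
*     and MKL's DGEMM (large problems).  CUTOFF and DGEMM_REF are
*     placeholders for illustration only.
      CHARACTER TRANSA, TRANSB
      INTEGER M, N, K, LDA, LDB, LDC
      DOUBLE PRECISION ALPHA, BETA, A(LDA,*), B(LDB,*), C(LDC,*)
      INTEGER CUTOFF
      PARAMETER ( CUTOFF = 64 )
      IF ( M.LE.CUTOFF .AND. N.LE.CUTOFF .AND. K.LE.CUTOFF ) THEN
         CALL DGEMM_REF( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA,
     $                   B, LDB, BETA, C, LDC )
      ELSE
         CALL DGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA,
     $               B, LDB, BETA, C, LDC )
      END IF
      RETURN
      END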
Optimization 4: Manual inlining
These optimizations changed the profile. The current profile is as follows:
Function                    Time (%)   Self (%)   Call count
Main                        100        4.44       1
blklu_unsym_risc_pardiso    89.50      24.48      100
wdgemm                      17.81      17.81      6370300
mmpyi_pardiso               14.28      14.28      174771400
wdtrsm                      12.45      12.45      6370300
dgetc2_pardiso              8.91       8.91       2785050
scatt_pardiso               6.22       6.22       4779600
Given the large number of calls to mmpyi_pardiso from blklu_unsym_risc_pardiso, we suspected that the function-call overhead might be high. A look at the mmpyi_pardiso function in the PARDISO library showed that it is a light-weight kernel. To remove the call overhead incurred on every invocation, we manually inlined it at the call site.
As we had suspected, this indeed boosted the performance by 17%.
Current code performance: 1.80 GFlop/s
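The transformation follows the usual manual-inlining pattern: the body of the small kernel is copied to the call site and the dummy arguments are substituted. The sketch below shows only the structure of the rewrite; the loop body is a hypothetical stand-in, not the actual mmpyi_pardiso source.

      SUBROUTINE UPDATE( NB, IDXP, IDX, VAL, Y )
*     Illustration of manual inlining.  The original code performed
*        CALL MMPYI_PARDISO( ... )
*     inside the J loop; after inlining, the (hypothetical) kernel
*     body sits directly at the call site, so no call overhead is
*     paid for each of the ~175 million invocations.
      INTEGER NB, IDXP(NB+1), IDX(*), I, J
      DOUBLE PRECISION VAL(*), Y(*)
      DO 20 J = 1, NB
*        --- inlined body of the light-weight kernel ---
         DO 10 I = IDXP(J), IDXP(J+1) - 1
            Y(IDX(I)) = Y(IDX(I)) + VAL(I)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END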
Optimization 5: Interprocedural optimization
Following the success of manually inlining this single function, we expected it to be beneficial to inline all the small functions. The important kernels to inline were determined to be the BLAS-3 routines with small dimensions and some PARDISO routines (such as dgetc2_pardiso). As manual inlining would be tedious, we looked for the automatic alternative provided by the compiler: with the Intel compiler, such cross-file inlining can be enabled with the flag -ipo. For the PARDISO library calls we had to bring the corresponding source files into the build so that they could be inlined.
This gave us a gain of 8.8%.
Current code performance: 1.96 GFlop/s
Optimization 6: Tuning interprocedural optimization
To make sure that all required functions were inlined, we generated an optimization report using the compile-time flags -qopt-report=5 -qopt-report-phase=ipo. Analysis of the report showed that some functions were not inlined because the compiler exceeded its default inlining limits (such as the maximum size). Based on the report we then raised the defaults of -inline-max-size and -inline-max-total-size so that all required functions were inlined. This gave us a further performance improvement of about 4%. The final profile contains only blklu_unsym_risc_pardiso (more than 80% self time) as a major contributor, since all other functions are now inlined into it.
Current code performance: 2.04 GFlop/s
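For illustration, the relevant options could be combined on a compile line such as the following; the numeric limits are placeholders (the values actually required follow from the optimization report), and the output name and file list are likewise placeholders.

  ifort -O2 -ipo -qopt-report=5 -qopt-report-phase=ipo \
        -inline-max-size=4000 -inline-max-total-size=40000 -o simulation *.f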
Summary and Outlook
Code profiling helped us to identify the relevant bottlenecks and to optimize the code based on them. The optimizations yielded a total speedup of 1.8x (80%) over the baseline code (2.04 GFlop/s vs. 1.13 GFlop/s). All of these optimizations were carried out within a span of two days.
Contact