Background – Simulations of a customer were identified in our job-specific performance monitoring because of very low resource utilization. The customer is using the Open Source Finite Element Software Elmer/Ice for his simulations.
Analysis – Together with the customer a relevant test case was defined and a performance analyst at RRZE analyzed the application performance. As it turned out to acquire a runtime profile for Elmer/Ice is non-trivial as the code uses a sophisticated on-demand library loading mechanism which confuses standard runtime profilers. An Intel contact person, that is specialized on Elmer, could give us advice how to enable runtime profiling in the code itself. Two hotspots were identified consuming a majority of the runtime. A code review revealed that one specific statement was very expensive.
Optimization – Together with the customer an alternative implementation of this statement was used which resulted in an overall speedup of x3.4 for our benchmark case. During the static code review together with the customer several simulation steps where identified that could be executed only at the end of the simulation.
Summary – This saving of algorithmic work together with the first optimization accumulated to a speedup of factor x9.92. The effort spent on our side was 1.5 days, where getting the code to compile and run took already roughly half a day.
Click here for more information.
Speedup of x9.92 with an effort of 1.5 days. Improvement by saving work.
Speeding up machine-learning on GPU accelerated Cluster nodes
Background – A user contacted us via help desk he suspects that his python based machine-learning calculations on TinyGPU are slowed down by scattered access to a very large HDF5 input file. In the ticket he already indicated putting the file on a SSD helped a lot on other systems.
Analysis – Most TinyGPU nodes have local SSDs installed. Unfortunately our documentation was a bit scattered on this topic. We improved the documentation to prevent similar problems in the future.
Optimization – Putting the input file on a SSD speeds up execution by factor 13. The user further optimized the access by using the python library h5py (read_direct routine), which prevents additional data copies, by a factor of 4.
Summary – The largest speedup is achieved by putting the data on a SSD. Optimizing the read access brings another factor 4 improvement.
Reduce file IO overhead by choosing the right hardware together with optimized data access.
Optimization of a granular gas solver
Background – In the scope of a KONWIHR project, an MPI-parallel granular gas solver was optimized to improve its performance and parallel efficiency.
Analysis – The single-core and multi-core performance of the code was analyzed using a simple test case. A runtime profile was generated to identify the most time-intensive parts of the simulation. Additionally, the LIKWID performance tools were used to measure hardware performance metrics like memory bandwidth and FLOP/s. In the multi-core case, the MPI communication pattern was analyzed using the Intel Trace Analyzer and Collector (ITAC) to determine the ratio of communication to computation for different numbers of processes.
Optimization – The runtime profile showed different possible optimizations of the code. Some functions were not inlined automatically by the compiler, but had to be forced by using a specific compiler flag. The computational costs of some calculations were reduced by avoiding divisions and by reusing already computed quantities. Additionally, some unnecessary allocation and deallocation of memory was identified. After including these optimizations, the code was able to run 14.5 times faster than the original version.
The analysis of the MPI communication behavior with ITAC revealed a share of 30% for communication between 4 processes, which increased further with increasing process number. More investigations on the code showed unnecessary data transfers. By sending only relevant data between processes, the parallel efficiency and performance were increased. For 216 cores, a simple test case was able to run 80% faster with an increase in parallel efficiency of 17% in comparison to the original code.
Summary – By using basic code analysis and optimizations, the runtime of the code was decreased by a factor of 14.5 on a single core. Additionally, a more efficient communication between the MPI processes was able to further decrease the communication overhead and the total runtime of the simulation.
Reduce runtime by factor of 14.5 by saving work.
Optimization of soft-matter simulation package on top of LB3D
Background – A customer contacted us to help them optimizing their LB3D-based software package to simulate soft matter systems at a mesoscopic scale. The software package was rewritten prior to the request to an object-oriented paradigm with redesign of computationally intensive routines. In order to analyze the outcome, the customer wanted to integrate LIKWID’s MarkerAPI to measure specific code
Analysis – The runtime profile showed a few hot functions where most execution time was spent. For analysis, we added MarkerAPI calls around the hot functions and ran some evaluations (FLOP rate, memory bandwidth and vectorization ratio). The vectorization ratio was rather low but the compiler got the proper flags for vectorization. Despite Fortran provides handy declarations to create a new array,
the customers used ‘malloc()’ calls.
Optimization – The main data structure contains the arrays as pointers (allocated by C-wrappers) and the GNU Fortran compiler was not able to determine whether the allocated arrays are contiguous in memory or not, so refused to apply AVX2 vectorization. By adding the ‘contiguous’ keyword to the array declarations, the compiler successfully vectorized the hot functions.
Summary – In the one-day meeting with the customers, we did a hands-on on LIKWID measurements and how to read the results. Moreover, we analyzed code regions in the customers’ software package and found vectorization problems caused by a missing keyword. With the ‘contiguous’ keyword, the performance was increased by 20%. After the one-day meeting, the group continued working on their code resulting in a three-fold improvement in performance.
Click here for more information.
It is possible to appreciate the more than three-fold improvement in performance
Node-level performance optimization of flow solver
Background – The MPI-parallel finite-volume flow solver FASTEST-3D was optimized to increase single-node performance and scalability in the scope of a KONWIHR project. To calculate the turbulent flow in technical applications, a fine temporal and spatial resolution is necessary. These simulations can only be run on current high-performance compute clusters. Possible optimizations of the code were investigated to improve the overall performance and to use the computational resources more efficiently.
Analysis – A function profile of the original version of FASTEST-3D was established by the GNU profiler gprof. Additionally, basic hardware requirements like memory bandwidth were determined on function level by integrating the LIKWID Marker API into the code. The linear equation solver was identified as the most time-consuming part of the code for both serial and parallel execution.
Optimization – Since the equation system does not change in case of the explicit solution procedure, its coefficients have to be computed only once. This is also true for the ILU factorization of the matrix. In the original version of the code, these were computed for every iteration. Avoiding these unnecessary recalculations saves about 7% of computational time. Additionally, the equation system for the pressure correction is symmetric in the explicit case, which was not exploited in the original code version. The optimized code shows an improvement in runtime by about 5%. As a third step, single precision was used to solve the linear equation system. This optimization is beneficial for both implicit and explicit solution procedures and helps reduce the amount of data which has to be loaded during the solution process. The rest of the algorithm is still performed using double precision, which makes an additional data conversion necessary. The use of single precision inside the solver led to a reduction in runtime by 25%.
Summary – By combining information about the most time-consuming functions of the code with hardware metrics like memory bandwidth, the most profitable course of optimization was determined. Together with specific knowledge of the user in the area of solution procedures and internal data structures of the code, a total reduction in runtime by 40% on a single node could be achieved.
Click here for more information.
Single precision solver reduces run time by 25%
Optimization of MPI communication of flow solver
Background – The scalability of the MPI parallel flow solver FASTEST-3D was limited by a rigid communication infrastructure, which led to a dominance of MPI communication time even at low process counts. To achieve a good scalability for a large number of processes, improvements of the communication mechanisms were necessary.
Analysis – An analysis of the communication pattern was performed using Intel Trace Analyzer and Collector (ITAC). It was observed that more time was spent for communication than for computation. The parallel efficiency of the code was below the acceptable limit of 50% when using more than 8 compute nodes. The communication was based on blocking MPI receive and MPI send functions which varied in duration and lead to a partial serialization of the communication.
Optimization – The observed partial communication serialization could be overcome by using non-blocking point-to-point send/receive calls. The overall communication strategy and the internal data structures were reworked to ensure that all data is received by the correct target process. Scaling runs showed, that at a parallel efficiency of 50%, 8x to 10x speedup (depending on the problem size) in comparison to the original version could be achieved.
Summary – By using non-blocking MPI communication between processes, a large improvement of parallel scalability due to eliminating the partial communication serialization was achieved. The optimized code is now ready for massively parallel, strongly scaled simulations on current high-performance cluster platforms.
Click here for more information.
Large improvement in parallel efficiency for high process counts by using non-blocking MPI communication