Support Success Stories
Background – During scalability tests of a framework performing spatial-temporal Bayesian modelling unexpected slowdown was observed for increasing numbers of GPUs/MPI processes.
Analysis – It could be observed that only some of the MPI processes exhibited much longer runtimes for comparable tasks while others seemed to be unaffected. The CPU and GPU affinity was examined for all MPI processes. It turned out that if too many MPI processes performing memory intensive operations, shared the same NUMA domain, they slowed down considerably.
Optimization – After studying the architecture of the GPGPU nodes it was possible to identify an affinity setup that significantly improves the performance of the implementation. The key steps included pinning MPI processes in such a way that the employed CPU cores are optimally connected to the specified GPU. Additionally, it was made sure that the memory intensive operations were equally distributed among the different NUMA domains.
Summary – The performance of the single as well as multi-process version was significantly improved by implementing a customized affinity pattern (Runtime & Speedup plot). It is chosen such that the memory bandwidth between the assigned GPUs and CPUs is maximized for each MPI process while also ensuring load balancing between NUMA domains.
Solving the mystery of the program that ran too fast
Background – The authors of GHODDESS, a software for the simulation of ocean circulation, noted a strange phenomenon: A strategically placed call to the function cuCtxCreate, a CUDA function used to initialize a GPU, right before buffer allocation, resulted in higher performance, even though the code did not actually make use of a GPU!
Analysis – A measurement with LIKWID uncovered a clue: the CUDA function call caused a lower L2 cache traffic and a higher L1 cache hit rate. An examination of the allocated buffer addresses revealed a lead: the no-CUDA version had huge differences between the addresses of the about 50 allocated buffers, whereas the memory allocation would have the allocations closely spaced, with no gaps, if preceded by the cuCtxCreate call.
Cache aliasing was the main suspect for the decreased L1 cache hit: The allocations all mapped the same index of different buffers into the same L1 cache set. Then, if a kernel touches more than eight different buffers, any further accesses exceed the 8x associativity of most modern CPU’s L1 cache. A custom allocator with the altered allocation behavior? Fast code! Round up the buffer allocation sizes to multiples of large powers of two? Slow code!
Optimization – We don’t know whether the cuCtxCreate call exchanges the malloc function of libc, alters its internal state through allocations, or causes some other side effect. We can ensure the better memory allocation behavior with a custom allocator that allocates the buffer from a contiguous block and pads the buffer sizes if they are multiples of large powers of two.
Summary – The improved buffer allocation resulted in speed ups of about 70% for the whole program, without the necessity to call any CUDA functions.
CUDA accelerates your code! (GPU optional)
Optimizing the update rule in the MARQOV software package
Background – The Metropolis update rule applied to an exponential probability distribution forms the backbone of numerous Markov chain Monte Carlo techniques across all sciences. In the context of the MARQOV software package we tried to optimize this basic building block further.
Analysis – In this context, the Metropolis update rule, r < e-dE, requires the comparison of a random real number r with the exponential of a real energy difference dE, and hence necessitates the evaluation of a special function, although both numbers often differ largely in magnitude.
Optimization – It turns out that this mismatch in the magnitudes of r and e-dE can be quantified and exploited for an optimization. By using the bit representations of real numbers x = m*2^c on a computer in terms of a mantissa m and an exponent c it can often be decided by comparing c in the representation of the random number r with a suitably scaled version of the energy difference dE and hence the exact evaluation of the exponential is often not necessary.
Summary – Using this trick the throughput of the update rule could be improved by approximately 50%. Applied to the real world example of simulations of the Heisenberg Model at the critical point with the Wolff-cluster update, in which the metropolis rule forms the kernel which is evaluated for a large amount of random numbers, this translates to a speedup of about 15%.
The throughput of the update rule improved by nearly 50%; the overall kernel speedup is about 15%.
Background – Simulations of a customer were identified in our job-specific performance monitoring because of very low resource utilization. The customer is using the Open Source Finite Element Software Elmer/Ice for his simulations.
Analysis – Together with the customer a relevant test case was defined and a performance analyst at RRZE analyzed the application performance. As it turned out to acquire a runtime profile for Elmer/Ice is non-trivial as the code uses a sophisticated on-demand library loading mechanism which confuses standard runtime profilers. An Intel contact person, that is specialized on Elmer, could give us advice how to enable runtime profiling in the code itself. Two hotspots were identified consuming a majority of the runtime. A code review revealed that one specific statement was very expensive.
Optimization – Together with the customer an alternative implementation of this statement was used which resulted in an overall speedup of x3.4 for our benchmark case. During the static code review together with the customer several simulation steps where identified that could be executed only at the end of the simulation.
Summary – This saving of algorithmic work together with the first optimization accumulated to a speedup of factor x9.92. The effort spent on our side was 1.5 days, where getting the code to compile and run took already roughly half a day.
Click here for more information.
Speedup of x9.92 with an effort of 1.5 days. Improvement by saving work.
Background – Our HPC group teamed up with researchers from Universität Regensburg for analysis and performance engineering of the bio informatics code SCITE. Developed at ETH Zürich, SCITE stochastically searches for the evolutionary history of mutations in a population of tumor cells.
Analysis – The appearance of some functions are indicative for two typical performance bugs: Malloc/free for frequent temporary buffer allocation/deallocation and memmove/memset for std::vector copies due to pass-by-value. After fixing the performance bugs, the real culprit appears: a function making extensive use of the exponential function.
Optimization – Allocating the temporary buffers ahead of the using loop and reusing them reduces memory management overhead. Passing arguments by-reference eliminates the vector copies.
In addition to vectorization, careful numerical analysis reveals two opportunities concerning the precision of the exponential function evaluation:
- The last operation in the computations using the exponential functions is a logarithm, which effectively discards some bits of the mantissa.
- Very disparate orders of magnitude allow to discard some exponential function evaluation results that would be absorbed by larger summands anyway.
The resulting exponential function approximation uses AVX2 intrinsics and requires only 2 cycles per function on a Skylake CPU, compared with 16 cycles for the unvectorized exponential function from the standard library.
Summary – Performance bugs in the form of unnecessary allocations and copies were identified and fixed. Combined with a AVX2 vectorized approximation for the extensively used exponential function, a speedup factor 5 could be realized.
After fixing the performance bugs, the real culprit appears: […] exponential functions
Speeding up machine-learning on GPU accelerated Cluster nodes
Background – A user contacted us via help desk he suspects that his python based machine-learning calculations on TinyGPU are slowed down by scattered access to a very large HDF5 input file. In the ticket he already indicated putting the file on a SSD helped a lot on other systems.
Analysis – Most TinyGPU nodes have local SSDs installed. Unfortunately our documentation was a bit scattered on this topic. We improved the documentation to prevent similar problems in the future.
Optimization – Putting the input file on a SSD speeds up execution by factor 13. The user further optimized the access by using the python library h5py (read_direct routine), which prevents additional data copies, by a factor of 4.
Summary – The largest speedup is achieved by putting the data on a SSD. Optimizing the read access brings another factor 4 improvement.
Reduce file IO overhead by choosing the right hardware together with optimized data access.
Optimization of a granular gas solver
Background – In the scope of a KONWIHR project, an MPI-parallel granular gas solver was optimized to improve its performance and parallel efficiency.
Analysis – The single-core and multi-core performance of the code was analyzed using a simple test case. A runtime profile was generated to identify the most time-intensive parts of the simulation. Additionally, the LIKWID performance tools were used to measure hardware performance metrics like memory bandwidth and FLOP/s. In the multi-core case, the MPI communication pattern was analyzed using the Intel Trace Analyzer and Collector (ITAC) to determine the ratio of communication to computation for different numbers of processes.
Optimization – The runtime profile showed different possible optimizations of the code. Some functions were not inlined automatically by the compiler, but had to be forced by using a specific compiler flag. The computational costs of some calculations were reduced by avoiding divisions and by reusing already computed quantities. Additionally, some unnecessary allocation and deallocation of memory was identified. After including these optimizations, the code was able to run 14.5 times faster than the original version.
The analysis of the MPI communication behavior with ITAC revealed a share of 30% for communication between 4 processes, which increased further with increasing process number. More investigations on the code showed unnecessary data transfers. By sending only relevant data between processes, the parallel efficiency and performance were increased. For 216 cores, a simple test case was able to run 80% faster with an increase in parallel efficiency of 17% in comparison to the original code.
Summary – By using basic code analysis and optimizations, the runtime of the code was decreased by a factor of 14.5 on a single core. Additionally, a more efficient communication between the MPI processes was able to further decrease the communication overhead and the total runtime of the simulation.
Reduce runtime by factor of 14.5 by saving work.
Optimization of soft-matter simulation package on top of LB3D
Background – A customer contacted us to help them optimizing their LB3D-based software package to simulate soft matter systems at a mesoscopic scale. The software package was rewritten prior to the request to an object-oriented paradigm with redesign of computationally intensive routines. In order to analyze the outcome, the customer wanted to integrate LIKWID’s MarkerAPI to measure specific code regions. The simulation commonly runs on Tier-1 systems like Hazel Hen at HLRS.
Analysis – The runtime profile showed a few hot functions where most execution time was spent. For analysis, we added MarkerAPI calls around the hot functions and ran some evaluations (FLOP rate, memory bandwidth and vectorization ratio). The vectorization ratio was rather low but the compiler got the proper flags for vectorization. Despite Fortran provides handy declarations to create a new array, the customers used ‘malloc()’ calls.
Optimization – The main data structure contains the arrays as pointers (allocated by C-wrappers) and the GNU Fortran compiler was not able to determine whether the allocated arrays are contiguous in memory or not, so refused to apply AVX2 vectorization. By adding the ‘contiguous’ keyword to the array declarations, the compiler successfully vectorized the hot functions.
Summary – In the one-day meeting with the customers, we did a hands-on on LIKWID measurements and how to read the results. Moreover, we analyzed code regions in the customers’ software package and found vectorization problems caused by a missing keyword. With the ‘contiguous’ keyword, the performance was increased by 20%. After the one-day meeting, the group continued working on their code resulting in a three-fold improvement in performance.
Click here for more information.
Given the extremely positive experience in using LIKWID […], we added it to our testing framework
Node-level performance optimization of flow solver
Background – The MPI-parallel finite-volume flow solver FASTEST-3D was optimized to increase single-node performance and scalability in the scope of a KONWIHR project. To calculate the turbulent flow in technical applications, a fine temporal and spatial resolution is necessary. These simulations can only be run on current high-performance compute clusters. Possible optimizations of the code were investigated to improve the overall performance and to use the computational resources more efficiently.
Analysis – A function profile of the original version of FASTEST-3D was established by the GNU profiler gprof. Additionally, basic hardware requirements like memory bandwidth were determined on function level by integrating the LIKWID Marker API into the code. The linear equation solver was identified as the most time-consuming part of the code for both serial and parallel execution.
Optimization – Since the equation system does not change in case of the explicit solution procedure, its coefficients have to be computed only once. This is also true for the ILU factorization of the matrix. In the original version of the code, these were computed for every iteration. Avoiding these unnecessary recalculations saves about 7% of computational time. Additionally, the equation system for the pressure correction is symmetric in the explicit case, which was not exploited in the original code version. The optimized code shows an improvement in runtime by about 5%. As a third step, single precision was used to solve the linear equation system. This optimization is beneficial for both implicit and explicit solution procedures and helps reduce the amount of data which has to be loaded during the solution process. The rest of the algorithm is still performed using double precision, which makes an additional data conversion necessary. The use of single precision inside the solver led to a reduction in runtime by 25%.
Summary – By combining information about the most time-consuming functions of the code with hardware metrics like memory bandwidth, the most profitable course of optimization was determined. Together with specific knowledge of the user in the area of solution procedures and internal data structures of the code, a total reduction in runtime by 40% on a single node could be achieved.
Click here for more information.
Single precision solver reduces run time by 25%
Optimization of MPI communication of flow solver
Background – The scalability of the MPI parallel flow solver FASTEST-3D was limited by a rigid communication infrastructure, which led to a dominance of MPI communication time even at low process counts. To achieve a good scalability for a large number of processes, improvements of the communication mechanisms were necessary.
Analysis – An analysis of the communication pattern was performed using Intel Trace Analyzer and Collector (ITAC). It was observed that more time was spent for communication than for computation. The parallel efficiency of the code was below the acceptable limit of 50% when using more than 8 compute nodes. The communication was based on blocking MPI receive and MPI send functions which varied in duration and lead to a partial serialization of the communication.
Optimization – The observed partial communication serialization could be overcome by using non-blocking point-to-point send/receive calls. The overall communication strategy and the internal data structures were reworked to ensure that all data is received by the correct target process. Scaling runs showed, that at a parallel efficiency of 50%, 8x to 10x speedup (depending on the problem size) in comparison to the original version could be achieved.
Summary – By using non-blocking MPI communication between processes, a large improvement of parallel scalability due to eliminating the partial communication serialization was achieved. The optimized code is now ready for massively parallel, strongly scaled simulations on current high-performance cluster platforms.
Click here for more information.
Large improvement in parallel efficiency for high process counts by using non-blocking MPI communication
Fixing load imbalance in flow simulations using Flow3D
Background – Flow3D is a CFD software for flow simulations. The simulation domain is distributed on various compute nodes.
Analysis – In the cluster-wide job monitoring some Flow3D jobs showed load imbalance between compute nodes. The imbalance is caused by the structure of the simulation domain and consequently some processes had a higher workload than others.
Optimization – The distribution of the domain was improved and the user was recommended to use another type of compute node or less compute nodes. The performance didn’t drop significantly with different node selection but the resources could be used more efficiently. This results in a cost reduction by 14000€ per year by investing only 1.5 hours of work.
Summary – By balancing the workload between compute nodes and using other/less compute nodes, the resources of the compute nodes could be used more efficiently
when executing flow simulations with Flow3D CFD package.
Cost reduction by 14000€ per year by investing only 1.5 hours of work
Fixing misconfiguration in job scripts
Background – A common mistake when submitting jobs on HPC clusters is to reuse old job scripts for different experiments without adjusting them to the current job requirements.
Analysis – The job monitoring revealed jobs that were requesting five compute nodes (each having 20 physical CPU cores) but the application ran only on a single CPU core.
Optimization – After the improvement of the job script, the performance was still the same while running only on a single core, thus saving resources. By further using a different type of compute nodes with higher single core performance but less CPU cores, the performance could be increased with a more efficient usage of the available resources.
Summary – By fixing misconfigurations in job scripts and moving to the optimal compute node type for the job, the performance was increased by reducing the
resource usage at the same time.
Performance was increased by reducing the resource usage at the same time
Inefficient resource usage by oversubscribing single nodes
Background – The jobs of a user showed an inefficient resource usage in the job monitoring. Some of the compute nodes executed more processes than physical CPU cores while others almost didn’t execute any CPU instructions or used any memory.
Analysis – The imbalances were caused by bad parameter selection at the job configuration.
Optimization – After fixing the job scripts, the workload was equally distributed among all compute nodes which caused a performance increase of roughly 15%. The saved core hours in the user’s contingent were invested by the user in additional computations.
Summary – Load imbalance among compute nodes was caused by bad parameter selection in the job scripts. Fixed job scripts distributing work equally result in a performance gain of 15%.
Fixed job scripts distributing work equally result in a performance gain of 15%
Fixing parallelization parameters
Background – Ab-initio simulations of materials are computationally very demanding. Almost all software distributions offer advanced parallelization options to enhance parallel efficiency and resource utilization. However, these options require knowledge of the implementation, the simulated system and the computational hardware – the user has to care about everything.
Analysis – Job monitoring revealed under utilization of compute resources, both in terms of execution units as well as available main memory.
Optimization – Following the guidelines of the VASP developers, we were able to adjust the parallelization options according to our hardware and the simulation of the customer. The customer achieved a speedup of 100% while using the same amount of nodes.
Summary – Taking full advantage of the parallelization options provided by VASP, the time to solution was reduced by a factor of 2 while using the same resources.
Proper selection of parallelization parameters lead to 2x speedup at constant resource usage
Fixing work distribution
Background – Advanced simulation software-packages also come with advanced data distributions. Unfortunately most packages do have a standard way of distributing data that may or may not fit to the underlying hardware. The user, however, is able to adjust the data distribution in most of the cases.
Analysis – Job monitoring revealed under utilization of compute resources, but a very high load in the MPI communication. It turned out that VASP used 15 MPI tasks, e.g. 15 cores for a single 3D-FFT, while the compute nodes offer 10 cores per socket, 20 in total. Additionally the customer was running the code with only 12 MPI tasks per node. This resulted in inter socket and inter node MPI all-to-all communication.
Optimization – Adjusting the number of MPI tasks involved in a 3D-FFT to the number of cores per socket constraint the MPI all-to-all traffic to a single socket. As the network was not a bottleneck anymore, the full set of 20 cores per node could be used.
Summary – Adjusting the default data distribution MPI traffic was reduced significantly and the utilization of the nodes was increased at the same time, leading to a reduced time to solution of 60%.
Overwriting the default data-distribution time to solution was reduced by 60%