Support Success Stories
Background – The Applied Geology and Environmental Systems Modeling group of FAU created a numerical model to predict discharge in karst aquifer systems [1], making use of Bayesian inversion with the active subspace method [2] for parameter optimization. The group applied for a KONWIHR project to optimize and parallelize the code. As a first optimization step, the group integrated the Numba just-in-time compiler.
Analysis – The analysis was done on the Woody cluster, more specifically on a single core of a Kaby Lake system running at 3.7 GHz. The test case for the LuKARS model consists of two compartments in a simulated karst aquifer system and took 0.03 s to execute. Before the model can execute, however, the Numba compiler took 6.14 s to generate the required functions.
Support – The support team of NHR@FAU identified multiple temporary allocations in the innermost function of the model. After replacing these small allocations with scalar variables, the runtime of the code for the same inputs was reduced to 0.01 s, with a compile time of 4.87 s.
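The kind of change involved can be sketched as follows (a minimal illustration with made-up function names, not the group's actual model code; in the real code such functions are compiled with Numba's @njit):

```python
import numpy as np

# Minimal sketch (hypothetical names, not the LuKARS code): the original
# innermost function allocated a small temporary array on every call.
# Replacing the array with plain scalar variables removes the per-call
# allocation, which also lets a JIT compiler such as Numba keep the
# values in registers.

def discharge_with_temporaries(level, k_fast, k_slow):
    parts = np.array([k_fast * level, k_slow * level**2])  # small per-call allocation
    return parts.sum()

def discharge_with_scalars(level, k_fast, k_slow):
    fast = k_fast * level          # scalar variables instead of an array
    slow = k_slow * level * level
    return fast + slow
```

Both versions return the same value; the second avoids heap traffic inside the hot loop.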
Besides these optimizations, the NHR@FAU support team recommended parallelizing inside the Python script so that the functions are compiled only once. The group implemented this change.
Summary – Through data-handling optimizations, the calculation of the numerical model achieved a speedup of 3.
[1] Bittner, D., Narany, T. S., Kohl, B., Disse, M., & Chiogna, G. (2018). Modeling the hydrological impact of land use change in a dolomite-dominated karst system. Journal of Hydrology, 567, 267–279.
[2] Teixeira Parente, M., Bittner, D., Mattis, S. A., Chiogna, G., & Wohlmuth, B. (2019). Bayesian calibration and sensitivity analysis for a karst aquifer model using active subspaces. Water Resources Research, 55(8), 7086–7107.
Background – The ab initio quantum chemistry program package ORCA is used for the numerical calculation of molecular frequencies of an organic molecule with 38 atoms using a double-hybrid functional with a triple-zeta basis set.
Analysis – Initially, one of these calculations took 76.4 hours to finish on one “Kaby Lake” node of our high-throughput cluster Woody. On “Ice Lake”, it was even slower: 103.5 hours.
Support – The OpenBLAS library is statically linked into ORCA, and it falsely detected the underlying hardware. By setting the environment variable OPENBLAS_CORETYPE to the correct architecture type, it was possible to speed up the simulation to finish in only 11 hours on one of the “Ice Lake” nodes in Woody.
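In a batch script this fix is a single export before the ORCA invocation; as a Python-side sketch (the core-type name shown is an assumption, since the valid names depend on the OpenBLAS build linked into the binary):

```python
import os

# Sketch of the fix: pin OpenBLAS's runtime CPU dispatch before the
# solver process starts. "SKYLAKEX" is an assumption here -- the valid
# names depend on the OpenBLAS version compiled into the binary.
os.environ["OPENBLAS_CORETYPE"] = "SKYLAKEX"
# A child process started now, e.g. via subprocess.run([...]),
# would inherit this setting.
print(os.environ["OPENBLAS_CORETYPE"])
```

In practice, the equivalent line in the job script is `export OPENBLAS_CORETYPE=<type>` right before starting ORCA.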
Summary – The observation of strange differences in wall time between “newer” and “older” hardware led to an evaluation of the jobs’ roofline diagrams, unveiling unexpected performance behavior.
Performance gain through optimized library usage!
Background – One member of the Pattern Recognition Lab of FAU reached out to NHR@FAU because they observed severely fluctuating job runtimes that first resulted in a decrease in performance and ultimately led to some job cancellations due to the 24-hour wall-time limit on the clusters. The setup of their jobs had been painstakingly adjusted to ensure job completion on an NVIDIA A100 within 13 hours on a presumably smoothly running cluster, and they wanted to know what was suddenly happening to their jobs.
Analysis – A first glance into the user’s directory revealed 120 GB of data in the form of 346,133 files that were continuously read from $HPCVAULT during job runtime. It might not be obvious that this I/O pattern can influence performance, since it is so common for AI applications, but the file systems at NHR@FAU are shared by 1,000 users; if several jobs read millions of files individually over the network, each of these jobs will suffer performance-wise because the central file servers cannot handle this number of I/O requests.
Support – The easiest way to solve this problem is to combine the single files into an archive with zip or tar (in this case, the resulting archive had 10 GB) and then extract the data directly to the node-local storage $TMPDIR via tar xf "$WORK/data.tar" -C "$TMPDIR". The archive extraction time was two hours on $HPCVAULT, but it took only 13 minutes to extract the same data to $TMPDIR.
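The staging pattern can be sketched in Python as well (a self-contained toy example; the temporary directories stand in for $WORK and node-local $TMPDIR):

```python
import os
import pathlib
import tarfile
import tempfile

# Self-contained sketch of the staging pattern: many small files are
# packed into one archive once, and each job then extracts the archive
# to node-local storage in a single streaming read instead of issuing
# one network request per file.
work = tempfile.mkdtemp()      # stands in for $WORK
tmpdir = tempfile.mkdtemp()    # stands in for node-local $TMPDIR

src = pathlib.Path(work, "dataset")
src.mkdir()
for i in range(100):           # stand-in for the 346,133 training files
    (src / f"sample_{i:03d}.dat").write_text("payload")

archive = pathlib.Path(work, "data.tar")
with tarfile.open(archive, "w") as tf:   # done once, ahead of time
    tf.add(src, arcname="dataset")

with tarfile.open(archive) as tf:        # done at job start
    tf.extractall(tmpdir)

n = len(os.listdir(pathlib.Path(tmpdir, "dataset")))
print(n)
```

All subsequent reads during the job then hit the fast local disk rather than the shared file servers.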
Summary – The same job that took 13 hours to complete on an NVIDIA A100 and 28 hours on an NVIDIA V100 when reading data from $HPCVAULT was now able to finish in only 12 hours on an NVIDIA V100 because the data was readily accessible on $TMPDIR.
Performance gain of 130 % by providing data on $TMPDIR.
Background – For a fundamental understanding of diffusive mass transport in electrolyte systems, MD simulations with GROMACS are used to determine the molecular diffusion coefficients for mixtures of varying size, structure, and charge distribution over a broad temperature and composition range at macroscopic thermodynamic equilibrium. Compatible force fields are used to accurately describe the interactions and a special focus lies on the evaluation of electrostatic interactions for mixtures containing ions. Post-processing extracts thermophysical properties like density, mutual diffusivities, the thermodynamic factor, and viscosities from the generated trajectories.
In previous projects, the group had successfully used a setup in GROMACS 5.1.2, an eight-year-old program version released in February 2016, which yielded a performance of 100 ns/day on five Intel Xeon 2660v2 “Ivy Bridge” nodes. To make use of the performance improvements on both CPU and GPU and of the algorithmic advances available in modern GROMACS versions, NHR@FAU offered support for porting new simulations to up-to-date GROMACS versions and setups.
Analysis – Since the choice of force field, integration time step, and temperature and pressure settings was in line with the nature of the simulation system, several benchmarks were run to determine which combination of constraints (all atoms or H-bonds only) and thermostat (Nosé-Hoover or v-rescale) would give the best overall performance. The reasons for running four benchmarks were (1) that some algorithms do not allow offloading to the GPU, which ultimately leads to a performance decrease, and (2) to determine whether the system is stable enough to rely on force field parameters and constraints on H-bonds only instead of adding constraints to all atomic bonds.
Support – A slight decrease in performance is expected due to writing energies every 10 MD steps, but this is necessary for accurate post-processing. The scaling on Fritz appears reasonable for all benchmarks (performance gains of 50% and 70% and parallel efficiencies of 0.75 and 0.6 on two and three nodes, respectively); however, the performance of the “v-rescale” benchmarks is higher on the A40 (759–775 ns/day) than on a total of three Fritz nodes (645–720 ns/day). The performance of the “hbonds” benchmarks is higher than that of the “allbonds” benchmarks. To use the Nosé-Hoover thermostat on the GPU, the update step had to be put back on the CPU, and for GPU usage in general, the addition of communication-related environment variables was recommended.
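The winning combination corresponds to a few .mdp settings (a minimal fragment reflecting the choices above, not the group's complete parameter file):

```
constraints = h-bonds    ; constrain bonds involving hydrogen only
tcoupl      = V-rescale  ; stochastic velocity-rescaling thermostat
nstenergy   = 10         ; write energies every 10 MD steps
```

For GPU runs, recent GROMACS versions additionally honor communication-related environment variables such as GMX_ENABLE_DIRECT_GPU_COMM; which variables apply depends on the GROMACS version in use.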
Summary – The initial setup was already appropriate and needed only slight tweaks: a modern approach to constraints and an updated thermostat setting. The impressive performance improvements were achieved by switching to the newest GROMACS version and using the latest hardware available at NHR@FAU.
A modern program version on modern hardware led to an enormous performance increase of up to 600%.
Background – The Computational Magnetic Resonance Imaging (CMRI) lecture teaches computational imaging techniques for magnetic resonance imaging (MRI) through computer exercises. Recently, the size of the lecture has grown significantly, so that the department is unable to provide local computer access for all students. Moreover, the variety of operating systems, hardware, and installed software on personal computers adds a complexity to the initial setup that can hardly be supported individually.
Analysis – Students need a unified development environment with integrated CPU and GPU resources for computer exercises and NHR@FAU can offer that via Jupyterhub.
Support – NHR@FAU created the virtual environments and installed the necessary dependencies to allow students to work on computer exercises using the unified development environment. During the exercise, CPU and GPU resources were blocked to ensure that students had access to appropriate resources without overloading.
Summary – The computer exercises for CMRI were successfully completed with the assistance of NHR@FAU’s support.
Link to the full report Link to interview on YouTube
Background – Superconductors are exotic materials that carry electric current with zero resistance and have wide-reaching applications in energy transport, storage, sensor technology, and quantum computing. What stops them from being used in everyday life is that they are operational only at very low temperatures or very high pressures. High-temperature superconductors are rare and, despite decades of research, still not fully understood. During a test project at NHR@FAU, the NHR@FAU team approached the scientists to offer help in optimizing their code, as monitoring data showed a low vectorization ratio.
Analysis – The application makes heavy use of the data type complex_t, and micro-benchmarks showed that some computations could be improved by explicitly separating the complex numbers into their real and imaginary parts and storing them in separate arrays. The code review inspired the scientists to look more closely at the integral operator. Through the application of the Fast Fourier Transform (FFT), the numerical effort can be reduced from N^2 to N log N.
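Under the assumption that the integral operator acts as a circular convolution with some kernel (a simplification for illustration, not the application's actual operator), the complexity reduction can be sketched as:

```python
import numpy as np

# Sketch: applying a circular convolution operator directly costs O(N^2);
# via the convolution theorem and the FFT it costs O(N log N).
rng = np.random.default_rng(0)
N = 256
f = rng.standard_normal(N)   # discretized field
k = rng.standard_normal(N)   # hypothetical kernel of the operator

# Direct O(N^2) evaluation.
direct = np.array([sum(k[(i - j) % N] * f[j] for j in range(N))
                   for i in range(N)])

# O(N log N) evaluation in the frequency domain.
fast = np.fft.ifft(np.fft.fft(k) * np.fft.fft(f)).real

print(np.allclose(direct, fast))  # both paths give the same result
```

For N = 1024^2, as in the typical discretization mentioned below, the difference between N^2 and N log N operations is what makes the hundredfold speedup possible.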
Support – We assisted in improving vectorization and function inlining by the compiler and provided insights that influenced the application of FFTs.
Summary – By improving vectorization and transforming the problem into the frequency domain, the scientists achieved a significant speedup of more than a hundredfold for a typical discretization of N = 1024^2.
Background – In replica-exchange molecular dynamics simulations (REMD), the same system is simulated multiple times at different temperatures and the structures are frequently exchanged between the temperatures to enhance conformational sampling on the free energy landscape without getting trapped in local minima.
Analysis – An REMD simulation with GROMACS initially reached 124 ns/day for 26 replicas on 12 dual-socket Intel Ice Lake nodes (i.e. a total of 864 cores on “Fritz”). Estimated hardware costs: € 60,800.
Support – NHR@FAU was able to port this simulation to eight NVIDIA A40 GPUs in one node on the GPGPU cluster (“Alex”). Since the number of replicas in this case was not a multiple of eight (GROMACS can handle multiples of eight by itself), porting was non-trivial and required the PP and PME tasks to be assigned to the GPUs by hand.
Summary – Switching to GPGPUs yielded a performance of 120.6 ns/day with estimated hardware costs of € 48,700.
Retaining performance & reducing hardware costs by 2/3!
Background – Physical geographers and climate scientists of FAU optimized and ran a regional climate model for the South Island of New Zealand as part of the NHR project “ATMOS”. The modelling was done using the state-of-the-art Weather Research and Forecasting (WRF) model and covers 16 years of data (2005–2020). This model is especially interesting for its good spatial and temporal resolution. The created dataset will help the scientists study the relationships between the ocean around New Zealand and climate and glacier mass changes in the high altitudes of the Southern Alps.
Given the fact that the climate of the Southern Hemisphere has historically been understudied compared to the Northern Hemisphere, the dataset provides a unique and valuable tool for investigations of climate change and related impacts in southern New Zealand with high interdisciplinary relevance.
Analysis – Publishing such a dataset for use by other scientists and interested parties is worthwhile because of its uniqueness. However, a globally accessible repository that can handle datasets of at least 0.5 TB must be used to make sure that the data is available in a FAIR (Findability, Accessibility, Interoperability, Reuse) way (https://www.go-fair.org/fair-principles/).
Support – We assisted in selecting a suitable archive and provided support for using the Swiftbrowser on our HPC systems.
Summary – The data were published at the World Data Centre for Climate (WDCC), a service by DKRZ (Deutsches Klimarechenzentrum GmbH). They can be found at https://www.wdc-climate.de/ui/project?acronym=NZ-PROXY.
This is the first FAIR data publication from calculations done on a system of NHR@FAU.
Background – When setting up large simulations in GROMACS, using multiple GPUs can be beneficial for performance because the required work is divided among several workers. However, there is a multitude of command line flags available for GROMACS and the handling of optimized parallelization is often not straightforward.
Analysis – The initial setup by the user yielded 11.8 ns/day on eight NVIDIA A40 GPUs in one node on our GPGPU cluster Alex.
Support – Choosing appropriate environment variables for improved communication between GPUs and adjusting the runtime parameters to the hardware yielded an optimized performance of 20 ns/day on four A40 GPUs.
Summary – Incorrect usage of GROMACS’s command line flags for parallelization can lead to a decrease in performance. NHR@FAU therefore recommends contacting the support team when handling large simulations.
Twice the performance on half of the resources!
Background – FAU’s video portal recordings have not been accessible to the deaf, the hard of hearing, and non-native speakers due to the low quality of automatic captions. In autumn 2022, OpenAI released its new automatic speech recognition system called “Whisper” which was pre-trained on multilingual and multitask supervised data.
Analysis – Whisper uses PyTorch and the automatic speech recognition model can be executed on CPUs as well as GPUs.
Optimization – Using a small fraction of NHR@FAU’s GPGPU cluster Alex, all 40,000 audio and video recordings were processed in approximately one day. A total of 2,500 GPU hours was sufficient to transcribe all these recordings with the “medium” Whisper model.
Summary – The automatically generated transcripts will soon become available on FAU.tv. In the long term, a full-text search capability based on these transcripts shall be added to FAU’s video portal, taking FAU a big step forward in making its video portal more accessible.
The overall execution was more than 15 times faster than real time.
Background – Classical simulations of moderately-sized quantum circuits for Hamiltonian simulation showed poor performance in the cluster monitoring, with strongly fluctuating FLOPS and vectorization values.
Analysis – Profiling data indicated that most of the compute time was spent on translating the quantum circuit, defined with the Python library Cirq, to the external simulator qsim. It turned out that this happens because most gates in the circuit are first decomposed, despite being supported by qsim. The issue is that the gates are defined as instances of a more general gate type, which has no qsim equivalent.
Optimization – When the circuit is constructed using more specialized quantum gates, the translation from Cirq to qsim becomes trivial and its contribution to the runtime negligible. Except for the avoided gate decompositions, this does not change the simulated quantum circuit, only the way it is represented in Cirq.
Summary – With small changes to the code, the overhead due to interfacing Cirq with qsim could be significantly reduced, leading to an overall speedup factor of 4x–5x.
Quantum simulations sped up by 4x–5x via optimized circuit construction.
Background – During scalability tests of a framework performing spatial-temporal Bayesian modelling, an unexpected slowdown was observed for increasing numbers of GPUs/MPI processes.
Analysis – Only some of the MPI processes exhibited much longer runtimes for comparable tasks, while others seemed unaffected. The CPU and GPU affinity was examined for all MPI processes. It turned out that if too many MPI processes performing memory-intensive operations shared the same NUMA domain, they slowed down considerably.
Optimization – After studying the architecture of the GPGPU nodes, it was possible to identify an affinity setup that significantly improved the performance of the implementation. The key step was pinning MPI processes in such a way that the employed CPU cores are optimally connected to the assigned GPU. Additionally, it was made sure that the memory-intensive operations were equally distributed among the different NUMA domains.
Summary – The performance of the single as well as multi-process version was significantly improved by implementing a customized affinity pattern (Runtime & Speedup plot). It is chosen such that the memory bandwidth between the assigned GPUs and CPUs is maximized for each MPI process while also ensuring load balancing between NUMA domains.
Multi-GPU scalability massively improved by optimized MPI and memory affinity.
Background – The authors of GHODDESS, a software for the simulation of ocean circulation, noted a strange phenomenon: A strategically placed call to the function cuCtxCreate, a CUDA function used to initialize a GPU, right before buffer allocation, resulted in higher performance, even though the code did not actually make use of a GPU!
Analysis – A measurement with LIKWID uncovered a clue: the CUDA function call led to lower L2 cache traffic and a higher L1 cache hit rate. An examination of the allocated buffer addresses revealed a lead: the no-CUDA version had huge differences between the addresses of the roughly 50 allocated buffers, whereas the allocations were closely spaced, with no gaps, if preceded by the cuCtxCreate call.
Cache aliasing was the main suspect for the decreased L1 cache hit rate: the allocations mapped the same index of different buffers into the same L1 cache set. If a kernel then touches more than eight different buffers, any further accesses exceed the 8-way associativity of most modern CPUs’ L1 cache. A custom allocator with the altered allocation behavior? Fast code! Round up the buffer allocation sizes to multiples of large powers of two? Slow code!
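The aliasing argument can be reproduced with a short calculation (the cache parameters are illustrative: 64-byte lines and 64 sets, as in a typical 32 KiB, 8-way L1):

```python
# Sketch of the cache-aliasing argument: compute which L1 set an address
# falls into, assuming 64-byte cache lines and 64 sets (32 KiB / 8 ways).
def l1_set(addr, line=64, sets=64):
    return (addr // line) % sets

POW2 = 1 << 20  # buffers spaced by an exact power of two (1 MiB apart)

# Accessing the same index in 16 different power-of-two-spaced buffers:
# every access maps to one and the same cache set.
aligned = {l1_set(buf * POW2) for buf in range(16)}

# The same accesses with a small pad between buffers spread out nicely.
padded = {l1_set(buf * (POW2 + 256)) for buf in range(16)}

print(len(aligned), len(padded))  # 1 set vs. 16 different sets
```

With all buffers competing for a single 8-way set, touching more than eight buffers per loop iteration guarantees evictions; the padded layout avoids the conflict entirely.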
Optimization – We don’t know whether the cuCtxCreate call exchanges the malloc function of libc, alters its internal state through allocations, or causes some other side effect. We can ensure the better memory allocation behavior with a custom allocator that allocates the buffer from a contiguous block and pads the buffer sizes if they are multiples of large powers of two.
Summary – The improved buffer allocation resulted in speed ups of about 70% for the whole program, without the necessity to call any CUDA functions.
CUDA accelerates your code! (GPU optional)
Background – The Metropolis update rule applied to an exponential probability distribution forms the backbone of numerous Markov chain Monte Carlo techniques across all sciences. In the context of the MARQOV software package we tried to optimize this basic building block further.
Analysis – In this context, the Metropolis update rule, r < e^(-dE), requires the comparison of a random real number r with the exponential of a real energy difference dE, and hence necessitates the evaluation of a special function, although the two numbers often differ greatly in magnitude.
Optimization – It turns out that this mismatch in the magnitudes of r and e^(-dE) can be quantified and exploited for an optimization. Using the bit representation of real numbers on a computer, x = m*2^c, in terms of a mantissa m and an exponent c, the comparison can often be decided by comparing c in the representation of the random number r with a suitably scaled version of the energy difference dE; the exact evaluation of the exponential is therefore often unnecessary.
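The idea can be sketched in plain Python, with math.frexp standing in for the direct bit manipulation (a sketch of the principle, not the MARQOV implementation):

```python
import math

LOG2E = math.log2(math.e)

def metropolis_accept(r, dE):
    """Decide r < exp(-dE) while evaluating exp() only in rare
    borderline cases (sketch of the idea, not the MARQOV code)."""
    t = -dE * LOG2E            # log2 of the acceptance threshold exp(-dE)
    _, c = math.frexp(r)       # r = m * 2**c with 0.5 <= m < 1
    if c <= t:                 # log2(r) < c <= t   -> certainly accept
        return True
    if c - 1 >= t:             # log2(r) >= c-1 >= t -> certainly reject
        return False
    return r < math.exp(-dE)   # exponents too close: evaluate exactly
```

For r drawn uniformly from (0, 1), the binary exponent c is decisive for most updates unless |dE| is very small, so the expensive exp() call is usually skipped.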
Summary – Using this trick, the throughput of the update rule could be improved by approximately 50%. Applied to the real-world example of simulating the Heisenberg model at the critical point with the Wolff cluster update, in which the Metropolis rule forms the kernel that is evaluated for a large number of random numbers, this translates to a speedup of about 15%.
The throughput of the update rule improved by nearly 50%; the overall kernel speedup is about 15%.
Background – The application consumes a lot of compute resources and normally takes between 4 and 12 months to run one simulation. The code depends heavily on the PARDISO solver library and some BLAS-3 calls and is implemented in Fortran 77.
Analysis – From the profiling results it was clear that the function blklu_unsym_risc_pardiso was using most of the time: 70% was self time and the remaining 23% was mainly incurred by child functions. A closer look at blklu_unsym_risc_pardiso showed that it is just a high-level function with calls to some PARDISO functions and two BLAS-3 operations. But as MKL was used for the BLAS-3 calls, the profiler could not resolve these BLAS calls and assigned them to the self time (70%) of blklu_unsym_risc_pardiso.
Optimization – The optimization was done in six steps: (1) substitution of MKL calls with normal LAPACK BLAS calls wherever possible, (2) removal of unwanted branches, (3) keeping a mixture of LAPACK BLAS and MKL, (4) removal of function call overhead by manually inlining it, (5) interprocedural optimization using the compiler, and (6) tuning of the interprocedural optimization.
Summary – Code profiling helped us to identify relevant bottlenecks and optimize the code based on these bottlenecks.
Total performance speedup of 1.8x (80%) in a span of two days.
Background – Simulations of a customer were identified in our job-specific performance monitoring because of very low resource utilization. The customer uses the open-source finite element software Elmer/Ice for his simulations.
Analysis – Together with the customer, a relevant test case was defined and a performance analyst at RRZE analyzed the application performance. As it turned out, acquiring a runtime profile for Elmer/Ice is non-trivial, as the code uses a sophisticated on-demand library-loading mechanism that confuses standard runtime profilers. An Intel contact person specialized in Elmer advised us on how to enable runtime profiling in the code itself. Two hotspots consuming the majority of the runtime were identified. A code review revealed that one specific statement was very expensive.
Optimization – Together with the customer, an alternative implementation of this statement was used, which resulted in an overall speedup of 3.4x for our benchmark case. During the static code review with the customer, several simulation steps were identified that needed to be executed only at the end of the simulation.
Summary – This saving of algorithmic work, together with the first optimization, accumulated to a speedup of 9.92x. The effort spent on our side was 1.5 days, of which getting the code to compile and run already took roughly half a day.
Speedup of 9.92x with an effort of 1.5 days. Improvement by saving work.
Background – Our HPC group teamed up with researchers from Universität Regensburg for analysis and performance engineering of the bioinformatics code SCITE. Developed at ETH Zürich, SCITE stochastically searches for the evolutionary history of mutations in a population of tumor cells.
Analysis – The appearance of some functions in the profile is indicative of two typical performance bugs: malloc/free from frequent temporary buffer allocation/deallocation, and memmove/memset from std::vector copies due to pass-by-value. After fixing these performance bugs, the real culprit appeared: a function making extensive use of the exponential function.
Optimization – Allocating the temporary buffers ahead of the loop that uses them and reusing them reduces memory-management overhead. Passing arguments by reference eliminates the vector copies.
In addition to vectorization, careful numerical analysis reveals two opportunities concerning the precision of the exponential function evaluation:
- The last operation in the computations using the exponential functions is a logarithm, which effectively discards some bits of the mantissa.
- Very disparate orders of magnitude allow to discard some exponential function evaluation results that would be absorbed by larger summands anyway.
The resulting exponential function approximation uses AVX2 intrinsics and requires only 2 cycles per call on a Skylake CPU, compared with 16 cycles for the unvectorized exponential function from the standard library.
Summary – Performance bugs in the form of unnecessary allocations and copies were identified and fixed. Combined with an AVX2-vectorized approximation for the extensively used exponential function, a speedup factor of 5 could be realized.
After fixing the performance bugs, the real culprit appears: […] exponential functions.
Background – A user contacted us via the help desk: he suspected that his Python-based machine-learning calculations on TinyGPU were slowed down by scattered access to a very large HDF5 input file. In the ticket he already indicated that putting the file on an SSD had helped a lot on other systems.
Analysis – Most TinyGPU nodes have local SSDs installed. Unfortunately, our documentation was somewhat scattered on this topic; we improved it to prevent similar problems in the future.
Optimization – Putting the input file on an SSD sped up execution by a factor of 13. The user further optimized the access by using the Python library h5py (its read_direct routine), which avoids additional data copies, gaining another factor of 4.
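The copy-free read pattern can be sketched as follows (a self-contained toy example; the file name, dataset name, and shapes are made up):

```python
import numpy as np
import h5py

# Create a small demo file (stand-in for the user's large input file).
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("features", data=np.arange(12.0).reshape(3, 4))

# read_direct() fills a preallocated buffer in place, avoiding the
# intermediate copy that slicing with dset[...] would create.
buf = np.empty((3, 4), dtype=np.float64)
with h5py.File("demo.h5", "r") as f:
    f["features"].read_direct(buf)

print(buf.sum())  # -> 66.0
```

For large datasets, reusing one preallocated buffer across reads also keeps memory consumption flat.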
Summary – The largest speedup is achieved by putting the data on an SSD. Optimizing the read access brings another factor-of-4 improvement.
Reduce file I/O overhead by choosing the right hardware together with optimized data access.
Background – In the scope of a KONWIHR project, an MPI-parallel granular gas solver was optimized to improve its performance and parallel efficiency.
Analysis – The single-core and multi-core performance of the code was analyzed using a simple test case. A runtime profile was generated to identify the most time-intensive parts of the simulation. Additionally, the LIKWID performance tools were used to measure hardware performance metrics like memory bandwidth and FLOP/s. In the multi-core case, the MPI communication pattern was analyzed using the Intel Trace Analyzer and Collector (ITAC) to determine the ratio of communication to computation for different numbers of processes.
Optimization – The runtime profile suggested several possible optimizations of the code. Some functions were not inlined automatically by the compiler but had to be forced via a specific compiler flag. The computational costs of some calculations were reduced by avoiding divisions and by reusing already computed quantities. Additionally, some unnecessary allocation and deallocation of memory was identified. After including these optimizations, the code ran 14.5 times faster than the original version.
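Two of these optimizations, avoiding divisions and reusing computed quantities, follow a simple pattern (an illustrative snippet with made-up names, not the solver's code):

```python
# Illustrative sketch: hoist the division out of the hot loop by
# precomputing the reciprocal once, then use cheap multiplications.
def velocities_naive(dx, dt):
    return [d / dt for d in dx]        # one division per element

def velocities_optimized(dx, dt):
    inv_dt = 1.0 / dt                  # computed once, reused below
    return [d * inv_dt for d in dx]    # multiplications only
```

Divisions typically cost many more cycles than multiplications and often block pipelining, so this kind of hoisting pays off in tight inner loops.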
The analysis of the MPI communication behavior with ITAC revealed a communication share of 30% already for 4 processes, which increased further with the number of processes. Further investigation of the code revealed unnecessary data transfers. By sending only relevant data between processes, the parallel efficiency and performance were increased. For 216 cores, a simple test case ran 80% faster, with an increase in parallel efficiency of 17% compared to the original code.
Summary – By using basic code analysis and optimizations, the runtime of the code was decreased by a factor of 14.5 on a single core. Additionally, a more efficient communication between the MPI processes was able to further decrease the communication overhead and the total runtime of the simulation.
Reduce runtime by a factor of 14.5 by saving work.
Background – A customer contacted us for help in optimizing their LB3D-based software package to simulate soft matter systems at a mesoscopic scale. Prior to the request, the software package had been rewritten in an object-oriented paradigm with a redesign of the computationally intensive routines. To analyze the outcome, the customer wanted to integrate LIKWID’s MarkerAPI to measure specific code regions. The simulation commonly runs on Tier-1 systems such as Hazel Hen at HLRS.
Analysis – The runtime profile showed a few hot functions where most of the execution time was spent. For the analysis, we added MarkerAPI calls around the hot functions and ran some evaluations (FLOP rate, memory bandwidth, and vectorization ratio). The vectorization ratio was rather low even though the compiler was given the proper flags for vectorization. Although Fortran provides handy declarations for creating new arrays, the customers used malloc() calls.
Optimization – The main data structure contains the arrays as pointers (allocated by C wrappers), and the GNU Fortran compiler was not able to determine whether the allocated arrays are contiguous in memory, so it refused to apply AVX2 vectorization. After adding the ‘contiguous’ keyword to the array declarations, the compiler successfully vectorized the hot functions.
Summary – In the one-day meeting with the customers, we gave a hands-on introduction to LIKWID measurements and how to read the results. Moreover, we analyzed code regions in the customers’ software package and found vectorization problems caused by a missing keyword. With the ‘contiguous’ keyword, the performance was increased by 20%. After the one-day meeting, the group continued working on their code, resulting in a three-fold improvement in performance.
Link to more information (PDF)
Given the extremely positive experience in using LIKWID […], we added it to our testing framework.
Background – The MPI-parallel finite-volume flow solver FASTEST-3D was optimized to increase single-node performance and scalability in the scope of a KONWIHR project. To calculate the turbulent flow in technical applications, a fine temporal and spatial resolution is necessary. These simulations can only be run on current high-performance compute clusters. Possible optimizations of the code were investigated to improve the overall performance and to use the computational resources more efficiently.
Analysis – A function profile of the original version of FASTEST-3D was established by the GNU profiler gprof. Additionally, basic hardware requirements like memory bandwidth were determined on function level by integrating the LIKWID Marker API into the code. The linear equation solver was identified as the most time-consuming part of the code for both serial and parallel execution.
Optimization – Since the equation system does not change in the case of the explicit solution procedure, its coefficients need to be computed only once. This is also true for the ILU factorization of the matrix. In the original version of the code, these were recomputed in every iteration. Avoiding these unnecessary recalculations saves about 7% of the computational time. Additionally, the equation system for the pressure correction is symmetric in the explicit case, which was not exploited in the original code version. The optimized code shows an improvement in runtime of about 5%. As a third step, single precision was used to solve the linear equation system. This optimization is beneficial for both implicit and explicit solution procedures and helps reduce the amount of data that has to be loaded during the solution process. The rest of the algorithm is still performed in double precision, which makes an additional data conversion necessary. The use of single precision inside the solver led to a reduction in runtime by 25%.
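The single-precision idea can be sketched with NumPy (a toy system, not the FASTEST-3D solver; the point is that the solve runs on half-width data while the surrounding algorithm stays in double precision):

```python
import numpy as np

# Toy sketch of the mixed-precision scheme: the linear system is solved
# in float32 (halving the data traffic inside the solver), while the
# rest of the algorithm keeps working in float64.
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # double-precision coefficients
b = np.array([1.0, 2.0])

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
x = x32.astype(np.float64)               # converted back for the rest

residual = np.linalg.norm(A @ x - b)     # small: float32-level accuracy
print(residual < 1e-5)
```

Whether float32 accuracy suffices inside the solver depends on the conditioning of the system; here it serves as an inner approximation while all other quantities remain in double precision.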
Summary – By combining information about the most time-consuming functions of the code with hardware metrics like memory bandwidth, the most profitable course of optimization was determined. Combined with the user's specific knowledge of the solution procedures and internal data structures of the code, a total runtime reduction of 40% on a single node was achieved.
Single precision solver reduces run time by 25%.
Background – The scalability of the MPI-parallel flow solver FASTEST-3D was limited by a rigid communication infrastructure, which caused MPI communication time to dominate even at low process counts. To achieve good scalability for a large number of processes, improvements to the communication mechanisms were necessary.
Analysis – An analysis of the communication pattern was performed using Intel Trace Analyzer and Collector (ITAC). It was observed that more time was spent on communication than on computation. The parallel efficiency of the code dropped below the acceptable limit of 50% when using more than 8 compute nodes. The communication was based on blocking MPI send and receive calls, which varied in duration and led to a partial serialization of the communication.
Optimization – The observed partial communication serialization was overcome by using non-blocking point-to-point send/receive calls. The overall communication strategy and the internal data structures were reworked to ensure that all data is received by the correct target process. Scaling runs showed that, at a parallel efficiency of 50%, a speedup of 8x to 10x (depending on the problem size) over the original version could be achieved.
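The pattern behind the fix, sends that return immediately and a single wait for one's own receives at the end, can be modelled in plain Python with threads and queues. This is only an illustration of the MPI_Isend/MPI_Irecv/MPI_Waitall idea, not the solver's actual communication code:

```python
import threading
from queue import Queue

P = 4                                 # number of "ranks" in a ring
inbox = [Queue() for _ in range(P)]   # one posted receive per rank
result = [None] * P

def rank(r):
    """Exchange halo data with the right neighbour, non-blocking style:
    the send is fire-and-forget (like MPI_Isend), and the rank only
    blocks at the end while waiting for its own message (like MPI_Waitall)."""
    inbox[(r + 1) % P].put(("halo-from", r))  # send right, returns at once
    result[r] = inbox[r].get()                # wait for the left neighbour

threads = [threading.Thread(target=rank, args=(r,)) for r in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every rank received the halo of its left neighbour; no ordering of the
# ranks can serialize the exchange, unlike paired blocking send/recv.
```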
Summary – By using non-blocking MPI communication between processes, a large improvement in parallel scalability was achieved through eliminating the partial communication serialization. The optimized code is now ready for massively parallel, strongly scaled simulations on current high-performance cluster platforms.
Large improvement in parallel efficiency for high process counts by using non-blocking MPI communication.
Background – Flow3D is a CFD software package for flow simulations. The simulation domain is distributed across several compute nodes.
Analysis – In the cluster-wide job monitoring, some Flow3D jobs showed load imbalance between compute nodes. The imbalance was caused by the structure of the simulation domain; consequently, some processes had a higher workload than others.
Optimization – The distribution of the domain was improved, and the user was advised to use a different type of compute node or fewer compute nodes. Performance did not drop significantly with the changed node selection, but the resources could be used more efficiently. This resulted in a cost reduction of €14,000 per year for an investment of only 1.5 hours of work.
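A simple way to quantify such an imbalance is the ratio of the maximum to the mean per-node load; a value of 1.0 means perfect balance. The cell counts below are hypothetical and only illustrate the metric:

```python
# Per-node cell counts of a hypothetical block-structured domain.
cells_per_node = [1_200_000, 400_000, 1_100_000, 300_000]

def imbalance(loads):
    """Load imbalance as max/mean; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

before = imbalance(cells_per_node)        # heavily loaded nodes dominate

# Re-partition the same total work over fewer nodes so each node gets
# roughly the mean load instead of idling next to overloaded neighbours.
total = sum(cells_per_node)
nodes = 3
rebalanced = [total // nodes] * nodes
rebalanced[0] += total - sum(rebalanced)  # absorb the division remainder

after = imbalance(rebalanced)
```

Fewer, evenly loaded nodes finish in about the same wall-clock time while consuming fewer node hours, which is where the cost saving comes from.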
Summary – By balancing the workload between compute nodes and using fewer or different compute nodes, the resources could be used more efficiently when running flow simulations with the Flow3D CFD package.
Cost reduction of €14,000 per year by investing only 1.5 hours of work.
Background – A common mistake when submitting jobs on HPC clusters is to reuse old job scripts for different experiments without adjusting them to the current job requirements.
Analysis – The job monitoring revealed jobs that requested five compute nodes (each with 20 physical CPU cores) while the application ran on only a single CPU core.
Optimization – After the job script was corrected, performance remained the same while the job ran on only a single core, thus saving resources. By further switching to a different type of compute node with higher single-core performance but fewer CPU cores, performance could be increased while using the available resources more efficiently.
Summary – By fixing misconfigured job scripts and moving to the optimal compute node type for the job, performance was increased while resource usage was reduced at the same time.
Performance was increased by reducing the resource usage at the same time.
Background – The jobs of a user showed inefficient resource usage in the job monitoring. Some compute nodes executed more processes than they have physical CPU cores, while others executed almost no CPU instructions and hardly used any memory.
Analysis – The imbalances were caused by a poor choice of parameters in the job configuration.
Optimization – After the job scripts were fixed, the workload was distributed equally among all compute nodes, which led to a performance increase of roughly 15%. The user invested the core hours saved from their contingent in additional computations.
Summary – Load imbalance among compute nodes was caused by a poor choice of parameters in the job scripts. The fixed job scripts distribute the work equally, resulting in a performance gain of 15%.
Fixed job scripts distributing work equally result in a performance gain of 15%.
Background – Ab-initio simulations of materials are computationally very demanding. Almost all software distributions offer advanced parallelization options to enhance parallel efficiency and resource utilization. However, these options require knowledge of the implementation, the simulated system, and the computational hardware – the user has to take care of everything.
Analysis – Job monitoring revealed underutilization of compute resources, both in terms of execution units and available main memory.
Optimization – Following the guidelines of the VASP developers, we adjusted the parallelization options to match our hardware and the customer's simulation. The customer achieved a speedup of 100% while using the same number of nodes.
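For VASP, such parallelization options are set via tags in the INCAR file; a typical starting point follows the pattern below. The values shown are placeholders that must be tuned to the specific system and hardware, not the settings used in this case:

```
# INCAR fragment - parallelization tags only (example values)
NCORE = 10    # cores cooperating on one orbital,
              # often set to the cores per socket or NUMA domain
KPAR  = 2     # number of k-points treated in parallel
```

Choosing these tags to match the node topology is exactly the kind of adjustment the VASP guidelines describe.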
Summary – By taking full advantage of the parallelization options provided by VASP, the time to solution was reduced by a factor of 2 while using the same resources.
Proper selection of parallelization parameters leads to a 2x speedup at constant resource usage.
Background – Advanced simulation software packages also come with advanced data distributions. Unfortunately, most packages have a default way of distributing data that may or may not fit the underlying hardware. In most cases, however, the user is able to adjust the data distribution.
Analysis – Job monitoring revealed underutilization of compute resources but a very high load from MPI communication. It turned out that VASP used 15 MPI tasks, i.e., 15 cores, for a single 3D-FFT, while the compute nodes offer 10 cores per socket, 20 in total. Additionally, the customer was running the code with only 12 MPI tasks per node. This resulted in inter-socket and inter-node MPI all-to-all communication.
Optimization – Adjusting the number of MPI tasks involved in a 3D-FFT to the number of cores per socket constrained the MPI all-to-all traffic to a single socket. As the network was no longer a bottleneck, the full set of 20 cores per node could be used.
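The constraint applied here, that the group of MPI tasks sharing one FFT must fit evenly into a socket, reduces to a simple divisibility check. A minimal sketch with the node layout from this case:

```python
# Node layout from the case above: 2 sockets x 10 cores per node.
cores_per_socket = 10

def stays_on_socket(fft_group_size):
    """True if an FFT group of MPI tasks can be pinned inside one socket,
    i.e. the group size divides the socket's core count evenly."""
    return cores_per_socket % fft_group_size == 0

bad = stays_on_socket(15)   # original setting: all-to-all crosses sockets
good = stays_on_socket(10)  # one FFT group per socket: traffic stays local
```

With a group size of 10, two FFT groups fill the 20-core node exactly, which is why the full node could then be used.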
Summary – By adjusting the default data distribution, MPI traffic was reduced significantly and the utilization of the nodes was increased at the same time, reducing the time to solution by 60%.
Overriding the default data distribution reduced the time to solution by 60%.