Multi-GPU Gromacs Jobs on TinyGPU


Many of our users at FAU are aware that it is possible to allocate more than one GPU on a TinyGPU node. The motivation for running a calculation on more than one GPU is simply the hope for higher performance, since gathering enough data for a research project is often time consuming. So it may come as a surprise that using multiple GPUs for an MD simulation with Gromacs can actually worsen performance or, even if it yields some gain, still waste resources. Whether it does strongly depends on the Gromacs version, how Gromacs is started, the GPUs used and, most importantly, the configuration of the simulation.

The setup and first results

Let’s have a closer look at a standard all-atom MD simulation (protein in membrane) with explicit water (65,209 atoms in total) from one of our users. Command-line flags (e.g. -pme gpu -bonded gpu) ensure that as many calculations as possible are offloaded to the GPU, in this case an NVIDIA RTX2080Ti.
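For orientation, such a run is started roughly as sketched below; the file names (md.tpr/md), thread counts and pinning are placeholders, not the user’s actual job settings:

# sketch of a single-GPU run with maximal offloading
gmx mdrun -deffnm md \
    -nb gpu -pme gpu -bonded gpu \
    -ntmpi 1 -ntomp 8 -pin on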

Running on one GPU with Gromacs2019, we obtain about 134 ns/day; two GPUs yield up to 176 ns/day and four GPUs give us 230 ns/day. These results might look promising: by quadrupling the number of GPUs, the overall performance increases; after all, we get up to 96 ns/day (~72 %) more than with just one GPU.

What about using newer program versions?

Now let us switch to Gromacs2020 on the same hardware. One GPU: 229 ns/day, two GPUs: 188 ns/day, and four GPUs: 273 ns/day. The picture is similar with Gromacs2021; one GPU: 233 ns/day, two GPUs: 195 ns/day, and four GPUs: 294 ns/day.

The gist of this is that newer versions of Gromacs deliver the same performance (ns/day) on one GPU as older versions on four. With the new versions, however, the speedup from one to four GPUs is very limited, which makes the additional GPUs a waste of resources.

And with newer hardware?

On a single NVIDIA RTX3080, the same simulation with Gromacs2021 gives 266 ns/day; two GPUs give 203 ns/day, four GPUs yield 281 ns/day, and on eight GPUs we get 322 ns/day. Keeping the same input and Gromacs version, the performance on an NVIDIA A100/40GB with SXM4/NVLink is as follows: one GPU: 332 ns/day, two GPUs: 247 ns/day and four GPUs: 334 ns/day.

It is fairly obvious that the newest generation of NVIDIA GPUs yields the best overall performance due to changes in the architecture. But again, the performance increase from one to multiple GPUs on the RTX3080 is only about 21 %, even with eight of them; given the cost of the hardware, this does not make sense.

Why is this happening?

One part of the explanation is that communication between multiple GPUs is limited, especially without NVLink. The Gromacs developers collaborated with NVIDIA engineers and implemented a so-called halo exchange to increase performance with multiple GPUs. With Gromacs2020, the following environment variables need to be set when using multiple GPUs:

export GMX_GPU_PME_PP_COMMS=true                # direct GPU-to-GPU communication between PME and PP tasks
export GMX_GPU_DD_COMMS=true                    # GPU halo exchange for the domain decomposition
export GMX_GPU_FORCE_UPDATE_DEFAULT_GPU=true    # run update and constraints on the GPU by default

It is not necessary to set these variables for Gromacs2021; they are already included and setting them explicitly might actually decrease performance again.
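In context, a Gromacs2020 run across four GPUs could then be launched roughly like this; the split into three PP ranks plus one dedicated PME rank (-ntmpi 4 -npme 1) is only one plausible layout and not taken from the user’s job script:

gmx mdrun -deffnm md \
    -nb gpu -pme gpu -bonded gpu \
    -ntmpi 4 -npme 1 -ntomp 4 -pin on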

Secondly, the performance increase from Gromacs2019 to Gromacs2020 comes from being able to offload yet another part of the calculation to the GPU. Up until Gromacs2019, the following calculations could be offloaded: non-bonded interactions, electrostatics using PME, and bonded interactions. With the release of Gromacs2020, the time-consuming calculation of “updates and constraints” (via the flag -update gpu) can also be offloaded, which yields a large performance gain. However, offloading “updates and constraints” does not work for all input configurations when using multiple GPUs, which leads to the observed slowdown for our user’s input.
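On a single GPU with Gromacs2020 or newer, the additional offload is requested roughly as follows (again a sketch with placeholder file names and thread counts):

gmx mdrun -deffnm md \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntmpi 1 -ntomp 8 -pin on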

Furthermore, additional code changes have been implemented in Gromacs2021 that are able to increase performance slightly. Most of the recent changes are about optimizing calculation loops to improve parallel performance; none of these changes impact basic MD functionality.

Are there cases at all that benefit from multiple GPUs?

This is a tricky question. There are some benchmark inputs, quite different from what FAU’s typical users simulate, that do show a speedup from one to eight GPUs.

Let’s start with tests that we ran on AMD MI100 GPGPUs. These are quite different from Nvidia GPGPUs, and Gromacs support for AMD GPUs is not yet as mature as for Nvidia GPUs. We only have a few MI100 available in our test cluster; these benchmarks were therefore run during a limited external test drive.

  1. STMV with reaction-field (RF) electrostatics: 1 GPU: 15 ns/day, 8 GPUs: 34 ns/day → gain: 127 %
  2. STMV with PME electrostatics: 1 GPU: 12 ns/day, 8 GPUs: 29 ns/day → gain: 142 %
  3. Lignocellulose with RF electrostatics: 1 GPU: 6 ns/day, 8 GPUs: 17 ns/day → gain: 183 %

Unfortunately, the speedup alone does not tell the whole story; we also have to take the “parallel efficiency” into account. The parallel efficiency is the ratio of the speedup S(N), which quantifies how much faster we can compute with N devices instead of one, to the number of devices, N. For “perfect speedup” we have S(N)=N and thus S(N)/N=1. The parallel efficiency tells us which fraction of the resources is actually used for computation. For instance, if the efficiency is below 0.5, more than half of the resources are wasted.

So for the STMV-RF benchmark, the speedup with eight GPUs is S(N) = S(8) = 2.27, but the parallel efficiency S(N)/N = S(8)/8 = 2.27/8 = 0.28 is well below 0.5, a threshold that we consider just about acceptable. The parallel efficiency for the STMV-PME benchmark is 0.30 and that for the Lignocellulose benchmark is 0.35. Therefore, even performance increases of more than 50 % can turn out to be a disappointment with regard to parallel efficiency.
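As a quick sanity check, speedup and parallel efficiency can be computed directly from the ns/day numbers, for example with a small shell helper like the one below (the function name peff is ours; the values shown are the STMV-RF numbers from above):

# usage: peff <ns/day on 1 GPU> <ns/day on N GPUs> <N>
peff () {
    awk -v p1="$1" -v pN="$2" -v n="$3" \
        'BEGIN { s = pN/p1; printf "speedup S(N) = %.2f, efficiency S(N)/N = %.2f\n", s, s/n }'
}
peff 15 34 8    # STMV-RF on MI100: S(8) = 2.27, efficiency = 0.28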

Nvidia reported some results for the STMV and Lignocellulose benchmarks to us, showing a speedup of 2.0-2.5 when going from one to four Nvidia A40 GPUs connected via PCIe only. On eight A100 with NVLink, Nvidia obtained an impressive 120 ns/day for STMV with PME at a parallel efficiency of 0.7.

But unfortunately, STMV is not the type of simulation people at FAU typically run with Gromacs.

Conclusion

Up until now, we know of no (relevant) MD simulation input from FAU scientists that really benefits from a multi-GPU setup. Coming back to our first example, where we observed a speedup of 72 % with four GPUs, the parallel efficiency is only 0.43, so this “promising” result also turns out to be a waste of resources after all.

NHR@FAU recommendation: Before increasing the GPU count, switch to newer hardware; but most importantly, use the newest version of the Gromacs simulation code provided as modules by our industrious admins and check whether this already improves performance.
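As a starting point, a single-GPU job on TinyGPU could be structured roughly like the sketch below; the Slurm resource requests, module name and file names are assumptions that need to be replaced by the values from our documentation and your own setup:

#!/bin/bash -l
#SBATCH --gres=gpu:1            # one GPU is usually the most efficient choice
#SBATCH --cpus-per-task=8       # placeholder, match the CPU share per GPU on the node
#SBATCH --time=24:00:00

module load gromacs             # placeholder, pick the newest GPU-enabled Gromacs module
gmx mdrun -deffnm md \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntmpi 1 -ntomp "$SLURM_CPUS_PER_TASK" -pin on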

If you are unsure how to optimize your simulation setup or need help with benchmarking, please do not hesitate to let us know.

Also check the comprehensive information from the Gromacs developers and Nvidia, e.g., https://aip.scitation.org/doi/10.1063/5.0018516 and https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/.