GROMACS 2024.1 on brand-new GPGPUs

The first patch release for the newest GROMACS version 2024 was published at the end of February. Recently, several new GPGPUs were added to the test cluster at NHR@FAU. As we benchmarked our usual set of six simulation systems, we decided to write a follow-up to our earlier GROMACS benchmark posts (Multi-GPU GROMACS Jobs on TinyGPU and GROMACS performance on different GPU types), prompted by some interesting results on the latest hardware.

The Benchmarks

  1. R-143a in hexane (20,248 atoms) with a very high output rate,
  2. a short RNA piece with explicit water (31,889 atoms),
  3. a protein inside a membrane surrounded by explicit water (80,289 atoms),
  4. a protein in explicit water (170,320 atoms),
  5. a protein membrane channel with explicit water (615,924 atoms), and
  6. a huge virus protein (1,066,628 atoms).

To run these benchmarks with GROMACS 2024.1 on a GPU, the following command was executed:

$ gmx mdrun -v -s $benchmark.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
      -ntmpi 1 -ntomp 16 -pin on -pinstride 1 -nsteps 200000 -deffnm $benchmark_name

This offloads all calculations that can be offloaded to the GPU and pins the program threads to CPU cores; the number of OpenMP threads depends on how many CPU cores are available. Since several GPUs share one host CPU in Alex, each GPU has access to only a fraction of the host, which on Alex is 16 cores. The flags -nsteps, -v, and -deffnm are used for convenience during benchmarking and are not recommended for production runs; -v in particular should only be used for debugging.
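As a minimal sketch, a corresponding Slurm batch script on a cluster like Alex could look as follows; the partition name, GRES string, module name, and file names are assumptions and must be adapted to the actual cluster setup:

#!/bin/bash -l
#SBATCH --gres=gpu:a40:1        # one A40 GPU (assumed GRES name)
#SBATCH --partition=a40         # assumed partition name
#SBATCH --cpus-per-task=16      # the share of host cores belonging to one GPU
#SBATCH --time=01:00:00

module load gromacs             # assumed module name

gmx mdrun -s benchmark.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntmpi 1 -ntomp $SLURM_CPUS_PER_TASK -pin on -pinstride 1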

The (new) accelerators at NHR@FAU

Name  | Architecture | # CUDA cores | Power consumption | RAM   | Memory bandwidth
H100  | Hopper       | 16,896       | 700 W             | 80 GB | 3.35 TB/s
GH200 | Hopper       | 16,896       | 1,000 W           | 96 GB | 4 TB/s
A100  | Ampere       |  6,912       | 400 W             | 40 GB | 1.5 TB/s
A40   | Ampere       | 10,752       | 300 W             | 48 GB | 696 GB/s
L40   | Ada Lovelace | 18,176       | 300 W             | 48 GB | 864 GB/s
L40S  | Ada Lovelace | 18,176       | 350 W             | 48 GB | 864 GB/s

Since GROMACS can only use the CUDA cores of these GPUs, performance will correlate with their number; in addition, a higher power budget usually allows higher clock rates and thus higher performance. The amount of RAM and the memory bandwidth are not limiting factors for MD simulations of the listed benchmarks. Overall, the L40 and L40S should therefore give the highest performance numbers due to their large number of CUDA cores.

The GPGPU benchmark results

The results of the benchmarks with GROMACS 2024 on the new GPGPU hardware will be compared to the performance of the NVIDIA A40 and A100 GPGPUs, which are both available in our Alex cluster. In particular, the A40 is ideal for MD simulation workloads due to its larger number of CUDA cores compared to the A100.

[ns/day] | System 1 | System 2 | System 3 | System 4 | System 5 | System 6
A100     | 267.5    | 691.7    | 265.8    | 129.1    | 38.8     | 23.2
A40      | 277.0    | 731.8    | 268.1    | 122.3    | 34.9     | 19.0

One of the reasons to switch to the latest GROMACS version is the continuous improvement of the code, which by itself can already lead to slight performance increases; we published a short article about this in the past. In fact, the performance increases from GROMACS 2023.2 to the 2024 version on the A100 in Alex are as follows:

  • System 1: 258.7 ns/day → 4 %
  • System 2: 655.3 ns/day → 6 %
  • System 3: 260.3 ns/day → 1 %
  • System 4: 126.6 ns/day → 2 %
  • System 5: 38.7 ns/day → 3 %
  • System 6: 23.2 ns/day → 0 %
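Each entry lists the GROMACS 2023.2 performance and the relative gain of the 2024 version over it; for System 2, for example:

\frac{691.7 - 655.3}{655.3} \approx 0.06 = 6\,\%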

Admittedly, the performance increases are not that big for the large systems, but for the smaller ones they have a noticeable impact on the overall simulation time. Moreover, switching from even older GROMACS versions can lead to larger performance improvements on the A100.

         | GROMACS 2021          | GROMACS 2022.3
System 1 | 184.8 ns/day → 44.8 % | 241.7 ns/day → 10.7 %
System 2 | 613.2 ns/day → 12.8 % | 646.8 ns/day → 6.9 %
System 3 | 241.2 ns/day → 10.2 % | 248.8 ns/day → 6.8 %
System 4 | 122.8 ns/day → 5.1 %  | 124.7 ns/day → 3.5 %
System 5 | 37.6 ns/day → 3.2 %   | 37.0 ns/day → 4.9 %
System 6 | 21.9 ns/day → 5.9 %   | 22.2 ns/day → 4.5 %

Similar performance improvements can be observed on the A40, too. Why stick to an older code version when the new code has improved algorithms for the exact same simulation method and, thus, optimizes time to solution?

As we have seen previously, small simulation systems yield higher performance on the A40 than on the A100, whereas larger systems benefit from the higher memory bandwidth of the A100. The explanation is simple: the number of atoms in the small systems is too low to saturate the GPU, so the communication between GPU and CPU becomes the performance limiter. That performance decreases with the amount of output is evident from the comparison of System 1 and System 2: although similar in size, the former benchmark writes energy output 333 times more frequently than the latter and thereby loses performance by a factor of 2.6. The output rate is set in the .mdp file, as sketched below.
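For reference, the output frequency is controlled by .mdp parameters like the following; the values shown here are illustrative placeholders, not the actual benchmark settings:

nstxout            = 0      ; steps between writing full-precision coordinates (0 = never)
nstvout            = 0      ; steps between writing velocities
nstxout-compressed = 5000   ; steps between writing compressed coordinates
nstenergy          = 5000   ; steps between writing energies; very small values cost performance
nstlog             = 5000   ; steps between updates of the log file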

[ns/day] | System 1 | System 2 | System 3 | System 4 | System 5 | System 6
L40      | 416.1    | 1254.7   | 463.3    | 212.1    | 57.8     | 31.7
L40S     | 405.3    | 1219.7   | 472.4    | 229.2    | 68.3     | 37.9

Both the A40 and the A100 have successors in the Ada Lovelace and Hopper architectures, respectively. For the A40, there are two successors from the Ada Lovelace generation: the L40 and the L40S. Both have nearly identical specifications (18,176 CUDA cores, 48 GB RAM, and a memory bandwidth of 864 GB/s); only the power consumption differs slightly, with 300 W for the L40 and 350 W for the L40S. In theory, a higher power budget allows faster GPU clocks and therefore higher overall performance. However, this does not hold for the two small benchmarks: these systems are not big enough to fully occupy the GPU, and communication limits the performance. Moreover, the first benchmark has such a high output frequency that, although its size is about two thirds of that of System 2, it reaches only roughly a third of System 2's performance. The same observation can be made on all other GPUs, which is why we recommend setting up the NVIDIA MPS server for systems with fewer than 50,000 atoms [1].

[ns/day] | System 1 | System 2 | System 3 | System 4 | System 5 | System 6
H100     | 354.4    | 1032.8   | 400.4    | 205.0    | 63.5     | 37.5
GH200    | 411.5    | 1123.5   | 470.6    | 232.6    | 75.0     | 44.4

The successors of the A100 are the H100 and the GH200, where the latter is an H100 combined with an ARM-based NVIDIA Grace CPU. Both new GPUs have the same number of CUDA cores (16,896), but the specs of the GH200 are slightly higher than those of the H100: the H100 consumes up to 700 W and provides 80 GB RAM with a memory bandwidth of 3.35 TB/s, whereas the GH200 can consume up to 1,000 W and provides 96 GB RAM with a memory bandwidth of 4 TB/s. These differences, as well as the close coupling of CPU and GPU in the GH200, might be reflected in the performance numbers, as those for the GH200 are slightly better than those for the H100.

To compare the performance of the GPUs relative to the recommended GPU for MD simulations, namely the A40, we normalized the numbers by dividing the performance result of each GPU by that of the A40.

[ns/day] / [ns/day]_A40 | System 1 | System 2 | System 3 | System 4 | System 5 | System 6
H100                    | 1.28     | 1.41     | 1.49     | 1.68     | 1.82     | 1.97
GH200                   | 1.49     | 1.54     | 1.76     | 1.90     | 2.15     | 2.33
A100                    | 0.97     | 0.95     | 0.99     | 1.06     | 1.11     | 1.22
A40                     | 1.00     | 1.00     | 1.00     | 1.00     | 1.00     | 1.00
L40                     | 1.50     | 1.71     | 1.73     | 1.73     | 1.66     | 1.66
L40S                    | 1.46     | 1.67     | 1.76     | 1.87     | 1.96     | 1.99

Now it becomes obvious that the small Systems 1 and 2 and the medium-sized System 3 should not be simulated on the A100, because the performance is lower than on the A40. In contrast, starting at approximately 150,000 atoms, it should be tested for each system whether using the A100 is beneficial. Evidently, newer GPUs in combination with a recent GROMACS version increase performance significantly. However, it must be pointed out that the small systems benefit more from the Ada Lovelace architecture than from Hopper, and it should be rigorously tested whether the performance gain is worth the extra procurement and maintenance cost of Hopper systems; more on that below.

For a more straightforward comparison of the GPUs that is independent of system size, we multiplied the performance by the number of atoms of the corresponding benchmark system. These numbers were then normalized to the A40 results and averaged over the six systems.

([ns/day] * atoms) / ([ns/day] * atoms)_A40 | Average | Standard deviation
H100                                        | 1.61    | 0.26
GH200                                       | 1.86    | 0.34
A100                                        | 1.05    | 0.10
A40                                         | 1.00    | 0.00
L40                                         | 1.67    | 0.09
L40S                                        | 1.79    | 0.20
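Note that the atom count cancels within each system's ratio, so every entry above is simply the mean of the corresponding row of the previous table. For the H100, for example:

\frac{1}{6}\left(1.28 + 1.41 + 1.49 + 1.68 + 1.82 + 1.97\right) \approx 1.61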

What we can see here is that in terms of performance gain, the Hopper and Ada Lovelace generations are so similar that a different metric is needed to decide which hardware should be used for MD simulations.

The additional metric we are going to introduce is performance per hardware cost: the performance numbers were divided by the hardware costs, normalized to the A40, and averaged. Some of the following GPU prices were taken from Deltacomputer in April 2024 and some in June 2024; VAT has been added manually.

([ns/day] / costs) / ([ns/day] / costs)_A40 | Costs      | Average
H100                                        | €30,723.42 | 0.27
GH200                                       | €29,750.00 | 0.32
A100                                        |  €8,644.16 | 0.62
A40                                         |  €5,087.25 | 1.00
L40                                         |  €7,168.56 | 1.18
L40S                                        |  €7,259.00 | 1.25
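Equivalently, each entry is the atom-weighted average performance ratio from above divided by the cost ratio relative to the A40; for the H100:

\frac{1.61}{30{,}723.42\,/\,5{,}087.25} \approx \frac{1.61}{6.04} \approx 0.27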

Evidently, the prices for GPUs from the Hopper generation are several times higher than for the Ada Lovelace GPUs, but the performance gain does not reflect that. This analysis also reveals that the A40, as it is available in our Alex cluster, shows good performance in combination with a comparatively low price tag.

In summary, we have shown that while the Hopper generation looks promising for MD simulations in terms of CUDA cores and performance results, it does not yield higher performance than the GPUs from the Ada Lovelace architecture. Taking procurement costs into account, the L40 and L40S from the Ada Lovelace generation are much more attractive platforms for GROMACS users.


[1] For small systems, we recommend using the NVIDIA MPS server to run multiple instances of the same simulation on one GPU; the setup is similar to our example for multiple walkers. Please contact us for support.
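As a minimal sketch (assuming a node with one GPU and 16 host cores; file names and the core split are illustrative), two instances could share a GPU via MPS like this:

$ nvidia-cuda-mps-control -d            # start the MPS daemon
$ gmx mdrun -s run1.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
      -ntmpi 1 -ntomp 8 -pin on -pinstride 1 -pinoffset 0 &
$ gmx mdrun -s run2.tpr -nb gpu -pme gpu -bonded gpu -update gpu \
      -ntmpi 1 -ntomp 8 -pin on -pinstride 1 -pinoffset 8 &
$ wait
$ echo quit | nvidia-cuda-mps-control   # shut the MPS daemon down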