Improved performance of a GPU-accelerated Bayesian inference framework (USI, Switzerland)

Background

During scalability tests of a C++ framework performing spatial-temporal Bayesian modeling, an unexpected slowdown was observed for increasing numbers of GPUs/MPI processes. The framework is based on a methodology called integrated nested Laplace approximations (INLA) [1], which offers computationally efficient and reliable solutions to Bayesian inference problems. INLA is applicable to a subclass of Bayesian hierarchical additive models, and our framework is particularly tailored to data with spatial-temporal association. A large part of the algorithm consists of solving an optimization problem that, in every iteration, requires independent and thus parallelizable function evaluations. Due to this inherent parallelism, we expected (almost) ideal strong scaling up to the number of necessary function evaluations per iteration. The computational kernel of each function evaluation is a Cholesky factorization of a large block-tridiagonal symmetric positive definite matrix, followed by a forward-backward solve. The block-wise factorization of each matrix is implemented on the GPU and requires large memory transfers between GPU and main memory for each supernode of the matrix.
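The block recursion behind this kernel is simple to state: for a block-tridiagonal SPD matrix with diagonal blocks D_i and subdiagonal blocks B_i, the Cholesky factor is block lower bidiagonal with L_1 L_1^T = D_1, M_i = B_i L_i^{-T}, and L_{i+1} L_{i+1}^T = D_{i+1} - M_i M_i^T. The snippet below is a minimal CPU-side sketch of this recursion using Eigen dense blocks; it is illustrative only (function and variable names are assumptions, not the framework's API), whereas the actual framework executes these block operations on the GPU, which is what causes the GPU-host memory traffic mentioned above.

```cpp
// Minimal sketch of a block-tridiagonal Cholesky factorization (CPU, Eigen).
// D[i]: diagonal blocks, B[i]: subdiagonal blocks coupling block i and i+1.
// Output: L[i] (diagonal Cholesky blocks), M[i] (subdiagonal factor blocks).
#include <Eigen/Dense>
#include <vector>

using Mat = Eigen::MatrixXd;

void blockTridiagCholesky(const std::vector<Mat>& D, const std::vector<Mat>& B,
                          std::vector<Mat>& L, std::vector<Mat>& M) {
  const std::size_t n = D.size();
  L.resize(n);
  M.resize(n > 1 ? n - 1 : 0);
  Mat S = D[0];                                   // running Schur complement
  for (std::size_t i = 0; i < n; ++i) {
    L[i] = S.llt().matrixL();                     // dense Cholesky of current block
    if (i + 1 < n) {
      // M_i = B_i L_i^{-T}: solve L_i X = B_i^T, then transpose.
      M[i] = L[i].triangularView<Eigen::Lower>()
                 .solve(B[i].transpose())
                 .transpose();
      S = D[i + 1] - M[i] * M[i].transpose();     // Schur complement update
    }
  }
}
```

The subsequent forward-backward solve traverses the same block structure once forward and once backward, which is why each supernode has to be moved between GPU and main memory again.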

2D projection of a spatial random field over the globe at a fixed time point

Black circles: observations over time from a randomly sampled station. Colored lines: quantiles of the fitted linear predictor

Analysis

Some of the MPI processes exhibited much longer runtimes for comparable tasks, while others seemed to be unaffected. Moreover, although all operations cause the same workload, the performance varies between kernel invocations.

Runtimes of the Cholesky factorization of the precision matrix of the conditional latent parameters. Three out of nine MPI processes are shown, each utilizing one GPU.

The CPU and GPU affinity was examined for all MPI processes. The runtime differences grew with the number of MPI processes, and thus of GPUs performing memory-intensive operations. The application used the default affinity of the underlying MPI implementation.
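Such mismatches become visible when every rank reports which CPUs it is allowed to run on and which GPU it uses. The snippet below is a minimal diagnostic sketch of this kind of check (not taken from the framework); it assumes one GPU per MPI process and uses only standard MPI, Linux, and CUDA runtime calls.

```cpp
// Print, for every MPI rank, the CPUs it may run on and the CUDA device it uses.
#include <mpi.h>
#include <sched.h>        // sched_getaffinity, CPU_* macros
#include <unistd.h>       // gethostname
#include <cuda_runtime.h>
#include <cstdio>
#include <string>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char host[256] = {0};
  gethostname(host, sizeof(host) - 1);

  // CPU affinity mask of the calling process.
  cpu_set_t mask;
  CPU_ZERO(&mask);
  sched_getaffinity(0, sizeof(mask), &mask);
  std::string cpus;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
    if (CPU_ISSET(cpu, &mask)) cpus += std::to_string(cpu) + " ";

  // CUDA device currently selected by this rank (assumes one GPU per rank).
  int dev = -1;
  cudaGetDevice(&dev);

  std::printf("rank %d on %s: device %d, allowed CPUs: %s\n",
              rank, host, dev, cpus.c_str());

  MPI_Finalize();
  return 0;
}
```

Comparing this per-rank output against the node topology (e.g. as reported by numactl --hardware) shows which ranks ended up with CPUs in a NUMA domain far from their GPU.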

Optimization

System Topology of an Alex compute node in NPS4 mode

After studying the architecture of the GPGPU nodes on Alex, it was possible to identify an affinity setup that significantly improves the performance of the implementation. The key step was to pin the MPI processes such that the hardware threads they use reside in the same NUMA domain as their assigned GPU and are thus optimally connected to it. Additionally, it was ensured that the memory-intensive operations were distributed evenly across the different NUMA domains.
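Such a mapping can also be enforced from within the application. The sketch below is a hypothetical illustration (not the framework's actual code): it derives a node-local rank via MPI, selects a GPU accordingly, and pins the process to an assumed CPU range belonging to the matching NUMA domain; the concrete rank-to-GPU and rank-to-CPU mapping is an assumption here and must be adapted to the real node topology. In practice, the same effect can also be achieved through the binding options of the batch system or MPI launcher.

```cpp
// Hypothetical pinning sketch: one GPU per MPI process, CPUs chosen per local rank.
#include <mpi.h>
#include <sched.h>        // cpu_set_t, sched_setaffinity
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Node-local rank via a shared-memory sub-communicator (MPI-3).
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank = 0;
  MPI_Comm_rank(node_comm, &local_rank);

  // Assumption: GPU i is closest to CPUs [i*16, (i+1)*16); adapt this to the
  // actual NUMA/GPU topology of the node (e.g. as reported by hwloc).
  const int cpus_per_domain = 16;
  cudaSetDevice(local_rank);

  cpu_set_t mask;
  CPU_ZERO(&mask);
  for (int cpu = local_rank * cpus_per_domain;
       cpu < (local_rank + 1) * cpus_per_domain; ++cpu)
    CPU_SET(cpu, &mask);
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
    std::perror("sched_setaffinity");

  // ... run the GPU-accelerated solver here ...

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```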

Runtimes of the Cholesky factorization of the precision matrix of the conditional latent parameters after affinity optimization. Three out of nine MPI processes are shown, each utilizing one GPU.

Summary

The performance of both the single-process and the multi-process version was significantly improved by implementing a customized affinity pattern. The pattern is chosen such that the memory bandwidth between the assigned GPU and CPU cores is maximized for each MPI process, while load balancing between the NUMA domains is also ensured.

The x-axis shows the number of GPUs and the y-axis the total runtime in seconds. The red crosses indicate examples of runtimes before the pinning, while the blue circles show the runtimes after the pinning. The black dashed line indicates ideal scaling, taking the pinned single-GPU version as baseline.

The x-axis shows the number of GPUs and the y-axis the speedup over the pinned single-GPU version. The red crosses indicate examples of the speedup observed before improving the implementation, while the blue circles show the speedup after the pinning. The black dashed line indicates ideal scaling, taking the pinned single-GPU version as baseline.

The NUMA and GPU topology of an Alex node (in NPS4 mode) is non-optimal in that two GPUs are connected to one NUMA domain while another is left "empty" (no GPU attached). For jobs scheduled on the GPU whose assigned hardware threads are located in the empty domain, access to system memory will be slower than for the other GPU in the pair.

References