Scheduled downtime of NHR@FAU HPC systems, August 13–15 – FINISHED
There will be a scheduled downtime of all the HPC systems of NHR@FAU starting Tuesday, August 13 at 12:00 p.m. and lasting until Thursday, August 15 at 12:00 p.m.
As usual, jobs that would collide with the downtime will be postponed until it is over.
There will be work on most of the file servers, so the cluster frontends and /home/* will be unavailable most of the time.
UPDATE 2024-08-14 @ 17:45 – Maintenance work on all central file servers has been completed.
Some work remains, but batch processing will resume soon on those compute clusters where maintenance has already finished.
Work done:
- major OS update on the servers for /home/{atuin,janus,saturn,titan}, hopefully allowing better detection of evil usage patterns
- major OS and GPFS updates on the servers for /home/{hpc,vault}
- additional HDDs added to /home/atuin for more capacity
- file system performance optimizations, in particular for /home/atuin
- removal of CEPH leftovers on Alex; extension of the workspace filesystem /anvme on Alex (a quick capacity check is sketched below)
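The capacity extensions can be verified with a quick look at the mounted sizes, for example on an Alex frontend, assuming both paths are mounted there (a minimal sketch using standard tools; the paths are taken from the list above):
  $ df -h /home/atuin /anvme   # shows total size and free space of the extended filesystems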
UPDATE 2024-08-14 @ 18:30 – Regular batch processing on Fritz has been resumed
Work done specific to Fritz:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changes to the Lustre mount parameters
- the workspace filesystem /anvme from Alex is now also available on Fritz (see the quick check below)
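A short way to verify both changes from a Fritz frontend, as a sketch using standard Slurm and coreutils commands (no Fritz-specific options assumed):
  $ sinfo --version   # should now report a 24.05.x release
  $ df -h /anvme      # the Alex workspace filesystem should appear as mounted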
UPDATE 2024-08-14 @ 19:00 – Regular batch processing on Meggie has been resumed
Work done specific to Meggie:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changed the Slurm configuration to always bill full nodes; the slurm.conf parameter OverSubscribe was changed from NO to EXCLUSIVE, now identical to Fritz (see the sketch below)
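For illustration, the change corresponds roughly to the following kind of slurm.conf partition line; the partition name and node list are placeholders and not taken from this announcement, only the OverSubscribe value reflects the actual change:
  PartitionName=<partition> Nodes=<nodelist> OverSubscribe=EXCLUSIVE   # previously OverSubscribe=NO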
UPDATE 2024-08-15 @ 09:00 – Regular batch processing on Woody has been resumed
Work done specific to Woody:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
UPDATE 2024-08-15 @ 09:10 – Regular batch processing on TinyFAT has been resumed
Work specific to TinyFAT:
- all nodes have been reinstalled with minor OS updates
- Slurm has been updated from 23.02.x to 24.05.x
UPDATE 2024-08-15 @ 09:20 – Regular batch processing on TinyGPU has been resumed
Work specific to TinyGPU:
- all nodes have been reinstalled with minor OS updates
- Slurm has been updated from 23.02.x to 24.05.x
- update of the Nvidia drivers to 560-legacy
- plain sbatch / salloc will no longer work; you have to use the long-documented variants sbatch.tinygpu / salloc.tinygpu on the TinyGPU frontend (host tinyx)
- interactive sessions can no longer be started using srun ... --pty /bin/bash -l; use salloc.tinygpu instead (see the examples below)
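A minimal sketch of submission on the TinyGPU frontend under these changes; job.sh and the single-GPU request are assumptions for illustration, not taken from this announcement:
  $ sbatch.tinygpu job.sh          # batch submission via the documented wrapper
  $ salloc.tinygpu --gres=gpu:1    # interactive session instead of srun ... --pty /bin/bash -l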
UPDATE 2024-08-16 – the previous Slurm behavior on TinyGPU has been restored for convenience
- We restored the old behavior of sbatch / salloc defaulting to sbatch.tinygpu / salloc.tinygpu. However, the explicit names (as documented) are still recommended (see the example below).
- The srun command from above also works again. However, salloc.tinygpu is the recommended way to start interactive sessions.
- The message "srun: error: gres_job_state_unpack: no plugin configured to unpack data type 7696487 from job 876082. This is likely due to a difference in the GresTypes configured in slurm.conf on different cluster nodes." has been fixed.
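With the old defaults restored, the following two calls behave the same on the tinyx frontend; job.sh is a placeholder script used only for illustration:
  $ sbatch job.sh            # defaults to the TinyGPU behavior again
  $ sbatch.tinygpu job.sh    # explicit, documented variant, still recommended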
UPDATE 2024-08-15 @ 10:00 – Regular batch processing on Alex has been resumed
Work specific to Alex:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changes to the Lustre mount parameters
- update of the Nvidia drivers to 560-open (a quick check is sketched below)
- increase of the workspace filesystem /anvme
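As a sketch, the installed driver can be checked from within a job or an interactive session on an Alex node; nvidia-smi is assumed to be in the default path:
  $ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # prints the driver version per GPU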
UPDATE 2024-08-16 – Tier3-Jupyterhub (via hub.hpc.fau.de)
- The server providing hub.hpc.fau.de had to be reinstalled. All services, in particular the Tier3-Jupyterhub, should be available again since the evening of 2024-08-15.
In case of questions or problems, please contact hpc-support@fau.de.