Scheduled downtime of NHR@FAU HPC systems, August 13–15 – FINISHED
There will be a scheduled downtime of all the HPC systems of NHR@FAU starting Tuesday, August 13 at 12:00 p.m. and lasting until Thursday, August 15 at 12:00 p.m.
As usual, jobs that would collide with the downtime will be postponed until it is over.
There will be work on most of the file servers, so the cluster frontends and /home/* will be unavailable most of the time.
UPDATE 2024-08-14 @ 17:45 – Maintenance work on all central file servers has been completed.
Some work remains, but batch processing will resume soon on those compute clusters where maintenance has already finished.
Work done:
- major OS update on the servers for /home/{atuin,janus,saturn,titan}, hopefully allowing better detection of evil usage patterns
- major OS and GPFS updates on the servers for /home/{hpc,vault}
- additional HDDs added to /home/atuin for more capacity
- file system performance optimizations, in particular for /home/atuin
- removal of CEPH leftovers on Alex; extension of the workspace filesystem /anvme on Alex (a quick capacity check is sketched below)
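The capacity extensions can be verified with a quick look at the mounted sizes, for example on an Alex frontend, assuming both paths are mounted there (a minimal sketch using standard tools; the paths are taken from the list above):
  $ df -h /home/atuin /anvme   # shows total size and free space of the extended filesystems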
UPDATE 2024-08-14 @ 18:30 – Regular batch processing on Fritz has been resumed
Work done specific to Fritz:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changes to the Lustre mount parameters
- the workspace filesystem /anvme from Alex is now also available on Fritz (see the quick check below)
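A short way to verify both changes from a Fritz frontend, as a sketch using standard Slurm and coreutils commands (no Fritz-specific options assumed):
  $ sinfo --version   # should now report a 24.05.x release
  $ df -h /anvme      # the Alex workspace filesystem should appear as mounted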
UPDATE 2024-08-14 @ 19:00 – Regular batch processing on Meggie has been resumed
Work done specific to Meggie:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changed the Slurm configuration to always bill full nodes; the slurm.conf parameter OverSubscribe was changed from NO to EXCLUSIVE, now identical to Fritz (see the sketch below)
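For illustration, the change corresponds roughly to the following kind of slurm.conf partition line; the partition name and node list are placeholders and not taken from this announcement, only the OverSubscribe value reflects the actual change:
  PartitionName=<partition> Nodes=<nodelist> OverSubscribe=EXCLUSIVE   # previously OverSubscribe=NO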
UPDATE 2024-08-15 @ 09:00 – Regular batch processing on Woody has been resumed
Work done specific to Woody:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
UPDATE 2024-08-15 @ 09:10 – Regular batch processing on TinyFAT has been resumed
Work specific to TinyFAT:
- all nodes have been reinstalled with minor OS updates
- Slurm has been updated from 23.02.x to 24.05.x
UPDATE 2024-08-15 @ 09:20 – Regular batch processing on TinyGPU has been resumed
Work specific to TinyGPU:
- all nodes have been reinstalled with minor OS updates
- Slurm has been updated from 23.02.x to 24.05.x
- update of the Nvidia drivers to 560-legacy
- plain sbatch / salloc will no longer work; you have to use the long-documented variants sbatch.tinygpu / salloc.tinygpu on the TinyGPU frontend (host tinyx)
- interactive sessions can no longer be started using srun ... --pty /bin/bash -l; use salloc.tinygpu instead (see the examples below)
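A minimal sketch of submission on the TinyGPU frontend under these changes; job.sh and the single-GPU request are assumptions for illustration, not taken from this announcement:
  $ sbatch.tinygpu job.sh          # batch submission via the documented wrapper
  $ salloc.tinygpu --gres=gpu:1    # interactive session instead of srun ... --pty /bin/bash -l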
UPDATE 2024-08-16 – the previous Slurm behavior on TinyGPU has been restored for convenience
- We restored the old behavior of sbatch / salloc defaulting to sbatch.tinygpu / salloc.tinygpu. However, the explicit names (as documented) are still recommended (see the example below).
- The srun command from above also works again. However, salloc.tinygpu is the recommended way to start interactive sessions.
- The message "srun: error: gres_job_state_unpack: no plugin configured to unpack data type 7696487 from job 876082. This is likely due to a difference in the GresTypes configured in slurm.conf on different cluster nodes." has been fixed.
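With the old defaults restored, the following two calls behave the same on the tinyx frontend; job.sh is a placeholder script used only for illustration:
  $ sbatch job.sh            # defaults to the TinyGPU behavior again
  $ sbatch.tinygpu job.sh    # explicit, documented variant, still recommended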
UPDATE 2024-08-15 @ 10:00 – Regular batch processing on Alex has been resumed
Work specific to Alex:
- minor OS updates have been installed
- Slurm has been updated from 23.02.x to 24.05.x
- changes to the Lustre mount parameters
- update of the Nvidia drivers to 560-open (a quick check is sketched below)
- increase of the workspace filesystem /anvme
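As a sketch, the installed driver can be checked from within a job or an interactive session on an Alex node; nvidia-smi is assumed to be in the default path:
  $ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # prints the driver version per GPU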
UPDATE 2024-08-16 – Tier3-Jupyterhub (via hub.hpc.fau.de)
- The server providing hub.hpc.fau.de had to be reinstalled. All services, in particular the Tier3-Jupyterhub, should be available again since the evening of 2024-08-15.
In case of questions or problems, please contact hpc-support@fau.de.