Scheduled downtime of all our HPC systems from Sunday, June 08 until at least the evening of Tuesday, June 10


There will be a scheduled downtime of all the HPC systems of NHR@FAU
starting on Sunday, June 08 (Pentecost), at 6:00,
and lasting at least until the evening of Tuesday, June 10.

The main reason for the unusually long downtime is part 1 of the reconfiguration of /home/atuin (which hosts $WORK for NHR and BayernKI projects), in the hope of fixing at least some of its longstanding problems. In addition, there will be some general maintenance work on several of our other systems on Tuesday.

For most users, jobs that would collide with the downtime will automatically be postponed until after the downtime. Frontends and fileservers (except /home/atuin) will be available on Sunday and Monday, but there will be some interruptions on Tuesday. /home/atuin will be unavailable Sunday through at least Tuesday.

Unfortunately, there are a handful of groups (fewer than 20) on /home/atuin who use the filesystem so badly that copying their data will simply not be feasible by Tuesday. These users will be notified separately, and their jobs will have to be cancelled and resubmitted manually once the copy of their files finishes, because we feel it would be unfair to let >1000 users wait for <10 filesystem “misusers”.

We will keep this post updated.

  • 2025-06-10 @ 18:10 – batch processing resumed on TinyGPU & TinyFAT
  • 2025-06-10 @ 18:20 – access to $WORK for groups b180dc and b165da will be delayed!
  • 2025-06-10 @ 18:20 – 17 NHR groups (b105dc, b109dc, b114cb, b118bb, b127dc, b129dc, b133ae, b136dc, b143dc, b146dc, b158cb, b162dc, b171dc, b188dc, b196ac, k101ee, k103bf) will have individual “downtimes” in the coming days/weeks to finish the migration of their $WORK
  • 2025-06-10 @ 18:35 – batch processing resumed on Alex
  • 2025-06-10 @ 18:40 – batch processing resumed on Fritz
  • 2025-06-10 @ 18:50 – batch processing resumed on Woody; some early jobs running on w25xx may have been aborted – sorry for that
  • 2025-06-10 @ 19:35 – batch processing resumed on Meggie
  • 2025-06-10 @ 19:40 – access to $WORK for group b180dc has been enabled
  • 2025-06-10 @ 20:00 – the login node tinyx just crashed; back online since 21:10
  • 2025-06-11 @ 07:10 – further issues with TinyGPU have been solved
  • 2025-06-11 @ 20:05 – access to $WORK for group b165da has been enabled
  • 2025-06-12 @ 08:00 – some accidentally “closed” $WORK directories have been “opened” and now have the original permissions
  • 2025-06-12 @ 10:00 – $WORK directories for new accounts are now created automatically again