Parallel file system outage in Fritz cluster

Since 20:10 on January 30, the parallel Lustre file system ($FASTTMP) on the Fritz cluster has been unavailable due to repeated server crashes. The issue is being investigated.

Update 2025-01-31 17:00: The cause turned out to be a faulty component in the InfiniBand network of Fritz; the Lustre servers were merely the first to react badly to it.

The faulty component is no longer in service, and the system has been stable since around 15:00 today. We consider the issue resolved. Regular batch processing has resumed.

Jobs that started before 15:00 today may have experienced a range of problems, including but not limited to unavailability of the parallel file system $FASTTMP (under /lustre), broken InfiniBand connections, and resulting MPI failures.

We apologize for the inconvenience.