Upgrade of Lustre servers to solve the last weeks' problems 2019-07-(01-05) (clusters now UP again)

  • Posted on: 27 June 2019
  • By: torkel

Over the last two weeks we have had serious problems with PFS, the parallel file system. The cause of the problems was identified fairly quickly. However, all attempts to get a temporary fix in place over the summer have failed.

We have therefore, in consultation with the vendor of the storage solution, decided to update the server software starting on the morning of July 1. The update was originally planned for early autumn and contains a permanent fix for the problems we have seen.

The update is expected to take the whole week.

Accessing the data on the PFS filesystem will not be possible during the upgrade. Make sure that any data you need during the upgrade is saved somewhere else.

All systems, Kebnekaise and Abisko, including the login nodes, will be unavailable during the upgrade.

Note: There is no backup of the PFS filesystem. The data stored on PFS should not be affected by the upgrade, but there are always risks when doing major upgrades. Make sure that you have backups of important data stored on the PFS. 

 

*UPDATE 20190704 16:10*

The upgrade is now done and everything has been verified to work.

Login nodes have been opened and jobs are running again.

Updated: 2021-11-11, 13:50