Unplanned cluster issues due to switchboard failure

  • Posted on: 15 March 2019
  • By: zao

2019-03-29:

Normal power routing restored to all nodes.

 

2019-03-28:

Repairs completed, switchboard powered up.

 

2019-03-25:

Due to component delivery delays, the final repair steps have been postponed. The new completion date is Wednesday 2019-03-27.

 

2019-03-20:

Replacement parts and cables are en route. We currently estimate installation and recertification of the switchboard to be finished by the end of Monday 2019-03-25.

 

2019-03-15:

Due to a switchboard failure, parts of the HPC2N compute clusters are without power.

Jobs on Kebnekaise are affected by nodes throttling (slowing down) due to insufficient power. We are working on excluding those nodes so that new jobs won't start on them.

All running jobs on Abisko were likely affected, as the central InfiniBand switches were connected to the failed switchboard. Power has been rerouted, and new jobs should start properly once everything has recovered.

Both systems will have somewhat reduced capacity (approximately 17% of nodes down) until the switchboard can be repaired. We may need a complete system downtime for electricians to carry out the repair; updates will follow.

 

Updated: 2021-11-11, 13:50