Updates


Event Date Summary
The repair of the /scratch filesystem did not proceed as planned:
1) After the planned repair was completed, none of the compute nodes reconnected to the /scratch filesystem. At this point we had to choose between rebooting all nodes (killing all running jobs in the process) and backing out the emergency fix.
2) We chose to revert the change, but this created an even bigger problem: creating files in /scratch failed. This was discovered only after the scheduler had been restarted. Consequently, a large number of jobs that started at this point and attempted to use /scratch failed. These jobs need to be resubmitted.
3) After an unsuccessful attempt to fix this problem, the original repair was reimplemented. We managed to remount the /scratch filesystem without rebooting the compute nodes. The chosen procedure allows jobs that do not use /scratch to continue; it may even allow some jobs that use /scratch to continue. Other jobs that use /scratch may hang. Please check your jobs. We will run checks as well to identify jobs that are not progressing.
4) We expect that /scratch is now working normally. We encourage anybody who runs into /scratch-related problems to report them to support@computecanada.ca.
The /scratch filesystem will be unavailable for about one hour on Jan. 14, starting at 14:00, to resolve the problem with moving files from one directory to another within /scratch. Processes (including running jobs) that access the /scratch filesystem during that time will hang. All processes are expected to recover once the repair has been completed. The scheduler will be paused during the repair to prevent jobs from starting and then failing because /scratch is inaccessible. We apologize for the inconvenient timing of the repair, but we depend on the availability of personnel from the supporting vendor to carry it out.
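
The two symptoms described in these updates (file creation failing, and files that cannot be moved between directories) can be checked from a user account before resubmitting jobs. The following is only a minimal sketch, not an official support tool: the function name `check_create_and_move` is hypothetical, and the path `/scratch/$USER` is an assumption about where your scratch directory lives. Note that if the filesystem is hanging rather than returning errors, this script will hang as well.

```python
import os
import tempfile


def check_create_and_move(base_dir):
    """Try to create a file under base_dir and move it to a sibling directory.

    Returns (ok, message). All probe files are removed afterwards.
    """
    try:
        # The temporary working directory is deleted when the block exits.
        with tempfile.TemporaryDirectory(dir=base_dir, prefix="fscheck_") as work:
            src_dir = os.path.join(work, "a")
            dst_dir = os.path.join(work, "b")
            os.mkdir(src_dir)
            os.mkdir(dst_dir)

            src = os.path.join(src_dir, "probe.txt")
            dst = os.path.join(dst_dir, "probe.txt")

            # Step 1: file creation
            with open(src, "w") as fh:
                fh.write("probe\n")

            # Step 2: move from one directory to another
            os.rename(src, dst)

            if os.path.exists(src) or not os.path.exists(dst):
                return False, "move did not take effect"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "create and move succeeded"


if __name__ == "__main__":
    # Assumption: your scratch directory is /scratch/$USER.
    target = os.path.join("/scratch", os.environ.get("USER", ""))
    ok, msg = check_create_and_move(target)
    print(target, "OK" if ok else "FAILED", msg)
```

If the check reports a failure, include the printed message when contacting support@computecanada.ca.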

Incident description

System Incident status Start Date End Date
Cedar Closed
Created by Ali Kerrache on

Title


/scratch filesystem problem


Summary


After the recent upgrade of the /scratch filesystem, some files cannot be moved. The files are still present, but at the moment they cannot be moved from one location to another. In addition, following further work on the filesystem, creating new files is no longer possible. We are working to resolve the issue as quickly as possible.


Updated by Martin Siegert on