Updates

Event Date	Summary
	Update Oct. 31 The /home and /scratch conversion on Cedar has been completed and the system is fully available again. Here is a list of changes:
	- The /home filesystem has been migrated to new hardware. /home and /scratch used to reside on the same storage appliance. This caused problems: when the /scratch filesystem responded slowly (usually because of jobs that put a high input/output load on it), the /home filesystem responded slowly as well. In addition, the file servers were shared between /home and /scratch; they often crashed and rebooted under high load, interrupting services on the headnodes. This dependency no longer exists, and everybody should enjoy a much more responsive work environment on the login nodes.
	- New file servers and disk controllers were installed for the /scratch filesystem; the disk subsystem remains the same. This should result in much better performance and stability of the /scratch filesystem: the new file servers have about 4 times as much memory and 4 times as many cores.
	- The scheduling software (Slurm) was upgraded to the latest version. This upgrade was necessary because the version in use before the upgrade was no longer supported and no longer received even security updates. We did encounter bugs with the latest version that led to the loss of jobs, as described in the update from Oct. 29. We encourage all users who lost jobs because of these bugs to contact support@computecanada.ca. We will do our best (e.g., by increasing job priorities) to make up for the loss of productivity.
	Update, Oct. 31 /scratch remains unavailable today. Some hardware for the /scratch upgrades arrived late. All required hardware is now on-site and provisioning is currently taking place. Access to the Cedar cluster is available if you need to suspend jobs that were queued to start on Oct. 31 and require /scratch.
	Update, Oct. 29 After updating the scheduling software (Slurm) we encountered a previously undiscovered bug. This bug caused an incorrect amount of memory to be assigned to jobs: in most cases the amount was far lower than what was requested in the job submission. Consequently, when the scheduler was started and such jobs were scheduled to run, they were terminated almost immediately because they ran out of memory. The error messages for these jobs should contain errors like "out of memory" or references to the OOM-killer. Since discovering this problem on Oct. 28, we have searched for ways to recover the failed jobs and to fix the problem. We were able to fix the problem for all jobs that are still in the queue, including all jobs that were put on hold because the /scratch filesystem was unavailable. However, we did not find a satisfactory method to recover the jobs that failed. At this point we decided to return the system to service. If your jobs failed with out-of-memory errors, these are the options:
	1) Resubmit the jobs. The bug only affects jobs that were submitted before the update. Newly submitted jobs are safe.
	2) Contact support@computecanada.ca and we will be able to provide you with the job scripts of the failed jobs (however, we do not know the name of the job script: the system stores all job scripts with the name "script"). In cases where all submission parameters were specified in the job script and no command-line parameters were used in the job submission, we may even be able to submit the jobs for you. Please mention in your email whether you would like us to pursue this option.
	3) If you are not sure how to proceed, please email support@computecanada.ca.
	In any case, we offer to increase the priority of such resubmitted jobs. Please resubmit the jobs first and then contact support@computecanada.ca with a list of the new jobids.
We sincerely apologize for the interruption, loss of productivity and extra work this problem is causing. The migration of the /home filesystem has been completed successfully; the reconfiguration of the /scratch filesystem continues.
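
To work out which of your jobs were hit by the memory bug before resubmitting, one possibility is to search the job output files for the out-of-memory messages mentioned above. The sketch below is only an illustration: it assumes your jobs wrote to Slurm's default output files (slurm-<jobid>.out) in a single directory, and the function name find_oom_jobs is made up for this example. Jobs that used a custom output file name will not be found this way.

```shell
#!/bin/bash
# Print the job IDs of Slurm output files that mention an out-of-memory kill.
# Assumption: output files use the default name slurm-<jobid>.out and live
# in the directory given as the first argument (default: current directory).
find_oom_jobs() {
    local dir="${1:-.}"
    # -l lists matching files, -i ignores case; the pattern covers
    # "out of memory", "out-of-memory", "oom-kill" and "oom-killer".
    grep -l -i -E 'out[ -]of[ -]memory|oom[-_ ]kill' "$dir"/slurm-*.out 2>/dev/null \
        | sed -E 's#.*/slurm-([0-9_]+)\.out$#\1#' \
        | sort -n
}
```

Running find_oom_jobs in the directory you submitted from prints one job ID per line; these are the old job IDs, so after resubmitting you would still need to send the new jobids to support.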

Incident description

System	Incident status	Start Date	End Date
Cedar	Closed

Created by Alliance Staff

Title

Cedar Outage, Oct. 28

Summary

Update Oct. 28, 19:30:

The migration of the /home filesystem has been completed successfully. We also upgraded the scheduling software (we were running an unsupported version). However, we are running into problems with the new version. We have submitted a bug report and are waiting for a resolution. Until then, scheduling of jobs has been paused and we cannot allow submission of new jobs. For that reason Cedar remains unavailable for now. We will update this note as soon as we know more about handling this problem.

Cedar will be unavailable on Monday, October 28, because we are replacing the hardware for the /home filesystem to improve performance and reliability. All jobs still running the morning of October 28 will be terminated.

Cedar will become available again on October 29, but without the /scratch filesystem. On October 29-30 the architecture of the /scratch filesystem will be changed; therefore, any jobs that need to read data from or write data to /scratch during this time will fail. Please see below for more details on this outage, including how to put jobs on hold.

Details about the outage:

1) The /home filesystem and /scratch filesystem currently reside on the same storage appliance. This is problematic because when the /scratch filesystem is overloaded (typically due to jobs that put a high input/output load on the filesystem) the /home filesystem is negatively impacted as well, resulting in slow response times on the headnodes and an unsatisfactory work environment. On October 28, all data stored in the /home filesystem will be transferred to a different storage system so that /home becomes independent of /scratch. This should improve performance of the /home filesystem significantly. We expect this final transfer of data to take the entire day of October 28. Please do not make massive changes/additions to your data stored in the /home filesystem before the outage in order to minimize the time needed for this final synchronization.

2) On October 29-30, Cedar will be available. However, the /scratch filesystem will not be mounted and data stored in /scratch will not be accessible. During these two days, we will replace the controller and storage servers for the /scratch filesystem to improve performance of that filesystem and prepare it for the upcoming expansion of Cedar. Therefore, any jobs that need to read data from or write data to /scratch during this time will fail. Such jobs that are still in the queue before the outage should be placed on hold to prevent them from running on October 29-30.

To place a job on hold, issue the following command on the headnode: scontrol hold jobid=<list of jobids>

where <list of jobids> is a comma-separated list of jobids (no spaces). When the system is available again, these holds can be released with the command "scontrol release <list of jobids>".

3) Loss of data: we do not expect any loss of data. We have already created a copy of all data in /home, which will be updated repeatedly until Oct. 28. On Oct. 28 the remaining changes will be copied once more. Furthermore, all data in /home are backed up daily. Data in /scratch are not backed up; they are meant to be stored temporarily. Nevertheless, our supporting vendor has informed us that all data in /scratch will be preserved in the conversion process.
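
For users with many queued jobs, the hold and release commands from item 2 can be assembled with a small helper. This is a minimal sketch, not an official tool: the function names join_jobids and scontrol_cmd are made up for this example, and the functions only print the command so you can check it before running it yourself.

```shell
#!/bin/bash
# Join job IDs (one per line on stdin) into a comma-separated list, no spaces.
join_jobids() {
    paste -sd, -
}

# Print (do not run) the hold or release command in the form given above.
scontrol_cmd() {
    local action="$1" ids="$2"
    if [ -z "$ids" ]; then
        echo "no job IDs given" >&2
        return 1
    fi
    case "$action" in
        hold)    printf 'scontrol hold jobid=%s\n' "$ids" ;;
        release) printf 'scontrol release %s\n' "$ids" ;;
        *)       echo "unknown action: $action" >&2; return 1 ;;
    esac
}

# Example (not run here). Assumption: this squeue call lists the IDs of
# all of your pending jobs, which may be more than the jobs that need /scratch.
#   ids=$(squeue -u "$USER" -t PENDING -h -o %i | join_jobids)
#   scontrol_cmd hold "$ids"
```

Keep the printed list of jobids somewhere safe: the same list is needed for the release command once /scratch is available again.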

We strongly encourage all users to move essential data out of the /scratch filesystem to their /project directories. This should be a regular practice; /scratch is not the place to store data that cannot be easily reproduced.
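
As an illustration of copying results out of /scratch, the sketch below copies a directory tree to a destination you choose. The function name copy_to_project and the example paths are hypothetical; the actual location of your /project directory depends on your account. The sketch uses rsync when it is installed and falls back to cp otherwise, and it leaves the source in place so that you can verify the copy before deleting anything.

```shell
#!/bin/bash
# Copy the contents of a source directory into a destination directory.
# The source is left untouched; remove it only after checking the copy.
copy_to_project() {
    local src="$1" dest="$2"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -r recursive, -l keep symlinks, -t keep modification times.
        # Ownership and permissions are deliberately not copied, so files
        # take the defaults of the destination directory.
        rsync -rlt "$src"/ "$dest"/
    else
        cp -R "$src"/. "$dest"/
    fi
}

# Example (hypothetical paths, not run here):
#   copy_to_project /scratch/$USER/results /project/<your project>/results
```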

If there are additional questions, please email support@computecanada.ca.

Cedar unavailable on Monday, October 28

The Cedar cluster will not be available on October 28; on that day we will replace the hardware for the /home filesystem to improve performance and reliability. Jobs still running on the morning of October 28 will be cancelled.

Cedar will be available again the next day, October 29, except for the /scratch filesystem. On October 29 and 30 we will make changes to that filesystem; jobs that need to read from or write to /scratch on October 29 and 30 will be stopped. See below for how to place jobs on hold.

Information about the outage

1) The /home and /scratch filesystems currently use the same hardware, so when /scratch is overloaded (typically because of jobs with a high volume of I/O operations), /home is affected too, which slows the response time of the headnode and degrades the work environment. On October 28 we will separate the two filesystems by transferring the /home data to different hardware; this work should take a full day. Please avoid making large additions or changes to your data stored in /home before this outage, to minimize the time needed for the final synchronization.

2) Cedar will be available on October 29 and 30, but the /scratch filesystem will not be mounted and its data will not be accessible. During that time we will replace the controller and storage servers for /scratch to improve its performance and prepare for the upcoming expansion of Cedar. Jobs that need to read from or write to /scratch on October 29 and 30 will be stopped. We recommend placing on hold any jobs that use /scratch and are waiting in the queue on October 28, so that they are not started on October 29 or 30.

Placing jobs on hold: on the headnode, run scontrol hold jobid=<list of jobids> where <list of jobids> is the list of job IDs separated by commas, with no spaces. When Cedar becomes available again, release the jobs with scontrol release <list of jobids>

3) We do not expect any loss of data. All data in /home have been copied, and the copy is updated daily. The changes will be copied once more on October 28. Data in /scratch are not backed up, since it is a temporary storage space. However, our vendor has informed us that the data will not be affected by the conversion.

We strongly recommend moving important data from your /scratch space to your /project space. This should be regular practice, since /scratch is not a place to store data that cannot easily be reproduced.

For any information regarding this outage, write to support@calculcanada.ca.