Updates


Event Date / Summary

We had a major failure of SAN infrastructure (backups, restores, nearline), and this stopped the /project recovery process.

The SAN has been rebuilt and is operating, and backups are succeeding again.

The recovery of the 300TB of data across more than 15,000,000 files continues. The restoration was slowed by other, unrelated hardware issues that have now been resolved. To date, over 10,000,000 files have been recovered and over 4,000 impacted users have had their files fully restored. Unfortunately, ~6% of Graham users are still affected. We continue to prioritize the recovery of this data and expect to have everything restored in the coming weeks. Thank you for your patience.

The data restore from the backup system continues to progress. So far over 8.5 million files have been successfully restored from tape. We encountered an issue with our backup solution and engaged the vendor for resolution.

The data restore from the backup system is proceeding well. So far over 5 million files have been successfully restored from tape. At the current rate, all files will be restored by early February. Files will start appearing back in their original locations within the project space this week. We are restoring the files in the most efficient way possible: files within a specific project may be restored in several batches rather than all at once because of how they are distributed across the backup tapes. Please watch for an email with more details.
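
For illustration only, here is a minimal sketch (in Python, with entirely hypothetical file and tape names) of why one project's files can come back in several batches: restore requests are grouped by the tape that holds each file, so each tape is mounted once rather than once per file.

# Minimal sketch, not the actual restore tooling: group pending restores by
# the tape each file copy lives on, so every tape is mounted only once.
# The file-to-tape mapping below is entirely hypothetical.
from collections import defaultdict

# (project, file path, tape id) -- placeholder catalogue entries
pending = [
    ("proj-a", "/project/proj-a/data/run01.nc", "TAPE0042"),
    ("proj-b", "/project/proj-b/results.tar",   "TAPE0042"),
    ("proj-a", "/project/proj-a/data/run02.nc", "TAPE0107"),
    ("proj-a", "/project/proj-a/notes.txt",     "TAPE0311"),
]

batches = defaultdict(list)
for project, path, tape in pending:
    batches[tape].append((project, path))

# Each tape becomes one restore batch, so a single project's files
# (proj-a here) come back in several batches, one per tape.
for tape, files in sorted(batches.items()):
    print(f"{tape}: restore {len(files)} file(s)")
    for project, path in files:
        print(f"  [{project}] {path}")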

We are continuing to recover data from the failed project array. If you attempt to access an affected file, you will get the error "Cannot send after transport endpoint shutdown". If you try to view any directory containing an affected file with a graphical tool, it will hang. More details will be emailed out next week.
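
If you want to check which of your files are affected without a graphical tool, a rough sketch along the following lines (a hypothetical helper, not an official tool) simply tries to read one byte of each file under a directory and records those that fail with an OS-level I/O error. Note that, depending on client state, access to an affected file may block rather than fail quickly.

# Rough sketch (not an official tool): walk a directory and record files
# whose open/read fails with an OS-level I/O error, such as
# "Cannot send after transport endpoint shutdown".
# Broken symlinks will also be reported, since they raise OSError.
import os
import sys

def find_unreadable(root):
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    fh.read(1)          # touching one byte is enough
            except OSError as err:
                bad.append((path, err.strerror))
    return bad

if __name__ == "__main__":
    for path, reason in find_unreadable(sys.argv[1]):
        print(f"{path}: {reason}")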

This problem is due to the simultaneous failure of 3 disks. Each OST is configured as RAID6, so it can only tolerate the failure of two drives. We don't understand the cause of the failures, but it doesn't seem to be systematic (not due to the backplane, loose cables, etc.).
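
As a back-of-the-envelope illustration (the drive counts and sizes below are made up, not the actual Graham OST geometry): RAID6 keeps two parity blocks per stripe, so at most two missing drives can be reconstructed; a third simultaneous failure means the data must come back from backup instead.

# Sketch of why RAID6 stops at two failed drives: each stripe carries two
# independent parity blocks (P and Q), so at most two missing blocks can be
# solved for.  Drive counts here are examples, not the real configuration.
def raid6_status(total_drives, failed_drives, drive_tb):
    parity_drives = 2                      # RAID6 keeps dual parity
    usable_tb = (total_drives - parity_drives) * drive_tb
    recoverable = failed_drives <= parity_drives
    return usable_tb, recoverable

for failed in (1, 2, 3):
    usable, ok = raid6_status(total_drives=10, failed_drives=failed, drive_tb=8)
    state = "rebuildable" if ok else "data loss -- restore from backup"
    print(f"{failed} failed drive(s): {usable} TB usable, {state}")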

Currently, if you attempt to access a file on the failed OST, you'll get an error "Cannot send after transport endpoint shutdown".

We are contacting a disk recovery service, but don't yet have any estimate of how long that would take or of the likelihood of success.

We have file-level backups and are already recovering the affected files from tape. This process takes a long time because there are about 15M files, spread across many tapes. We will provide additional information here as it becomes available. We expect to contact the PI of each affected project directly by email.

If you have a critical need for a specific set of files, please let us know (via email to support@tech.alliancecan.ca as normal).

During a routine replacement of a failed disk, other components in the same shelf started showing errors.  This may be due to something simple like a loose connection, so we're checking that.  We'll also be validating the affected disks and doing a full power cycle of the chassis.

This problem currently affects one OSS (server), which provides three shelves of disks (OSTs). If your files happen to be served by other OSSes, your access to files on /project should be unaffected. If you see IO on /project that doesn't complete, it's probably due to this problem. This doesn't affect /home, /scratch, or /nearline.
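
If you want to check whether a particular file lives on the affected hardware, something along these lines may help. This is a hypothetical helper: it assumes the standard Lustre client tool lfs is available, the parsing of lfs getstripe output is approximate, and the OST indices listed are placeholders to be replaced with the indices announced by staff.

# Hypothetical helper: report whether a file's OST indices overlap the
# affected OSTs.  Uses the standard Lustre command `lfs getstripe`; the
# AFFECTED_OSTS values below are placeholders, NOT the real indices.
import subprocess
import sys

AFFECTED_OSTS = {12, 13, 14}   # placeholder values -- use the announced indices

def ost_indices(path):
    """Return the OST indices that `lfs getstripe` reports for `path`."""
    out = subprocess.run(["lfs", "getstripe", path],
                         capture_output=True, text=True, check=True).stdout
    indices = set()
    for line in out.splitlines():
        fields = line.split()
        # Data rows of the stripe table start with the numeric obdidx column.
        if fields and fields[0].isdigit():
            indices.add(int(fields[0]))
    return indices

if __name__ == "__main__":
    for path in sys.argv[1:]:
        hit = ost_indices(path) & AFFECTED_OSTS
        status = "POSSIBLY AFFECTED" if hit else "on other OSTs"
        print(f"{path}: {status}")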


Incident description

System: Graham
Incident status: Open
Start Date:
End Date: No closed date
Created by Kaizaad Bilimorya on

Title


/project Filesystem problem


Summary


We have a problem with one of the OSTs on /project and are investigating. This is currently affecting access to all files on /project. 


Updated by Mark Hahn on