All times are US Eastern Daylight Time unless otherwise indicated.
At about 5 PM on May 21, we received a call from a customer complaining that access to server pl2 was unavailable. We logged on to the server and noted that any attempts to run commands resulted in either a complaint about read-only access or a bus error. We rebooted the server, which restored access, and informed the customer.
We did some initial research and the error indicated that there may be an imminent failure of the RAID array. We noted the issue and began preparations for a maintenance window over the weekend.
At about 8 PM that night, the error recurred and we started to receive tickets indicating that the issue had resurfaced. By 10 PM, our team had met and developed an emergency maintenance action plan. We informed our customers via the ProjectLocker Blog and Twitter that there would be an emergency maintenance window on the server and told our data service provider, IBM Softlayer, to schedule an immediate diagnostic.
The diagnostic indicated that the drive failed and needed immediate replacement. The team replaced the drive and began the rebuild. By 1:47 AM on May 22 the drive had been replaced and the rebuild was in progress.
Due to the nature of the standard IBM configurations, it was not possible to run the server in a fully usable state while the RAID array rebuild was in progress. We were told the rebuild would take about 12 hours.
We checked in about 11 AM to try to get an update. Due to a miscommunication with the data team, it appeared that they indicated that the process would need to be interrupted for us to determine status, and they would receive some sort of notification when the rebuild was complete.
We maintained communications with the team throughout, and at about 4:30 PM, the miscommunication was cleared up when we asked again about when the rebuild would be complete. They rebooted the server and successfully restarted all services. The server was back online by 5:34 PM.
What Went Well?
Our team quickly isolated the issue and was able to initiate the rebuild during off-peak hours. IBM was communicative and responsive throughout the entire process. We informed customers via a publicly accessible location. No other servers were affected. We have not had an issue like this since April 2011.
What Could Have Been Done Better?
The initial response speed was appropriate. However, our in-house team should have had a better understanding of what the disaster recovery process would look like in detail and relied less on the IBM team for guidance. The majority of the downtime for the rebuild was