Post-Mortem: Emergency Maintenance on May 21-22

Published 02 June 2015 by C. G. Brown

All times are US Eastern Daylight Time unless otherwise indicated.

What Happened?

At about 5 PM on May 21, we received a call from a customer reporting that server pl2 was inaccessible. We logged on to the server and noted that any attempt to run a command resulted in either a complaint about read-only access or a bus error. We rebooted the server, which restored access, and informed the customer.
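As an aside, the read-only symptom is one that monitoring can catch before a customer calls. Below is a minimal sketch, assuming a Linux host (and Python 3) that exposes /proc/mounts; it is illustrative only, not the exact check we ran that evening.

    #!/usr/bin/env python3
    # Minimal sketch: flag any mounted filesystem whose mount options include "ro".
    # Assumes a Linux host with /proc/mounts; some virtual filesystems are
    # read-only by design, so a real monitor would whitelist those.

    def readonly_mounts(mounts_path="/proc/mounts"):
        """Return (device, mountpoint) pairs that are currently mounted read-only."""
        flagged = []
        with open(mounts_path) as fh:
            for line in fh:
                device, mountpoint, _fstype, options = line.split()[:4]
                if "ro" in options.split(","):
                    flagged.append((device, mountpoint))
        return flagged

    if __name__ == "__main__":
        for device, mountpoint in readonly_mounts():
            print("WARNING: {0} is mounted read-only at {1}".format(device, mountpoint))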

Our initial research indicated that the error could signal an imminent failure of the server's RAID array. We noted the issue and began preparing for a maintenance window over the weekend.
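For readers who manage their own hosts: on a Linux software RAID managed by mdadm (a different arrangement from the IBM-managed array on pl2), a degraded array is visible in /proc/mdstat, and a short script can flag it. A rough sketch, illustrative only and not the diagnostic IBM ran:

    #!/usr/bin/env python3
    # Rough sketch: report md arrays whose member-status string (e.g. [U_])
    # shows a missing or failed device. Assumes Linux software RAID (mdadm).
    import re

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        """Return names of md arrays that /proc/mdstat reports as degraded."""
        degraded = []
        current = None
        with open(mdstat_path) as fh:
            for line in fh:
                if line.startswith("md"):
                    current = line.split()[0]   # e.g. "md0"
                match = re.search(r"\[([U_]+)\]", line)
                if current and match and "_" in match.group(1):
                    degraded.append(current)
        return degraded

    if __name__ == "__main__":
        bad = degraded_arrays()
        if bad:
            print("Degraded arrays: " + ", ".join(bad))
        else:
            print("All md arrays report healthy members.")

Run from cron or a monitoring agent, a check like this turns a degraded array into an alert rather than a support ticket.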

At about 8 PM that night, the error recurred and customer tickets began to come in reporting the same issue. By 10 PM, our team had met and developed an emergency maintenance action plan. We informed our customers via the ProjectLocker Blog and Twitter that there would be an emergency maintenance window on the server, and we asked our data center provider, IBM Softlayer, to schedule an immediate diagnostic.

The diagnostic indicated that the drive had failed and needed immediate replacement. The team replaced the drive and began the rebuild; by 1:47 AM on May 22, the rebuild was in progress.

Due to the nature of the standard IBM configurations, it was not possible to run the server in a fully usable state while the RAID array rebuild was in progress. We were told the rebuild would take about 12 hours.

We checked in at about 11 AM for an update. Due to a miscommunication with the data center team, we came away believing that the rebuild would have to be interrupted for us to determine its status, and that they would receive a notification when it was complete.

We maintained communication with the team throughout, and at about 4:30 PM, when we asked again when the rebuild would be complete, the miscommunication was cleared up. They rebooted the server and successfully restarted all services. The server was back online by 5:34 PM.
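One lesson buried in that exchange: rebuild progress should be observable without interrupting the rebuild. As an illustration, on a Linux software RAID managed by mdadm, /proc/mdstat reports percent complete and an estimated finish time that can be read at any point; the standard IBM configuration on pl2 may not expose an equivalent, but this sketch shows the kind of non-intrusive status check we wanted.

    #!/usr/bin/env python3
    # Sketch of a non-intrusive rebuild monitor for Linux software RAID (mdadm).
    # It polls /proc/mdstat, which can be read without pausing the rebuild.
    import re
    import time

    PROGRESS = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%.*finish=([\d.]+)min")

    def rebuild_status(mdstat_path="/proc/mdstat"):
        """Return (percent_done, minutes_remaining), or None if no rebuild is running."""
        with open(mdstat_path) as fh:
            match = PROGRESS.search(fh.read())
        if match:
            return float(match.group(2)), float(match.group(3))
        return None

    if __name__ == "__main__":
        while True:
            status = rebuild_status()
            if status is None:
                print("No rebuild in progress.")
                break
            percent, minutes = status
            print("Rebuild {0:.1f}% complete, about {1:.0f} minutes remaining.".format(percent, minutes))
            time.sleep(300)   # poll every five minutes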

What Went Well?

Our team quickly isolated the issue and was able to initiate the rebuild during off-peak hours. IBM was communicative and responsive throughout the entire process. We kept customers informed via publicly accessible channels (the ProjectLocker Blog and Twitter). No other servers were affected, and we have not had an issue like this since April 2011.

What Could Have Been Done Better?

The initial response speed was appropriate. However, our in-house team should have had a better understanding of what the disaster recovery process would look like in detail, and should have relied less on the IBM team for guidance. The majority of the downtime for the rebuild was...

Continue Reading...

Topics: Disaster Recovery, Maintenance

Repository Safety at ProjectLocker

Published 18 June 2014 by Runako Godfrey

Yesterday, a security breach led to a catastrophic data loss at Codespaces, another company in the repository hosting industry. We're very sorry for the Codespaces customers who lost data, as well as the Codespaces team, who suffered the loss of years of their hard work.

Continue Reading...

Topics: Business, Security, Disaster Recovery

Software Development Is Hard, or A Meditation on Why We're Here

Published 21 April 2014 by C. G. Brown


Continue Reading...

Topics: Craftsmanship, Business, Software Development, Disaster Recovery

Disaster Recovery for Your Code

Published 15 April 2014 by Runako Godfrey

Business continuity planning sucks. You have to think of all the things that can go wrong and interrupt your business, and then design strategies to mitigate them in advance. It's a lot of important work, and if you're lucky, it will be mostly wasted effort.

Continue Reading...

Topics: Disaster Recovery
