Post-Mortem: Emergency Maintenance on May 21-22

Published 02 June 2015 by C. G. Brown

All times are US Eastern Daylight Time unless otherwise indicated.

What Happened?

At about 5 PM on May 21, we received a call from a customer complaining that access to server pl2 was unavailable. We logged on to the server and noted that any attempts to run commands resulted in either a complaint about read-only access or a bus error. We rebooted the server, which restored access, and informed the customer.

Our initial research suggested that the error pointed to an imminent failure of the RAID array. We noted the issue and began preparing for a maintenance window over the weekend.

At about 8 PM that night, the error recurred and we started to receive tickets indicating that the issue had resurfaced. By 10 PM, our team had met and developed an emergency maintenance action plan. We informed our customers via the ProjectLocker Blog and Twitter that there would be an emergency maintenance window on the server and asked our data service provider, IBM SoftLayer, to schedule an immediate diagnostic.

The diagnostic indicated that a drive had failed and needed immediate replacement. By 1:47 AM on May 22, the team had replaced the drive and the rebuild was in progress.

Due to the nature of the standard IBM configurations, it was not possible to run the server in a fully usable state while the RAID array rebuild was in progress. We were told the rebuild would take about 12 hours.

We checked in at about 11 AM for an update. Due to a miscommunication with the data team, we understood them to say that the rebuild would have to be interrupted to determine its status, and that they would receive a notification when it was complete.

We maintained communications with the team throughout, and at about 4:30 PM, the miscommunication was cleared up when we asked again about when the rebuild would be complete. They rebooted the server and successfully restarted all services. The server was back online by 5:34 PM.

What Went Well?

Our team quickly isolated the issue and was able to initiate the rebuild during off-peak hours. IBM was communicative and responsive throughout the entire process. We informed customers via a publicly accessible location. No other servers were affected. We have not had an issue like this since April 2011.

What Could Have Been Done Better?

The initial response speed was appropriate. However, our in-house team should have had a better understanding of what the disaster recovery process would look like in detail and relied less on the IBM team for guidance. The majority of the downtime for the rebuild was unavoidable, but it is not clear that the last couple of hours of downtime were warranted given the misunderstanding about the machine’s requirements for reporting.

We will address the obvious question of redundancy separately below.

What Will We Do Differently?

We are meeting with the IBM team in the next couple of weeks to discuss our protocols and ensure that we have a more detailed action plan for different types of hardware failures. In particular, this plan should include more frequent check-ins; while we monitored the communication thread for updates, we could have been more proactive about following up, despite the extended ETA we received from our rebuild team.

We will review our internal reporting protocols to ensure we’re communicating with affected customers correctly and in the ways they prefer to be reached.

What about Redundancy?

A few customers expressed concerns that we do not offer hot or warm failover in the event of a catastrophic failure. First, let me assure you that we take a number of measures to protect your data. Measures include:

  • 24x7 monitoring

  • Regular disaster recovery backups

  • RAID 10 array configuration so that no single drive failure can cause data loss

  • Distributed production architecture so that there is no single point of failure for repository access that affects all customers

These measures are why in 11 years of operation, ProjectLocker has never lost a repository’s data.
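For readers curious about the RAID 10 point in the list above, here is a minimal sketch of why a single drive failure does not cause data loss. It is an illustration only: the drive names and the four-drive, two-mirror-pair layout are assumptions, not a description of the actual hardware in pl2.

```python
from itertools import combinations


def raid10_survives(failed, mirror_pairs):
    """Return True if every mirror pair still has at least one working drive."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


# Hypothetical four-drive array: two mirrored pairs, striped together.
# The real drive count and layout are not stated in this post.
MIRROR_PAIRS = [("d0", "d1"), ("d2", "d3")]
DRIVES = [drive for pair in MIRROR_PAIRS for drive in pair]

# Any one drive can fail without losing data.
single_failures_ok = all(raid10_survives([d], MIRROR_PAIRS) for d in DRIVES)

# Two simultaneous failures are fatal only when both hit the same mirror pair.
fatal_double_failures = [
    pair
    for pair in combinations(DRIVES, 2)
    if not raid10_survives(pair, MIRROR_PAIRS)
]

print(single_failures_ok)
print(fatal_double_failures)
```

In this model the array loses data only when both drives in the same mirror pair are down at once, which is why a failed drive should be replaced and rebuilt promptly rather than left for later.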

We’ve only seen failures of this magnitude about 3 times over the past 11 years, ranging from drive failures to a data center fire in 2008. In none of these failures did we lose data, although we did have customers offline for longer than we would have liked.

We’ve made a conscious choice to invest in protection over redundancy so that we can continue to offer the service to most teams at the price points the market demands ($19-$99/month for most organizations). For customers that do require warm failover because they cannot afford any downtime, we can provide a quote for an offering at additional cost. You can contact us at [email protected] if you’re interested in this offering.

Conclusion

The ProjectLocker team feeds our families by keeping your data safe, secure, and accessible. We don’t have a more important or more pressing job. Please know that we take every reasonable measure to provide enterprise-grade service at prosumer prices to your software development team. If you have any questions for us, don’t hesitate to reach out. 

Topics: Disaster Recovery, Maintenance
