Outage on *1.projectlocker.com - Status and Updates

Published 17 July 2014 by C. G. Brown

We are aware of an unscheduled outage on *1.projectlocker.com this morning. Users on other servers are not affected. We'll keep posting updates here.

Background

Last night, the server that hosts *1.projectlocker.com was physically relocated. The server was brought back online successfully and the primary IP is working. However, the secondary subnet on which Subversion, Trac, and Git are hosted appears to have a network routing issue. We're working with our data center team to get this resolved as quickly as possible.

Updates (all times EDT / GMT-4)

10:00 am - Confirmed that the data center is working on the problem, and made them aware of a possible network issue. Awaiting reply.

10:12 am - Routing appears to be working properly from our offices. Services may be back online, but we're awaiting confirmation from our data center.

10:18 am - Data center has confirmed that the VLAN had a misconfiguration after the server relocation. The misconfiguration is resolved and services should be back online.

Why Did This Happen?

Some of you have noticed a few problems on this server over the past few weeks. The bad news is that these outages have meant a lower standard of service than you've come to expect from us, and for that we sincerely apologize. The good news is a two-parter: a) the issues affecting the server were unrelated to one another, and b) each issue has been isolated and resolved. We're going to walk through each of the problems below and talk about how we are dealing with them.

Problem: Storage

One of the disks recently ran out of space because the logs were not being rotated properly. We have fixed the log rotation and do not anticipate further problems there. We don't provision new accounts on that server, so organic growth and logs are the sources of additional disk usage.
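To illustrate the kind of early warning that helps with a full disk, here is a minimal sketch of a disk usage check in Python. It is not our actual tooling; the function names, the 90% threshold, and the path being checked are all assumptions made for the example.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_attention(percent_used, threshold=90.0):
    """Return True when usage has reached the (assumed) warning threshold."""
    return percent_used >= threshold


if __name__ == "__main__":
    # "/" is an illustrative path; a real check would target the volume holding the logs.
    percent = disk_usage_percent("/")
    status = "WARNING" if needs_attention(percent) else "ok"
    print(f"{status}: {percent:.1f}% of disk in use")
```

Run on a schedule, a check like this would flag steady growth from logs well before the disk fills.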

Problem: Hardware

We were seeing instability a couple of weeks ago on pl1. We did some research but could not isolate an issue. We had our data center run a hardware diagnostic, and they identified a defective RAM stick, a failure that is statistically common after several years of operation. They replaced the RAM stick, and the stability issue was resolved.

Problem: Relocation

Our data center relocated the physical server housing pl1 last night. While this is not a common procedure (we've only seen it done once in 11 years, and that was due to an emergency), our data center is usually fantastic about getting everything back online. This time, they brought the primary IP back online correctly, but the secondary subnet that serves all the ProjectLocker services was misconfigured on their network. As a result, the servers appeared to be online to our monitoring tools but offline to our customers. Once we isolated the problem and informed them, they promptly got the servers back online.
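The gap described above, where monitoring sees a server as up while customers cannot reach it, can be narrowed by probing every customer-facing address rather than a single one. The following is a minimal sketch of that idea in Python, not a description of our monitoring setup. The function names are hypothetical, the addresses come from the reserved documentation range 192.0.2.0/24, and the ports are assumptions. To reflect what customers see, a check like this would need to run from a network outside the data center.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (name, host, port) entries that could not be reached."""
    return [ep for ep in endpoints if not is_reachable(ep[1], ep[2], timeout)]


if __name__ == "__main__":
    # Hypothetical endpoints: one on the primary IP, the rest on the service subnet.
    endpoints = [
        ("primary", "192.0.2.10", 22),
        ("svn-trac", "192.0.2.20", 443),
        ("git", "192.0.2.21", 22),
    ]
    for name, host, port in unreachable_endpoints(endpoints):
        print(f"UNREACHABLE: {name} ({host}:{port})")
```

Listing each service address separately means a routing fault that affects only one subnet shows up as a failure instead of being masked by a healthy primary IP.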

Going Forward

We are meeting as a team to discuss our monitoring procedures and response strategy. We are a US-based team, but we recognize that our customers work around the clock and around the world. Know that we take any outage seriously, and that we are working hard to continue to give you enterprise-grade support at an affordable price.
