redundancy | Open Query

As I reported via Twitter late last week, we encountered an issue that got some of our mail delivery delayed by about a day and a half. I’ll explain more about what happened as I believe in openness on these matters, and also the experience has educational content for others.

Our mail server doesn’t have direct external interaction, it’s shielded by two relays that handle both the inbound MX and the outbound queue. This setup works remarkably well in terms of exposure to spam and other malicious activity. As previously discussed, it appears that it’s more difficult to make mail server infra more resilient without expending lots more time/effort and infrastructure expenditure. Just because of the way the common tools for mail delivery and imap are built, having two or more of each in a semi-active setup gets quite complex. Complexity is in itself a risk so it has to be considered in relation to the costs and risks of the alternatives.

When our mail server becomes unavailable, incoming mail is queued, and we have backups so no mail is actually lost. The cost is the time and effort involved in getting a full replacement server up and running from a backup. That can be optimised/prepared to a point, but mail is still a lot more data than most other web infrastructure so shuffling that data around just takes a while. Some outbound queues from our online services (for instance our client services system Redmine) goes straight to the relays so there is less impact there. Apart from backups elsewhere, have redundancy for the mailserver: an identical instance on a server in the same DC (those servers are our own).

So what happened last week? Our servers resided in a rack which was leased from the DC by another company through which we “sublet” the rack space, connection and bandwidth. This is a common scenario, as small businesses don’t generally need a full rack and datacentres prefer dealing with fewer/bigger clients and set their pricing accordingly. The intermediate company became unavailable which put our servers in a temporary legal limbo. The DC only gives access to the primary lessor of the rack, so us asking for access to move our servers wasn’t straightforward. Of course we had documentation to back up our assertion as to which equipment was ours, but as you can imagine that legal avenue takes longer to resolve – fortunately the owner of the intermediate company communicated well with the operations manager at the DC and that’s how we were able to retrieve our gear relatively quickly.

We’re still in the same DC, but are now a direct client of the DC in a shared rack. That may appear odd in the context of what I wrote before, but since we first moved there several years ago the DC has improved their infrastructure management to the point where servicing smaller clients is not a resource drain and thus they have sensible plans available. That’s brilliant given the market, but it’s actually quite unusual – commonly companies aim for bigger clients rather than recognising an opportunity to server small clients.

While this was going on we were of course working on a separate replacement mailserver, built from the backups. Since normally we’d have a replacement server already set up, the “build from scratch using backups” is a slower path. As it turned out, we got our servers back online around the same time we had our replacement ready, and for various reasons it was easier to just use the original servers at that point.

From this story you can work out several useful lessons, remembering that it’s always a trade-off. At some point the cost of being able to mitigate a particular scenario is so high that it’s not worthwhile. You just have to plan for several most common possibilities, with a slower recovery from backup as the last resort.

There’s also another piece of information which is highly relevant for Australian businesses, and that’s the Australian Personal Property Securities Register. Legislation for this system was enacted in 2009, the scheme is only since January 2012 and there’s a two-year transitional period. Remember how “posession is 9/10ths of the law” ? Well, if you ignore PPS it’s now 10/10ths. It is the primary and only register and reference for ownership of items (and data!) that are in care of another legal entity. So we own some servers, that reside in a rack of another company in a DC. We register ourselves, and then our servers (short description and serial numbers and such) and associated data content with PPS, against both the intermediate company (which had legal charge over the rack they reside in) and the hosting company (where the items physically reside). This way, we have a claim that indeed the stuff is ours, but also since the PPS is the only register we have ensure that noone else (inadvertently or even maliciously) claims to own something that’s actually ours. If you have a similar situation (and remember that data is as important as physical items!) you want to register it with PPS. The registration process is somewhat convoluted, but it is free – searches cost. Remember IANAL (I am not a lawyer) so do your research and get appropriate legal advice. If you’re not in Australia, other similar legislation may apply and you’ll want to check to make sure you’re safe.

This is a “dogfood” type story (see below for explanation of the term)… Open Query has ideas on resilient architecture which it teaches (training) and recommends (consulting, support) to clients and the general public (blog, conferences, user group talks). Like many other businesses, when we first started we set up our infrastructure quickly and on the cheap, and it’s grown since. That’s how things grow naturally, and is as always a trade-off between keeping your business running and developing while also improving infrastructure (business processes and technical).

Quite a few months ago we also started investing (mostly time) in the technical infrastructure, and slowly moving the various systems across to new servers and splitting things up along the way. Around the same time, the main webserver frequently became unresponsive. I’ll spare you the details, we know what the problem was and it was predictable, but since it wasn’t our system there was only so much we could do. However, systems get dependencies over time and thus it was actually quite complicated to move. In fact, apart from our mail, the public website was the last thing we moved, and that was through necessity not desire.

Of course it’s best for a company when their public website works, it’s quite likely you have noticed some glitches in ours over time. Now running on the new infra, I happened to take a quick peek at our Google Analytics data, and noticed an increase in average traffic numbers of about 40%. Great big auch.

And I’m telling this, because I think it’s educational and the world is generally not served by companies keeping problems and mishaps secret. Nasties grow organically and without malicious intent, improvements are a step-wise process, all that… but in the end, the net results of improvements can be more amazing than just general peace of mind! And of course it’s very important to not just see things happen, but to actively work on those incremental improvements, ongoing.

Our new infra has dual master MySQL servers (no surprise there 😉 but based in separate data centres so that makes the setup a bit more complicated (MMM doesn’t deal with that setup). Other “new” components we use are lighttpd, haproxy, and Zimbra (new in the sense that our old external infra used different tech). Most systems (not all, yet) are redundant/expendable and run on a mix of Linode instances and our own machines. Doing these things for your own infra is particularly educational, it provides extra perspective. The result is, I believe, pretty decent. Failures generally won’t cause major disruption any more, if at all. Of course, it’s still work in progress.

Running costs of this “farm”? I’ll tell later, as I think it’s a good topic for a poll and I’m curious: how much do you spend on server infrastructure per month?

Background for non-Anglophones: “eating your own dogfood” refers to a company doing themselves what they’re recommending to their clients and in general. Also known as “leading by example”, but I think it’s also about trust and credibility. On the other hand, there’s the “dentist’s tooth-ache” which refers to the fact that doctors are their own worst patients 😉

Tag: redundancy

Server Ownership Legalities

Dogfood: making our systems more resilient

Share this:

Share this: