As I reported via Twitter late last week, we encountered an issue that delayed some of our mail delivery by about a day and a half. I’ll explain more about what happened, as I believe in openness on these matters and the experience may also be educational for others.
Our mail server doesn’t have direct external interaction; it’s shielded by two relays that handle both the inbound MX and the outbound queue. This setup works remarkably well in terms of limiting exposure to spam and other malicious activity. As previously discussed, it appears difficult to make mail server infrastructure more resilient without spending a lot more time, effort and money on infrastructure. Just because of the way the common tools for mail delivery and IMAP are built, having two or more of each in a semi-active setup gets quite complex. Complexity is in itself a risk, so it has to be considered in relation to the costs and risks of the alternatives.
When our mail server becomes unavailable, incoming mail is queued, and we have backups so no mail is actually lost. The cost is the time and effort involved in getting a full replacement server up and running from a backup. That can be optimised/prepared to a point, but mail is still a lot more data than most other web infrastructure, so shuffling that data around just takes a while. Some outbound queues from our online services (for instance our client services system Redmine) go straight to the relays, so there is less impact there. Apart from backups elsewhere, we also have redundancy for the mailserver: an identical instance on a server in the same DC (those servers are our own).
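To make the queueing behaviour a bit more concrete, here is a minimal sketch in Python of the store-and-forward idea. It is purely illustrative and is not our actual relay configuration: the `Backend` and `Relay` classes and their methods are made-up names, and real mail software adds retry schedules, queue lifetimes and bounces on top of this.

```python
from collections import deque


class Backend:
    """Stand-in for the shielded mail server (hypothetical, for illustration)."""

    def __init__(self):
        self.available = True
        self.mailbox = []

    def accept(self, message):
        if not self.available:
            raise ConnectionError("mail server unreachable")
        self.mailbox.append(message)


class Relay:
    """Toy store-and-forward relay: queue first, deliver when possible."""

    def __init__(self, backend):
        self.backend = backend
        self.queue = deque()

    def receive(self, message):
        # Always queue first so nothing is lost if delivery fails.
        self.queue.append(message)
        self.flush()

    def flush(self):
        # Deliver in arrival order; stop at the first failure and keep the rest queued.
        while self.queue:
            try:
                self.backend.accept(self.queue[0])
            except ConnectionError:
                return False
            self.queue.popleft()
        return True


backend = Backend()
relay = Relay(backend)

backend.available = False          # the mail server goes away
relay.receive("message 1")
relay.receive("message 2")
print(len(relay.queue), backend.mailbox)   # 2 []

backend.available = True           # the mail server comes back
relay.flush()
print(len(relay.queue), backend.mailbox)   # 0 ['message 1', 'message 2']
```

The point of the sketch is simply that the relay never hands a message off until the server has accepted it, which is why an outage shows up as delay rather than loss.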
So what happened last week? Our servers resided in a rack which was leased from the DC by another company, through which we “sublet” the rack space, connection and bandwidth. This is a common scenario, as small businesses don’t generally need a full rack, and datacentres prefer dealing with fewer/bigger clients and set their pricing accordingly. The intermediate company became unavailable, which put our servers in a temporary legal limbo. The DC only gives access to the primary lessee of the rack, so our request for access to move our servers wasn’t straightforward. Of course we had documentation to back up our assertion as to which equipment was ours, but as you can imagine that legal avenue takes longer to resolve. Fortunately the owner of the intermediate company communicated well with the operations manager at the DC, and that’s how we were able to retrieve our gear relatively quickly.
We’re still in the same DC, but are now a direct client of the DC in a shared rack. That may appear odd in the context of what I wrote before, but since we first moved there several years ago the DC has improved their infrastructure management to the point where servicing smaller clients is not a resource drain, and thus they have sensible plans available. That’s brilliant given the market, but it’s actually quite unusual: commonly companies aim for bigger clients rather than recognising an opportunity to serve small clients.
While this was going on we were of course working on a separate replacement mailserver, built from the backups. Since normally we’d have a replacement server already set up, the “build from scratch using backups” is a slower path. As it turned out, we got our servers back online around the same time we had our replacement ready, and for various reasons it was easier to just use the original servers at that point.
From this story you can work out several useful lessons, remembering that it’s always a trade-off. At some point the cost of being able to mitigate a particular scenario is so high that it’s not worthwhile. You just have to plan for the most common possibilities, with a slower recovery from backup as the last resort.
There’s also another piece of information which is highly relevant for Australian businesses, and that’s the Australian Personal Property Securities Register. Legislation for this system was enacted in 2009, but the scheme has only been in operation since January 2012 and there’s a two-year transitional period. Remember how “possession is 9/10ths of the law”? Well, if you ignore PPS it’s now 10/10ths. It is the primary and only register and reference for ownership of items (and data!) that are in the care of another legal entity. So we own some servers that reside in a rack of another company in a DC. We register ourselves, and then our servers (short description, serial numbers and such) and associated data content with PPS, against both the intermediate company (which had legal charge over the rack they reside in) and the hosting company (where the items physically reside). This way we have a registered claim that the stuff is indeed ours and, since the PPS is the only register, we also ensure that no one else (inadvertently or even maliciously) claims to own something that’s actually ours. If you have a similar situation (and remember that data is as important as physical items!) you want to register it with PPS. The registration process is somewhat convoluted, but it is free; only searches cost money. Remember IANAL (I am not a lawyer), so do your research and get appropriate legal advice. If you’re not in Australia, other similar legislation may apply and you’ll want to check to make sure you’re safe.