Tag Archives: recovery

Server Ownership Legalities

As I reported via Twitter late last week, we encountered an issue that got some of our mail delivery delayed by about a day and a half. I’ll explain more about what happened as I believe in openness on these matters, and also the experience has educational content for others.

Our mail server doesn’t have direct external interaction, it’s shielded by two relays that handle both the inbound MX and the outbound queue. This setup works remarkably well in terms of exposure to spam and other malicious activity. As previously discussed, it appears that it’s more difficult to make mail server infra more resilient without expending lots more time/effort and infrastructure expenditure. Just because of the way the common tools for mail delivery and imap are built, having two or more of each in a semi-active setup gets quite complex. Complexity is in itself a risk so it has to be considered in relation to the costs and risks of the alternatives.

When our mail server becomes unavailable, incoming mail is queued, and we have backups so no mail is actually lost. The cost is the time and effort involved in getting a full replacement server up and running from a backup. That can be optimised/prepared to a point, but mail is still a lot more data than most other web infrastructure so shuffling that data around just takes a while. Some outbound queues from our online services (for instance our client services system Redmine) goes straight to the relays so there is less impact there. Apart from backups elsewhere, have redundancy for the mailserver: an identical instance on a server in the same DC (those servers are our own).

So what happened last week? Our servers resided in a rack which was leased from the DC by another company through which we “sublet” the rack space, connection and bandwidth. This is a common scenario, as small businesses don’t generally need a full rack and datacentres prefer dealing with fewer/bigger clients and set their pricing accordingly. The intermediate company became unavailable which put our servers in a temporary legal limbo. The DC only gives access to the primary lessor of the rack, so us asking for access to move our servers wasn’t straightforward. Of course we had documentation to back up our assertion as to which equipment was ours, but as you can imagine that legal avenue takes longer to resolve – fortunately the owner of the intermediate company communicated well with the operations manager at the DC and that’s how we were able to retrieve our gear relatively quickly.

We’re still in the same DC, but are now a direct client of the DC in a shared rack. That may appear odd in the context of what I wrote before, but since we first moved there several years ago the DC has improved their infrastructure management to the point where servicing smaller clients is not a resource drain and thus they have sensible plans available. That’s brilliant given the market, but it’s actually quite unusual – commonly companies aim for bigger clients rather than recognising an opportunity to server small clients.

While this was going on we were of course working on a separate replacement mailserver, built from the backups. Since normally we’d have a replacement server already set up, the “build from scratch using backups” is a slower path. As it turned out, we got our servers back online around the same time we had our replacement ready, and for various reasons it was easier to just use the original servers at that point.

From this story you can work out several useful lessons, remembering that it’s always a trade-off. At some point the cost of being able to mitigate a particular scenario is so high that it’s not worthwhile. You just have to plan for several most common possibilities, with a slower recovery from backup as the last resort.

There’s also another piece of information which is highly relevant for Australian businesses, and that’s the Australian Personal Property Securities Register. Legislation for this system was enacted in 2009, the scheme is only since January 2012 and there’s a two-year transitional period. Remember how “posession is 9/10ths of the law” ? Well, if you ignore PPS it’s now 10/10ths. It is the primary and only register and reference for ownership of items (and data!) that are in care of another legal entity. So we own some servers, that reside in a rack of another company in a DC. We register ourselves, and then our servers (short description and serial numbers and such) and associated data content with PPS, against both the intermediate company (which had legal charge over the rack they reside in) and the hosting company (where the items physically reside). This way, we have a claim that indeed the stuff is ours, but also since the PPS is the only register we have ensure that noone else (inadvertently¬†or even maliciously) claims to own something that’s actually ours. If you have a similar situation (and remember that data is as important as physical items!) you want to register it with PPS. The registration process is somewhat convoluted, but it is free – searches cost. Remember IANAL (I am not a lawyer) so do your research and get appropriate legal advice. If you’re not in Australia, other similar legislation may apply and you’ll want to check to make sure you’re safe.

 

What a Hosting Provider did Today

I found Dennis the Menace, he now has a job as system administrator for a hosting company. Scenario: client has a problem with a server becoming unavailable (cause unknown) and has it restarted. MySQL had some page corruption in the InnoDB tablespace.

The hosting provider, being really helpful, goes in as root and first deletes ib_logfile* then ib* in /var/lib/mysql. He later says “I am sorry if I deleted it. I thought I deleted the log only. Sorry again.”¬† Now this may appear nice, but people who know what they’re doing with MySQL will realise that deleting the iblogfiles actually destroys data also. MySQL of course screams loudly that while it has FRM files it can’t find the tables. No kidding!

Then, while he’s been told to not touch anything any more, and I’m trying to see if I can recover the deleted files on ext3 filesystem (yes there are tools for that), he goes in again and puts an ibdata1 file back. No, not the logfiles – but he had those somewhere else too. The files get restored and turn out to be two months old (no info on how they were made in the first place but that’s minor detail in this grand scheme). All the extra write activity on the partition would’ve also made potential deleted file recovery more difficult or impossible.

This story will still get a “happy” ending, using a recent mysqldump to load a new server at a different hosting provider. Really – some helpfulness is not what you want. Secondary lesson: pick your hosting provider with care. Feel free to ask us for recommendations as we know some excellent providers and have encountered plenty of poor ones.

MySQL data backup: going beyond mysqldump

A user on a linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.

> […] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
>
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
>
> Has your backup plan ever been put to the test?

You should put it to the test regularly, not just when it’s needed.
An untested backup is not really a backup, I think.

At Open Query we tend to use dual master setups with MMM, other replication slaves, mysqldump, and XtracBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:

  • XtraBackup and LVM give you physical backups. that’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong, it’ll be all stuffed (that is, you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot between multiple volumes consistently, so if you have your InnoDB ibdata/IBD files and iblog files on separate spindles, using LVM is not suitable.
  • mysqldump for logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load so having *only* an SQL dump of a larger dataset is not a good idea.
  • some of the above backups can and should *also* be copied off-site. that’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
  • having dual masters is for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
  • slaves. You can even delay a slave (Maatkit has a tool for this), so that would give you a live correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one. Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.

A key issue to consider is this… if the dataset is sufficiently large, and the online requirements high enough, you can’t afford to just have backups. Why? Because, how quickly can you deploy new suitable hardware, install OS, do restore, validate, put back online?

In many cases one or more aspects of the above list simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.

So, we use a combination of tools and approaches depending on needs, but in general terms we aim for keeping the overall environment online (individual machines can and will fail! relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required.
Into that picture also comes using separate test/staging servers to not have developers stuff around on live servers (human error is an important cause of hassles).

In our training modules, we’ve combined the backups, recovery and replication topics as it’s clearly all intertwined and overlapping. Discussing backup techniques separate from replication and dual master setups makes no sense to us. It needs to be put in place with an overall vision.

Note that a SAN is not a backup strategy. And neither is replication on its own.

Tool of the Day: rsnapshot

rsnapshot is a filesystem snapshot utility for making backups of local and remote systems, based on rsync. Rather than just doing a complete copy every time, it uses hardlinks to create incrementals (which are from a local perspective a full backup also). You can specify how long to keep old backups, and all the other usual jazz. You’d generally have it connect over ssh. You’ll want/need to run it on a filesystem that supports hardlinks, so that precludes NTFS.

In the context of MySQL, you can’t just do a filesystem copy of your MySQL data/logs, that would be inconsistent and broken. (amazingly, I still see people insisting/arguing on this – but heck it’s your business/data to gamble with, right?)

Anyway, if you do a local mysqldump also, or for instance use XtraBackup to take a binary backup of your InnoDB tablespace/logs, then rsnapshot can be used to automate the transfer of those files to a different geographical location.

Two extra things you need to do:

  • Regularly test your backups. They can fail, and that can be fatal. For XtraBackup, run the prepare command and essentially start up a MySQL instance on it to make sure it’s all happy. Havint this already done also saves time if you need to restore.
  • For restore time, you need to include the time needed to transfer files back to the target server.