A day in the life of Datacenter Disasters

Open Query currently hosts a large part of our infrastructure at Linode. We are extremely happy with their performance, stability and support. Unfortunately, any chain is only as strong as its weakest link. This week there was a major thunderstorm near the Hurricane Electric datacenter in Fremont (anyone else think that name is funny in combination with the event in question?), and through a massive power surge most of HE’s datacenter lost power. Among the Linodes affected in our infrastructure were all of the machines involved in our MMM (Multi-Master Replication Manager for MySQL) setup.

The masters came back up before the monitor, which is around the time I was alerted. Logging in, I noticed replication was broken on one of the masters, but the other master seemed healthy. The monitor was not up, and it looked like it could be hours before it came back, so I decided it was time for manual action. Since our MMM setup doesn’t currently have slaves, a good option seemed to be to mimic MMM and move the virtual IP to the healthy server.

I executed the following manual commands to make the desired changes:

$ ip addr add <virtip> dev eth0
$ /usr/sbin/arping -I eth0 -c 5 <virtip>

That brought all our applications back online, which was the desired effect. I manually fixed replication by repositioning the masters. A while later, the monitor came up and automatically took over, bringing everything back to normal.
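
For illustration, the two manual commands above could be wrapped in a small script that prints what it would run before anything is changed. This is only a sketch, not part of MMM: the names DRY_RUN, run and takeover_vip are hypothetical, eth0 is assumed as the default interface, and 192.0.2.10 is a documentation-range address standing in for the real virtual IP.

```shell
#!/bin/sh
# Sketch only: wraps the two manual commands from the post.
# DRY_RUN defaults to 1, so nothing is changed unless it is set to 0.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# takeover_vip VIRTIP [IFACE]
# Adds the virtual IP to the interface, then runs the arping command from the post.
takeover_vip() {
    virtip=$1
    iface=${2:-eth0}
    run ip addr add "$virtip" dev "$iface" || return 1
    run /usr/sbin/arping -I "$iface" -c 5 "$virtip"
}

# Dry run: print the commands for a placeholder address.
takeover_vip 192.0.2.10 eth0
```

With DRY_RUN left at its default, the script only prints the two commands. Running it with DRY_RUN=0 as root would execute them, so it is worth reading the printed commands first.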

Everything went well, but it wasn’t until the next morning that I realised there was a possible flaw in my logic (it didn’t affect us, but I wanted to blog about it so that others are aware of it). When replication stopped, master A was active. My commands above made master B the active master. In theory, writes could have been sent to master A after replication broke. Commands sent to master B afterwards would then presume those writes had been executed there, which they had not, because replication never delivered them. This is one of those niche occasions where data drift can occur without anyone noticing.
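
The sequence described above can be made concrete with a toy model. This is not MySQL and not MMM: each "master" is just a shell variable holding one number, and the names stock_a, stock_b, write_a, write_b and report are made up for the illustration. It assumes the simplest case, where writes on A stop reaching B once replication breaks.

```shell
#!/bin/sh
# Toy model: each master holds a single value, "stock". Not real MySQL.
stock_a=0
stock_b=0
replication_ok=1   # 1 while writes on A still reach B

# Apply a write on master A; it is copied to B only while replication works.
write_a() {
    stock_a=$1
    if [ "$replication_ok" -eq 1 ]; then
        stock_b=$1
    fi
}

# Apply a write on master B (the master holding the virtual IP after the move).
write_b() {
    stock_b=$1
}

# Print both copies and whether they still agree.
report() {
    if [ "$stock_a" -eq "$stock_b" ]; then
        echo "in sync: A=$stock_a B=$stock_b"
    else
        echo "drift: A=$stock_a B=$stock_b"
    fi
}

write_a 10                 # normal operation, replicated to B
replication_ok=0           # replication breaks
write_a 7                  # A accepts a write that B never sees
write_b $((stock_b - 1))   # after the move, the application decrements B's stale value
report                     # prints "drift: A=7 B=9"
```

Neither master reports an error at any point in this sequence. The only sign of trouble is that the two copies no longer agree, which is why an explicit comparison is needed afterwards.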

My recommendation is not to do what I did unless you are very certain your setup doesn’t suffer from this potential problem. If you do decide to use this trick, however, make sure to run Maatkit’s mk-table-checksum and mk-table-sync once all is well again to check for (and correct!) data drift.