Quest for Resilience: Multi-DC Masters

This is a Request for Input. Dual MySQL masters with MMM in a single datacentre are in common use, and other setups such as DRBD and, of course, VM/SAN-based failover are conceptually straightforward too. Achieving various forms of resilience within a single datacentre is therefore doable and not costly.

Doing the same across multiple datacentres (for simplicity's sake, let’s limit it to two) is another matter. MySQL replication works well across longer links, and it can be secured with MySQL’s built-in SSL or tools like stunnel. It needs to be monitored, as usual, but since it’s asynchronous the latency between the datacentres is not a big issue (apart from the second server being up to date a little later).
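As a rough illustration of what "keeping an eye on it" might look like, here is a minimal Python sketch that inspects one row of `SHOW SLAVE STATUS` output, passed in as a plain dictionary. The field names (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`) are the ones MySQL reports; the function name and the 30-second threshold are assumptions made for this example, and fetching the row from the server is left out.

```python
from typing import List, Mapping, Optional


def replication_problems(
    status: Mapping[str, Optional[str]],
    max_lag_seconds: int = 30,  # assumed threshold; tune for your link and workload
) -> List[str]:
    """Return human-readable problems for one SHOW SLAVE STATUS row.

    An empty list means the replica looks healthy.
    """
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when the lag cannot be determined.
        problems.append("lag is unknown (Seconds_Behind_Master is NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(
            f"replica is {int(lag)}s behind (threshold {max_lag_seconds}s)"
        )
    return problems
```

A monitoring job could call this on a schedule and alert whenever the returned list is not empty.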

But as those who have tried will know, having a client (application server) connect to a MySQL instance in a remote datacentre is a whole other matter: latency becomes a big issue and is generally very noticeable on the front end. One solution is to have application servers connect only to their “local” MySQL server.
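The "local only" rule can be sketched in a few lines of Python. Everything named here is hypothetical (the datacentre labels, the hostnames and the function itself); the point is only that the application looks up the server for its own datacentre and fails loudly rather than quietly falling back to a remote one.

```python
from typing import Dict, Mapping

# Hypothetical mapping of datacentre name to its local MySQL endpoint.
MYSQL_BY_DATACENTRE: Dict[str, Dict[str, object]] = {
    "dc-a": {"host": "mysql.dc-a.example.internal", "port": 3306},
    "dc-b": {"host": "mysql.dc-b.example.internal", "port": 3306},
}


def local_mysql_endpoint(
    app_datacentre: str,
    endpoints: Mapping[str, Dict[str, object]] = MYSQL_BY_DATACENTRE,
) -> Dict[str, object]:
    """Return the MySQL endpoint in the application's own datacentre.

    Raises ValueError instead of silently choosing a remote server,
    so a misconfiguration does not turn into cross-datacentre latency.
    """
    try:
        return endpoints[app_datacentre]
    except KeyError:
        raise ValueError(
            f"no local MySQL server configured for {app_datacentre!r}"
        ) from None
```

An application server in `dc-a` would call `local_mysql_endpoint("dc-a")` at startup and pass the result to whatever MySQL client library it uses.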

So the question to you is: do you now have (or have you had in the past) a setup with MySQL masters in different datacentres? What did that setup look like (which additional tools and infrastructure did you use for it), and what were your experiences (good and bad, solutions to issues, etc.)? I’m trying to gather additional expertise that might already be out there, which can help us all. Please add your input! Thanks.

4 thoughts on “Quest for Resilience: Multi-DC Masters”

  1. We have master-master and regular setups across data centers. The master-master setup is only for quick failover. We monitor slave status very closely and have found that, under normal conditions, the setup works just fine, with all the usual caveats and considerations.

    I would not recommend it for high traffic, since it is almost guaranteed that the slave will fall behind by a few seconds. How much counts as “high” will depend on the link between the data centers and the particular application profile.

    Reading from the slave across data centers has never been an issue for us.

    My $.02

  2. We run a geographic setup using circular replication and DRBD combined.

    Each DC has two MySQL master servers using DRBD failover to handle local failure on the nodes. Then the servers in the DC are linked together using circular replication.

    We wrote our own scripts to handle the changeover of slave nodes feeding off their local master, because we had this kind of setup long before tools like MMM became available.

    We have web/application servers in each DC so if one DC goes down then all customer traffic is diverted to the other DC using DNS failover and a low TTL. DNS failover is not perfect but from our testing it works well enough for Web for ~94% of our customers.

  3. Running replication between DCs for a couple of years, I’ve only seen Seconds_Behind_Master exceed 1s when relay log processing stopped (due to an error, etc.).

    If the app is reasonably efficient and its behavior is known and measured, WAN connections shouldn’t be a problem, whether for read-only clients or for read/write (master-master), assuming proper planning and an SLA. But of course that’s theory.

    Example problem: Important site gets an unexpected traffic spike due to a special event. Its WordPress implementation is horribly inefficient (lots of poorly written queries in plugins, templates, etc.) and the primary server is struggling – hurting hundreds of other sites. (of course this happens at the end of a long day, with zero notice)

    Solution: Move connections to the slave, accessed over WAN, abandoning consistency but spreading the load between two servers. This works great, for several days.

    New Problem: Possibly due to DC admin’s throttling, packet loss between the two servers hits ~30% and major problems return for the inefficient site. Simple queries slow, and active connections pile up.

    New Solution: The spike in traffic has subsided, so move the connections back to the primary server, dump from the slave, truncate the tables on both master and slave, and restore to the master.

    Yeah, it’s probably time for some capacity planning and disaster recovery discussions. :)

  4. Hey Arjen,

    At Yahoo!, where this was a pretty common requirement, we came up with a model that used a simple TCP proxy to forward traffic, similar (but slightly more heavy-handed) to using a floating/role IP and IP takeover. Since IP takeover doesn’t work across routers, it can’t be used, but this gets you similar behaviour.
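To make the TCP proxy idea from the last comment more concrete, here is a minimal sketch using Python's `asyncio`. It is not the proxy described in that comment, whose details are not given; the class name, the `switch_backend` method and the port numbers are all assumptions for illustration. Clients connect to a fixed local address, and the proxy forwards bytes to whichever backend is currently configured, which is roughly the role a floating IP would play inside a single network.

```python
import asyncio


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # peer went away; just stop copying
    finally:
        writer.close()


class ForwardingProxy:
    """Forward each incoming TCP connection to the current backend."""

    def __init__(self, backend_host: str, backend_port: int) -> None:
        self.backend = (backend_host, backend_port)

    def switch_backend(self, host: str, port: int) -> None:
        # Only new connections use the new backend; existing ones are untouched.
        self.backend = (host, port)

    async def handle(self, client_reader, client_writer) -> None:
        host, port = self.backend
        try:
            backend_reader, backend_writer = await asyncio.open_connection(host, port)
        except OSError:
            client_writer.close()  # backend unreachable: drop the client
            return
        # Relay traffic in both directions until either side closes.
        await asyncio.gather(
            _pipe(client_reader, backend_writer),
            _pipe(backend_reader, client_writer),
        )


async def serve(proxy: ForwardingProxy, listen_host: str = "127.0.0.1",
                listen_port: int = 3307) -> None:
    """Listen on an assumed local port and forward to the proxy's backend."""
    server = await asyncio.start_server(proxy.handle, listen_host, listen_port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve(ForwardingProxy("db1.example.internal", 3306)))` would start the listener (the hostname is a placeholder), and a failover script could call `switch_backend` to redirect new connections. A production setup would also need health checks and a way to deal with connections still open to the old server, which this sketch leaves out.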


