
MySQL UC tutorials – “Mastering Changes and Upgrades to Mission Critical Systems”

With earlybird registration for the MySQL User Conference now open, I’d like to draw your attention to one tutorial that I think is of particular interest: Mastering Changes and Upgrades to Mission Critical Systems, by Andrew Cowie of Operational Dynamics.
It covers how to plan upgrades (or any other changes) on mission critical systems, which is of course highly relevant for MySQL installations. How do multiple teams interact, and how can you actually test your plans?

Now, I hear you think “I know all that”… but do you? Really? I’ve come to the conclusion that I don’t – that’s my honest answer. I now know there’s much more to it than meets the eye!
There’s people who make it their business to develop techniques for this, and Andrew is a good one. If you manage or are part of a team working with mission critical systems, I’d suggest that going to this tutorial would be time well spent.

Registration is at http://www.mysqluc.com/pub/w/35/register.html – remember, it’s still earlybird time, so you can save a bundle by registering sooner rather than later. I understand that earlybird registrations also receive some free O’Reilly or MySQL books!

The full list of tutorials is at http://www.mysqluc.com/pub/w/35/tutorials.html


4 thoughts on “MySQL UC tutorials – “Mastering Changes and Upgrades to Mission Critical Systems””

  1. An interesting one. It reminds me of an event during the master switch Wikipedia did yesterday. The last step before the switch, to make sure nothing could touch the old master and, via it, the slaves about to switch masters, was:

    mysqladmin shutdown

    Now nothing can be using it, and the assorted back office services that use the master but aren’t covered by our normal control files won’t suffer any pain from losing access to the database for a while.

    Except: there was an exception while it was shutting down. mysqld_safe saw that and brought it back up – not read-only. And those back office tasks which shouldn’t have been writing to it, did.

    Oops. Sometimes a safety net isn’t. :)

  2. That’s a tough one. It’s not a case of “test it first” (duh, really?) or even “do rehearsals” (although that’s really important, and I talk with people a lot about how to practice this sort of thing effectively), since yours was an unexpected failure mode that appeared while you were running the procedure.

    For your particular event I probably would have approached blocking off the back office (and the front-side production platform) a little more rigorously, although that really just boils down to stopping everything around the database before trying to stop the database in the middle, plus a whole lot of verification testing.

    Which of course is tricky if you’re trying to do a transparent upgrade to live production systems.
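
    For what it’s worth, a minimal sketch of that kind of pre-shutdown verification on the old master might look something like this – an illustration only, certainly not a complete procedure:

    -- Refuse writes from ordinary accounts (accounts with the
    -- SUPER privilege can still write, so it is a guard, not a wall).
    SET GLOBAL read_only = 1;
    -- Check that no application connections remain.
    SHOW FULL PROCESSLIST;
    -- Flush and block all tables so nothing is caught mid-write;
    -- only then run: mysqladmin shutdown
    FLUSH TABLES WITH READ LOCK;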

    Sadly, this sort of thing is common. Switching masters is tricky, especially in a master/slave or active-primary/standby-secondary type setup. You switch, and something doesn’t actually move over to the new node, and your testing doesn’t catch it. [Oracle + VCS cluster is renowned for this sort of disaster].

    In fact, this sort of thing is so serious that the real solution to it is to engineer it out of the equation. That usually means multiple actives [not primary/secondary, and yeah, I’m getting away from MySQL terminology here] with the ability to isolate and down one of the actives. You then upgrade the isolated one – and here’s the catch – then point some equally isolated subset of the web and app server tiers at the newly upgraded database. That’s really hard, because it means encoding the ability to select which active database server you’re hitting into your application layer code.

    Once you’ve got that, however, you then have the ability to leave the primary system running, do the upgrade, and test against the new systems all within the production platform. It’s a steep road to climb, but if you do so, you suddenly achieve an unprecedented degree of flexibility, which in turn gives you resiliency against all sorts of change, ranging from system/OS upgrades to database re-installs to creative break-the-mirror type backup solutions. If something goes horribly wrong, like what happened to you yesterday, it’s really not a big deal because a) you’ve spliced away from the production group and b) you can do a great deal of testing without the biggest enemy of all – time pressure – hunting you down.

    AfC
    Sydney

  3. Possible solutions for this in the MySQL context could be:
    – master-master (2 systems) or circular replication (>2 systems). This has always been possible with replication, but additional features in MySQL 5.0 mean the application no longer has to worry about colliding AUTO_INCREMENT ids and such (see the sketch just below). Within limits, the application can now be oblivious to the setup – no special coding required.
    – MySQL Cluster, of course: it’s possible to take down part of the cluster for whatever reason, and also to do rolling upgrades.
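
    To make the 5.0 bit concrete: with two co-masters, each server can be told to hand out non-overlapping AUTO_INCREMENT values via its my.cnf, along these lines (the values here are just an example):

    # In co-master A, my.cnf: A generates ids 1, 3, 5, ...
    [mysqld]
    auto_increment_increment = 2
    auto_increment_offset    = 1
    # In co-master B, my.cnf: B generates ids 2, 4, 6, ...
    [mysqld]
    auto_increment_increment = 2
    auto_increment_offset    = 2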

  4. At the simplest level, I could have changed the old master’s my.cnf file to include read_only before shutting it down. Agreed about the back office, though.
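
    A rough sketch of what that fragment of my.cnf might look like (just the relevant part of the file):

    [mysqld]
    # Reject writes from accounts without the SUPER privilege,
    # even if mysqld_safe brings the server back up unexpectedly.
    read_only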

    The back office systems are likely to be split from the main production systems, in part to reduce the production system complexity.

    Nothing wrong with mentioning non-MySQL approaches. Those are potential feature list items, particularly for sites like Wikipedia which have or should have high availability requirements.

    The main production system, MediaWiki, does have a fair amount of capability when it comes to segmentation. For example, the rollout of version 1.4 happened over several weeks, with increasing percentages of the site moving to it and the production servers running different software versions depending on which part of the site the particular request was for. Normal load sharing also lets me allocate arbitrary amounts of load to arbitrary sets of database slaves, a capability I use routinely to increase cache hit rates by segmenting the load.

    That’s not quite as much of a segmented approach as I think you’re describing, though.

    As Arjen has clearly indicated by the nature of his reply, there are some significant hurdles to doing what you’ve described in a MySQL environment, and MySQL AB is aware of at least some of them.

    Monday wasn’t a horribly wrong day, though, compared to some. Any operation which refuses to die within seconds (not tens of seconds) when killed can really ruin your day. It’s been the source of most of my emergency master switches, and it presents an ongoing denial-of-service attack risk if any significant percentage of operations become undead when killed. Fortunately, that risk hasn’t materialised yet. It’s something MySQL AB has been working on, and continues to work on, in a variety of ways.
