Posted on August 29, 2009 by toby — 5 Comments

Will your production MySQL server survive a restart?

Do you know if your production MySQL servers will come back up when restarted? A recent support episode illustrates a number of best practices. The task looked trivial: Update a production MySQL server (replication master) with a configuration tuned and tested on a development server. Clean shutdown, change configuration, restart. Unfortunately, the MySQL daemon did not just ‘come back’, leaving 2 sites offline. Thus begins an illuminating debugging story.

First place to look is the daemon error log, which revealed that the server was segfaulting, seemingly at the end of or just after InnoDB recovery. Reverting to the previous configuration did not help, nor did changing the InnoDB recovery mode. Working with the client, we performed a failover to a replication slave, while I got a second opinion from a fellow engineer to work out what had gone wrong on the server.

Since debug symbols weren’t shown in the stack trace, we needed to generate a symbol file (binary was unstripped) to use with the resolve_stack_dump utility. The procedure for obtaining this is detailed in the MySQL manual. With a good stack trace in hand, we were able (with assistance from an old friend, thanks Dean!) to narrow the crash down to bug 38856 (also see 37027). A little further investigation showed that the right conditions did exist to trigger this bug:

expire_logs_days = 14 # had been set in the my.cnf
the binlog.index file did not match the actual state of log files (i.e. some had been manually deleted, or deleted by a script)

So with this knowledge, it was possible to bring the MySQL server back up. It turned out that the expire_logs_days had perhaps been added to the configuration but not tested at the time (the server had not been restarted for 3 months). This had placed the system in a state, unbeknownst to the administrators, where it would not come back up after a restart. It was an interesting (if a tad stressful) incident as it shows the reasons for many best practices – which most of us know and follow – but worth re-capping here.

even seemingly trivial maintenance can potentially trigger downtime
plan any production maintenance in the quiet zone, and be sure to allow enough time to deal with the unforeseen
don’t assume your live server will ‘just restart’
put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)
do not make un-tested changes to config, test immediately, preferably on dev or staging system
be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave
replication alone is NOT a backup
don’t remove binlogs or otherwise touch anything in data dir behind mysql’s back
have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)
use a trusted origin for your binary packages, just building and passing the basis test-suite is not always sufficient
know how to get a good stack trace with symbols, to help find bug reports
be familiar with bugs.mysql.com, but it still helps to ask others as they might have just seen something similar and can help you quickly find what you’re looking for!
and last but very important: it really pays to find the root cause to a problem (and prevention requires it!), so a “postmortum” on a dead server is very important… if we had just wiped that server, the problem might have reoccurred with another server later.

Posted on August 29, 2009 by toby — 5 Comments

5 thoughts on “Will your production MySQL server survive a restart?”

Sheeri K. Cabral
August 29, 2009

Some gotchas that should be pointed out:

* put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)

Make sure that any changes to the my.cnf are checked in, so they are not written over accidentally.

Also, make sure that any parameters set dynamically are tested, and after they are tested, are put in the configuration file. (Pythian has an automated daily check that reports on differences in the configuration file(s) and the running server.)

* do not make un-tested changes to config, test immediately, preferably on dev or staging system

While this is a good point, would this have saved the problem in this case? This is the sort of issue that would probably not be caught on dev/staging because the binary logs probably weren’t deleted by hand there.

Note that expire-logs-days is a *dynamic* parameter, so had the dynamic parameter been set (not just the /etc/my.cnf parameter), this problem would have been spotted much sooner — and it would have been remembered that expire-logs-days was changed the previous day. “use dynamic parameters” is a corollary to this one, then.

* be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave

Using master/slave instead of master/master only adds 4 commands: stop slave; reset slave; on the old slave that is promoted to master, and change master to…. ; start slave; on the old master that is demoted to the new slave.

So take all factors into consideration with master/master replication, and then make an informed decision. Personally I’m OK with having a master/master setup where there are writes to a “primary” and reads from a “Secondary” — so the application treats it like master/slave — but I also know many who aren’t comfortable with that, and life isn’t that much harder. More important than “use master/master instead of master/slave” is the “be ready to failover” and I’ll add: Practice often. Every few months, which is usually when most organizations should upgrade or have a need for offline maintenance anyway…

* have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)

I strongly disagree with this. The idea behind it is “don’t do anything without being transparent.” Everyone should know what is done — at Pythian we have complete transparency, so that we can pick up where a co-worker left off, and the client also gets to see what is done — many clients like to learn what we do, so they can do the same thing next time.

Having only 1 admin is a bad idea. Having a good *process* of documenting what maintenance is performed (such as deleting binary logs) and explaining why changes are made is a good thing. Otherwise your 1 admin could be the person that deletes binary logs by hand, and a 2nd admin never knows that it happened.

Better is to communicate — we have had clients let us know they deleted binary logs by hand, and we were able to go in immediately and fix things, before it became a problem.

Other than that, your list is excellent!
toby
August 30, 2009

Sheeri, I like your point that with sufficiently healthy process/communication/documentation then anyone on the team can take over. That is an ideal situation. But having 2 or more people administering a server, without good communication, can be disastrous – I think we’ve all seen that. Thanks for detailed response!
PaulM
September 1, 2009

Hi Toby,

So you passed the expire_log_days bug Boss as well and can now progress to the next level.
Welcome to the club. That MySQL can crash dump with nothing in the stack trace on a trivial error is disappointing, it makes it seem really serious, when all it is a file out of sync with the OS file/directory.

If the site availability was that serious, the current configuration sounds under-done. Doing a rolling update with the app on the other master (or promoted master) would have allowed the app to remain up without getting nailed with the loss of a single instance.
This event should be used to push for periodic failovers to test the robustness of the architecture, almost like redundancy audit.

Nice idea with change control, a secure external location to the server is better than local though, so losing the server doesn’t mean the loss of all config.

Have Fun
Paul
1. arjen
  September 2, 2009
  
  Paul, yep this is an existing system and a “work in progress”. Dual master is coming but not yet in place.
  Dual master would have prevented the issue, because it would indeed have been done as a rolling upgrade and not affected the frontend operations.
toby
September 2, 2009

Hi Paul, Love the Boss reference – very appropriate and funny!

Re: change control, yes this is something that I put in practice on another team project I work on; having switched all my personal work to Subversion 5+ years ago I cannot imagine doing without some form of VCS.

From the MySQL point of view it gives us peace of mind to have versioned schema and configuration (actually all our server config from Postfix to PHP to DNS zone files is versioned) and I would recommend it to anyone, even one-man shops. In that project’s case, the repository is not on the production server, and is also backed up to multiple offsite locations.

Comments are closed.

Share this:

Related

5 thoughts on “Will your production MySQL server survive a restart?”