Will your production MySQL server survive a restart?

Do you know if your production MySQL servers will come back up when restarted? A recent support episode illustrates a number of best practices. The task looked trivial: Update a production MySQL server (replication master) with a configuration tuned and tested on a development server. Clean shutdown, change configuration, restart. Unfortunately, the MySQL daemon did not just ‘come back’, leaving 2 sites offline. Thus begins an illuminating debugging story.
First place to look is the daemon error log, which revealed that the server was segfaulting, seemingly at the end of or just after InnoDB recovery. Reverting to the previous configuration did not help, nor did changing the InnoDB recovery mode. Working with the client, we performed a failover to a replication slave, while I got a second opinion from a fellow engineer to work out what had gone wrong on the server.
Since debug symbols weren’t shown in the stack trace, we needed to generate a symbol file (binary was unstripped) to use with the resolve_stack_dump utility. The procedure for obtaining this is detailed in the MySQL manual. With a good stack trace in hand, we were able (with assistance from an old friend, thanks Dean!) to narrow the crash down to bug 38856 (also see 37027). A little further investigation showed that the right conditions did exist to trigger this bug:
  • expire_logs_days = 14 # had been set in the my.cnf
  • the binlog.index file did not match the actual state of log files (i.e. some had been manually deleted, or deleted by a script)
So with this knowledge, it was possible to bring the MySQL server back up. It turned out that the expire_logs_days had perhaps been added to the configuration but not tested at the time (the server had not been restarted for 3 months). This had placed the system in a state, unbeknownst to the administrators, where it would not come back up after a restart. It was an interesting (if a tad stressful) incident as it shows the reasons for many best practices – which most of us know and follow – but worth re-capping here.
  • even seemingly trivial maintenance can potentially trigger downtime
  • plan any production maintenance in the quiet zone, and be sure to allow enough time to deal with the unforeseen
  • don’t assume your live server will ‘just restart’
  • put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)
  • do not make un-tested changes to config, test immediately, preferably on dev or staging system
  • be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave
  • replication alone is NOT a backup
  • don’t remove binlogs or otherwise touch anything in data dir behind mysql’s back
  • have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)
  • use a trusted origin for your binary packages, just building and passing the basis test-suite is not always sufficient
  • know how to get a good stack trace with symbols, to help find bug reports
  • be familiar with bugs.mysql.com, but it still helps to ask others as they might have just seen something similar and can help you quickly find what you’re looking for!
  • and last but very important: it really pays to find the root cause to a problem (and prevention requires it!), so a “postmortum” on a dead server is very important… if we had just wiped that server, the problem might have reoccurred with another server later.

5 Responses to “Will your production MySQL server survive a restart?”

  1. Some gotchas that should be pointed out:

    * put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)

    Make sure that any changes to the my.cnf are checked in, so they are not written over accidentally.

    Also, make sure that any parameters set dynamically are tested, and after they are tested, are put in the configuration file. (Pythian has an automated daily check that reports on differences in the configuration file(s) and the running server.)

    * do not make un-tested changes to config, test immediately, preferably on dev or staging system

    While this is a good point, would this have saved the problem in this case? This is the sort of issue that would probably not be caught on dev/staging because the binary logs probably weren’t deleted by hand there.

    Note that expire-logs-days is a *dynamic* parameter, so had the dynamic parameter been set (not just the /etc/my.cnf parameter), this problem would have been spotted much sooner — and it would have been remembered that expire-logs-days was changed the previous day. “use dynamic parameters” is a corollary to this one, then.

    * be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave

    Using master/slave instead of master/master only adds 4 commands: stop slave; reset slave; on the old slave that is promoted to master, and change master to…. ; start slave; on the old master that is demoted to the new slave.

    So take all factors into consideration with master/master replication, and then make an informed decision. Personally I’m OK with having a master/master setup where there are writes to a “primary” and reads from a “Secondary” — so the application treats it like master/slave — but I also know many who aren’t comfortable with that, and life isn’t that much harder. More important than “use master/master instead of master/slave” is the “be ready to failover” and I’ll add: Practice often. Every few months, which is usually when most organizations should upgrade or have a need for offline maintenance anyway…

    * have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)

    I strongly disagree with this. The idea behind it is “don’t do anything without being transparent.” Everyone should know what is done — at Pythian we have complete transparency, so that we can pick up where a co-worker left off, and the client also gets to see what is done — many clients like to learn what we do, so they can do the same thing next time.

    Having only 1 admin is a bad idea. Having a good *process* of documenting what maintenance is performed (such as deleting binary logs) and explaining why changes are made is a good thing. Otherwise your 1 admin could be the person that deletes binary logs by hand, and a 2nd admin never knows that it happened.

    Better is to communicate — we have had clients let us know they deleted binary logs by hand, and we were able to go in immediately and fix things, before it became a problem.

    Other than that, your list is excellent!

  2. Sheeri, I like your point that with sufficiently healthy process/communication/documentation then anyone on the team can take over. That is an ideal situation. But having 2 or more people administering a server, without good communication, can be disastrous – I think we’ve all seen that. Thanks for detailed response!

  3. Hi Toby,

    So you passed the expire_log_days bug Boss as well and can now progress to the next level.
    Welcome to the club. That MySQL can crash dump with nothing in the stack trace on a trivial error is disappointing, it makes it seem really serious, when all it is a file out of sync with the OS file/directory.

    If the site availability was that serious, the current configuration sounds under-done. Doing a rolling update with the app on the other master (or promoted master) would have allowed the app to remain up without getting nailed with the loss of a single instance.
    This event should be used to push for periodic failovers to test the robustness of the architecture, almost like redundancy audit.

    Nice idea with change control, a secure external location to the server is better than local though, so losing the server doesn’t mean the loss of all config.

    Have Fun
    Paul

  4. Paul, yep this is an existing system and a “work in progress”. Dual master is coming but not yet in place.
    Dual master would have prevented the issue, because it would indeed have been done as a rolling upgrade and not affected the frontend operations.

  5. Hi Paul, Love the Boss reference – very appropriate and funny!

    Re: change control, yes this is something that I put in practice on another team project I work on; having switched all my personal work to Subversion 5+ years ago I cannot imagine doing without some form of VCS.

    From the MySQL point of view it gives us peace of mind to have versioned schema and configuration (actually all our server config from Postfix to PHP to DNS zone files is versioned) and I would recommend it to anyone, even one-man shops. In that project’s case, the repository is not on the production server, and is also backed up to multiple offsite locations.

Leave a Comment