Do you know if your production MySQL servers will come back up when restarted? A recent support episode illustrates a number of best practices. The task looked trivial: Update a production MySQL server (replication master) with a configuration tuned and tested on a development server. Clean shutdown, change configuration, restart. Unfortunately, the MySQL daemon did not just ‘come back’, leaving 2 sites offline. Thus begins an illuminating debugging story.
First place to look is the daemon error log, which revealed that the server was segfaulting, seemingly at the end of or just after InnoDB recovery. Reverting to the previous configuration did not help, nor did changing the InnoDB recovery mode. Working with the client, we performed a failover to a replication slave, while I got a second opinion from a fellow engineer to work out what had gone wrong on the server.
Since debug symbols weren’t shown in the stack trace, we needed to generate a symbol file (binary was unstripped) to use with the resolve_stack_dump utility. The procedure for obtaining this is detailed in the MySQL manual. With a good stack trace in hand, we were able (with assistance from an old friend, thanks Dean!) to narrow the crash down to bug 38856 (also see 37027). A little further investigation showed that the right conditions did exist to trigger this bug:
- expire_logs_days = 14 # had been set in the my.cnf
- the binlog.index file did not match the actual state of log files (i.e. some had been manually deleted, or deleted by a script)
So with this knowledge, it was possible to bring the MySQL server back up. It turned out that the expire_logs_days had perhaps been added to the configuration but not tested at the time (the server had not been restarted for 3 months). This had placed the system in a state, unbeknownst to the administrators, where it would not come back up after a restart. It was an interesting (if a tad stressful) incident as it shows the reasons for many best practices – which most of us know and follow – but worth re-capping here.
- even seemingly trivial maintenance can potentially trigger downtime
- plan any production maintenance in the quiet zone, and be sure to allow enough time to deal with the unforeseen
- don’t assume your live server will ‘just restart’
- put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)
- do not make un-tested changes to config, test immediately, preferably on dev or staging system
- be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave
- replication alone is NOT a backup
- don’t remove binlogs or otherwise touch anything in data dir behind mysql’s back
- have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)
- use a trusted origin for your binary packages, just building and passing the basis test-suite is not always sufficient
- know how to get a good stack trace with symbols, to help find bug reports
- be familiar with bugs.mysql.com, but it still helps to ask others as they might have just seen something similar and can help you quickly find what you’re looking for!
- and last but very important: it really pays to find the root cause to a problem (and prevention requires it!), so a “postmortum” on a dead server is very important… if we had just wiped that server, the problem might have reoccurred with another server later.