Posted on

innodb_flush_logs_on_trx_commit and Galera Cluster

We deploy Galera Cluster (in MariaDB) for some clients, and innodb_flush_logs_on_trx_commit is one of the settings we’ve been playing with. The options according to the manual:

  • =0 don’t write or flush at commit, write and flush once per second
  • =1 write and flush at trx commit
  • =2 write log, but only flush once per second

The flush (fsync) refers to the mechanism the filesystem uses to try and guarantee that written data is actually on the physical medium/device and not just in a buffer (of course cached RAID controllers, SANs and other devices use some different logic there, but it’s definitely written beyond the OS space).

In a non-cluster setup, you’d always want it to be =1 in order to be ACID compliant and that’s also InnoDB’s default. So far so good. For cluster setups, you could be more lenient with this as you require ACID on the cluster as a whole, not each individual machine – after all, if one machine drops out at any point, you don’t lose any data.

Codership docu recommended =2, so that’s what Open Query engineer Peter Lock initially used for some tests that he was conducting. However, performance wasn’t particularly shiny – actually not much higher than =1. That in itself is interesting, because typically we regard the # of fsyncs/second a storage system can deal with as a key indicator of performance capacity. That is, as our HD Latency tool shows when you run it on a storage device (even your local laptop harddisk), the most prominent aspect of what limits the # of writes you can do per second appears to be the fsyncs.

I then happened to chat with Oli Sennhauser (former colleague from MySQL AB) who now runs the FromDual MySQL/MariaDB consulting firm in Switzerland, and he’s been working with Galera for quite a long time. He recognised the pattern and said that he too had that experience, and he thought =0 might be the better option.

I delved into the InnoDB source code to see what was actually happening, and the code indeed concurs with what’s described in the manual (that hasn’t always been the case ;-). I also verified this with Jeremy Cole whom we may happily regard as guru on “how InnoDB actually works”. The once-per-second flush (and optional preceding write) is performed by the InnoDB master thread. Take a peek in log/log0log.c and trx/trx0trx.c, specifically trx_commit_off_kernel() and srv_sync_log_buffer_in_background().

In conclusion:

  1. Even with =0, the log does get written and flushed once per second. This is done in the background so connection threads don’t have to wait for it.
  2. There is no setting where there is never a flush/fsync.
  3. With =2, the writing of the log takes place in the connection thread and this appears to incur a significant overhead, at least relative to =0. Aside from the writing of the log at transaction commit, there doesn’t appear to be a difference.
  4. Based on the preceding points, I would say that if you don’t want =1, you might as well set =0 in order to get the performance you’re after. There is of course a slight practical difference between =0 and =2. With =2 the log is immediately written. If the mysqld process were to crash within a second after that, the OS would close the file and have that log write stored. With =0 that log data wouldn’t have been written. If the OS or machine fails, that log write is lost either way.

In production environments, we tend to mainly want to mitigate trouble from system failures, so =0 appears to be a suitable/appropriate option – for a Galera cluster environment.

What remains is the question of why the log write operation appears to reduce transaction commit performance so much, in a way more so than the flush/fsync. Something to investigate further!
Your thoughts?

Posted on
Posted on 2 Comments

HDlatency – now with quick option

I’ve done a minor update to the hdlatency tool (get it from Launchpad), it now has a –quick option to have it only do its tests with 16KB blocks rather than a whole range of sizes. This is much quicker, and 16KB is the InnoDB page size so it’s the most relevant for MySQL/MariaDB deployments.

However, I didn’t just remove the other stuff, because it can be very helpful in tracking down problems and putting misconceptions to rest. On SANs (and local RAID of course) you have things like block sizes and stripe sizes, and opinions on what might be faster. Interestingly, the real world doesn’t always agree with the opinions.

We Mark Callaghan correctly pointed out when I first published it, hdlatency does not provide anything new in terms of functionality, the db IO tests of sysbench cover it all. A key advantage of hdlatency is that it doesn’t have any dependencies, it’s a small single piece of C code that’ll compile on or can run on very minimalistic environments. We often don’t control what the base environment we have to work on is, so that’s why hdlatency was initially written. It’s just a quick little tool that does the job.

We find hdlatency particularly useful for comparing environments, primarily at the same client. For instance, the client might consider moving from one storage solution to another – well, in that case it’s useful to know whether we can expect an actual performance benefit.

The burst data rate (big sequential read or write) which often gets quoted for a SAN or even an individual disk is of little interest to database use, since its key performance bottleneck lies in random access I/O. The disk head(s) will need to move. So it’s important to get some real relevant numbers, rather than just go with magic vendor numbers that are not really relevant to you. Also, you can have a fast storage system attached via a slow interface, and consequentially the performance then will not be at all what you’d want to see. It can be quite bad.

To get an absolute baseline on what are sane numbers, run hdlatency also on a local desktop HD. This may seem odd, but you might well encounter storage systems that show a lower performance than that. ‘nuf said.

If you’re willing to share, I’d be quite interested in seeing some (–quick) output data from you – just make sure you tell what storage it is: type of interface, etc. Simply drop it in a comment to this post, so it can benefit more people. thanks

Posted on 2 Comments
Posted on 5 Comments

Measuring HD latency in ways relevant to MySQL

As I described yesterday, Open Query is doing some tests on SSDs and other devices pretending to be harddisks (SANs, battery-backed RAID controllers, etc). To aid this, I wrote a small tool to test the different kind of I/O operations MySQL would/could do, which is not quite the same as what other general purpose apps would do, and also not what other test tools measure. For instance, it tries Direct I/O as well as fsync() after each write, and also it a range of different I/O block sizes.

In a nutshell, it’s aimed to do what MySQL does, without MySQL! Testing lots of different setups for this particular purpose (even with fantastic tools like MySQL Sandbox) is a complete pest, and changing InnoDB page size requires a recompile. While Percona has tried a larger page size in the past and decided it wasn’t worth it (the default is 16K), I thought it worthwhile to include such a test as the situation may change over time with different devices.

So, this is a little tool for a very specific purpose, and it should not grow beyond that – but do feel free to abuse it for whatever other purpose you reckon fits a similar approach. Oh, and it outputs CSV for easy graphing. To grab the code, go to the hdlatency project on Launchpad. It’s plain C, and GPLv3 licensed.

Posted on 5 Comments