Yesterday was Melbourne Cup day in Australia – the biggest annual horse race event in the country, and in the state of Victoria it’s even a public holiday. Open Query does work for Luxbet (part of Tabcorp), and Melbourne Cup day is by far their biggest day of the year in terms of traffic. It’s not just a big spike: the difference is orders of magnitude, so you can really say that the rest of the year is downright quiet (in relative terms). So, a very interesting load pattern.
Since last year Luxbet has upgraded from stock MySQL to MariaDB, and with our input made some other infrastructure modifications, including moving to a pure solid state storage (FusionIO) solution, as a SAN just won’t deliver the resilience and performance required. This may seem odd, but remember that a) a SAN is also a single point of failure (so when the SAN fails, multiple db servers will be “out” – not desirable even though a failover to another datacenter is possible), and b) MariaDB/XtraDB (InnoDB) already have all recent data and indexes in RAM, so whatever I/O is required won’t benefit from a SAN cache. Thus, the SAN will have to actually do a physical disk seek and read to get what is needed, and we all know seeks are slow. A write or fsync also incurs some latency, regardless of the storage array speed. So those are the reasons for the local storage solution.
While there are aspects of RAID and other redundancy in that setup, the main resilience in the infrastructure comes from having more machines, rather than necessarily having more redundancy in each machine. Grant is working on a more comprehensive version of this story.
dm-cache (albeit still classified “experimental”) is in the just released Linux 3.9 kernel. It deals with generic block devices and uses the device mapper framework. While there have been a few other similar tools flying around, since this one has been adopted into the kernel it looks like this will be the one that you’ll be seeing the most in the future. It saves sysadmins the hassle of compiling extra stuff for a system.
A typical use is for an SSD to cache a HDD. Similar to a battery backed RAID controller, the objective is to insulate the application from latency caused by the mechanical device, the most laggy part of which is seek time (measured in milliseconds). Given the relatively high storage capacity of an SSD (in the hundreds of GBs), this allows you to mostly disregard the mechanical latency for writes and that’s very useful for database systems such as MariaDB.
That covers writes (for the moment), but what about reads? Can MariaDB benefit from the read-caching? For the MyISAM storage engine, yes (as it relies on filesystem caching for speeding up row data access). For InnoDB, much less so. But let’s explore this, because it’s not quite a yes/no story – it depends.
For a typical system with a correctly dimensioned InnoDB buffer pool, most of the active dataset will reside in RAM. For a system using a cached RAID controller, that means the data needed by an actual disk read is unlikely to be in the controller’s cache. With an SSD cache you might get lucky as it’s bigger – so stuff that has been read or written in some recent past may still be there. What we have found from testing with hdlatency (on actual client/hosting infra) is that SANs typically don’t have enough cache to pull that off – they too may have SSD caches now, but remember they get accessed by many more users with different data needs as well.
The result of SSD filesystem caching for reads is actually similar to InnoDB tweaks that implement a secondary buffer pool on SSD storage: it creates a relatively large and cheap space for “lukewarm” pages (ones that haven’t been recently accessed).
So why does it depend? Because your active dataset might be too large, and/or your combined reads/writes are still more than the physical disks can handle. It’s very important to consider the latter: write caching insulates you from the seeks and allows an intermediate layer to re-order writes to optimise the head movement, but the writes still need to be done and thus ultimately you remain bound by an upper end physical limit. Insulation is not complete separation. If your active dataset is larger than RAM+SSD, then the reads also need to be taken into account for seek capacity.
So right now you could say that at decent prices, if your active dataset is in the range of a few hundred GB to even a few TB, RAM with the optional addition of SSD caching can all work out nicely – what can still make it go sour is the rate of writes. Conclusion: this type of setup provides you with more headroom than a battery backed RAID controller, should you need that.
Separating reporting to distinct database servers (typically slaves, configured for relatively few connections and large queries) actually still helps quite a bit as it really changes what’s in the buffer pool and other caches. Or, differently put, looking at the access patterns of the different parts of your application is important – there are numerous variations on this basic pattern. It’s a form of functional sharding.
You’ll have noticed I didn’t mention any benchmarks when discussing all this (and most other topics). Many if not most benchmarks have artificial aspects to them, which makes them problematic when dealing with the real world.
As shown above, applying background knowledge of the systems and structures, logic, and maths gets you a very long way (either independently or in consultation with us). It can get you through important decision processes quicker. Testing can still play an important part, but then it’s either part of or very close to your real world environment, not a lab activity. It will be specific to you. Don’t get trapped having to deliver on numbers from benchmarks.
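To make that kind of reasoning a bit more concrete, here is a minimal Python sketch of the back-of-envelope check described above: does the active dataset fit in the buffer pool, or in buffer pool plus SSD cache, and is the load that ultimately lands on the spinning disks within what they can sustain? The names (`Workload`, `Box`, `assess`) and every figure in the example are hypothetical, made up purely for illustration; plug in numbers observed on your own system.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    active_dataset_gb: float   # size of the data that is actually being touched
    write_iops: float          # sustained writes that must eventually reach the disks
    uncached_read_iops: float  # reads that miss RAM and SSD when the dataset does not fit


class Box(NamedTuple):
    buffer_pool_gb: float      # RAM available to the InnoDB buffer pool
    ssd_cache_gb: float        # size of the SSD cache (0 if there is none)
    disk_iops: float           # what the mechanical disks can sustain (assumed figure)


class Assessment(NamedTuple):
    tier: str                  # "ram", "ram+ssd" or "disk"
    disk_load_iops: float
    within_disk_limit: bool


def assess(workload: Workload, box: Box) -> Assessment:
    """Rough headroom check; a sketch of the reasoning, not a sizing tool."""
    if workload.active_dataset_gb <= box.buffer_pool_gb:
        tier = "ram"
    elif workload.active_dataset_gb <= box.buffer_pool_gb + box.ssd_cache_gb:
        tier = "ram+ssd"
    else:
        tier = "disk"

    # Write caching re-orders and delays writes, but they still have to be done.
    disk_load = workload.write_iops
    # Only when the active dataset exceeds RAM+SSD do reads also need seek capacity.
    if tier == "disk":
        disk_load += workload.uncached_read_iops

    return Assessment(tier, disk_load, disk_load <= box.disk_iops)


if __name__ == "__main__":
    box = Box(buffer_pool_gb=128, ssd_cache_gb=400, disk_iops=600)  # hypothetical box
    print(assess(Workload(300, 400, 200), box))   # dataset fits in RAM+SSD
    print(assess(Workload(1000, 400, 300), box))  # dataset spills onto the disks
```

Note that the write rate is checked in every case: insulation is not complete separation, so even a dataset that sits entirely in RAM can go sour on writes alone.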
As I described yesterday, Open Query is doing some tests on SSDs and other devices pretending to be harddisks (SANs, battery-backed RAID controllers, etc). To aid this, I wrote a small tool to test the different kinds of I/O operations MySQL would/could do, which is not quite the same as what other general purpose apps would do, and also not what other test tools measure. For instance, it tries Direct I/O as well as fsync() after each write, and it also tries a range of different I/O block sizes.
In a nutshell, it aims to do what MySQL does, without MySQL! Testing lots of different setups for this particular purpose (even with fantastic tools like MySQL Sandbox) is a complete pest, and changing InnoDB page size requires a recompile. While Percona has tried a larger page size in the past and decided it wasn’t worth it (the default is 16K), I thought it worthwhile to include such a test as the situation may change over time with different devices.
So, this is a little tool for a very specific purpose, and it should not grow beyond that – but do feel free to abuse it for whatever other purpose you reckon fits a similar approach. Oh, and it outputs CSV for easy graphing. To grab the code, go to the hdlatency project on Launchpad. It’s plain C, and GPLv3 licensed.
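For a feel of what such a measurement looks like, here is a minimal Python sketch of one of the ideas described above: time a write followed by fsync() for a range of block sizes, and emit the results as CSV. This is not hdlatency’s code (that is plain C and lives on Launchpad); the function names and the column layout are my own assumptions, the writes are simply sequential from the start of the file, and the Direct I/O variant is left out because it needs aligned buffers.

```python
import csv
import io
import os
import tempfile
import time


def time_synced_writes(path, block_sizes, writes_per_size=20):
    """Return (block_size_bytes, latency_us) rows for write+fsync pairs."""
    rows = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for size in block_sizes:
            buf = os.urandom(size)
            os.lseek(fd, 0, os.SEEK_SET)  # start each block size at offset 0
            for _ in range(writes_per_size):
                start = time.perf_counter()
                view = memoryview(buf)
                while len(view) > 0:      # cope with short writes
                    written = os.write(fd, view)
                    view = view[written:]
                os.fsync(fd)              # force the data out to the device
                elapsed = time.perf_counter() - start
                rows.append((size, round(elapsed * 1e6, 1)))
    finally:
        os.close(fd)
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line, for easy graphing."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["block_size_bytes", "latency_us"])
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # The 16K entry matches the default InnoDB page size mentioned above;
    # the other sizes are arbitrary choices for the sake of the example.
    sizes = [4096, 8192, 16384, 32768, 65536]
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "latency-test.dat")
        print(to_csv(time_synced_writes(target, sizes)), end="")
```

To test a particular device, point `path` at a file on a filesystem that lives on that device rather than at a temporary directory, and keep the box otherwise idle while it runs.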
Open Query too is exploring utilising SSDs in a MySQL infrastructure, but we wouldn’t be us if we didn’t also try some alternative perspective on it. Right now we’re running some comparative tests against various spinning HD setups in the same box, using the same controller, so we’re looking for differences rather than absolute speed. The results so far are interesting, but the selection of SSDs we have available is limited (never enough toys!).
So, a request: if you have an SSD, it’d be great if we could run our test tool on it for a bit. It won’t take long, but naturally the box shouldn’t be used for something else while the test is running. We can either log in remotely, or exchange code and results over email. Simply contact us through our site’s contact form, and we’ll sort things out! Thanks.
If you work for a vendor and would like to have your gear put through a bit of real world stress, please let us know also. Our reference architecture will definitely contain brand/model information as the performance and other aspects of SSDs vary widely.