Posted on 5 Comments

Storage caching options in Linux 3.9 kernel

dm-cache is (albeit still classified “experimental”) is in the just released Linux 3.9 kernel. It deals with generic block devices and uses the device mapper framework. While there have been a few other similar tools flying around, since this one has been adopted into the kernel it looks like this will be the one that you’ll be seeing the most in to the future. It saves sysadmins the hassle of compiling extra stuff for a system.

A typical use is for an SSD to cache a HDD. Similar to a battery backed RAID controller, the objective is to insulate the application from latency caused by the mechanical device, the most laggy part of which is seek time (measured in milliseconds). Giventhe  relatively high storage capacity of an SSD (in the hundreds of GBs), this allows you to mostly disregard the mechanical latency for writes and that’s very useful for database systems such as MariaDB.

That covers writes (for the moment), but what about reads? Can MariaDB benefit from the read-caching? For the MyISAM storage engine, yes (as it relies on filesystem caching for speeding up row data access). For InnoDB, much less so. But let’s explore this, because it’s not quite a yes/no story – it depends. For typical systems with a correctly dimensioned system and InnoDB buffer pool, most of the active dataset will reside in RAM. For a system using a cached RAID controller that means that an actual disk read is not likely to be in the cache. With an SSD cache you might get lucky as it’s bigger – so stuff that has been read or written in some recent past may still be there. What we have found from testing with hdlatency (on actual client/hosting infra) is that SANs typically don’t have enough cache to pull that off – they too may have SSD caches now, but remember they get accessed by many more users with different data needs as well. The result of SSD filesystem caching for reads is actually similar to InnoDB tweaks that implement a secondary buffer pool on SSD storage, it creates a relatively large and cheap space for “lukewarm” pages (ones that haven’t been recently accessed).

So why does it depend? Because your active dataset might be too large, and/or your combined reads/writes are still more than the physical disks can handle. It’s very important to consider the latter: write caching insulates you from the seeks and allows an intermediate layer to re-order writes to optimise the head movement, but the writes still need to be done and thus ultimately you remain bound by an upper end physical limit. Insulation is not complete separation. If your active dataset is larger than RAM+SSD, then the reads also also need to be taken into account for seek capacity.

So right now you could say that at decent prices, if your active dataset is in the range of a few hundred GB to even a few TB, RAM with the optional addition of SSD caching can all work out nicely – what can still make it go sour is the rate of writes. Conclusion: this type of setup provides you with more headroom than a battery backed RAID controller, should you need that.

Separating reporting to distinct database servers (typically slaves, configured for relatively few connections and large queries) actually still helps quite a bit as it really changes what’s in the buffer pool and other caches. Or, differently put, looking at the access patterns of the different parts of your application is important – there are numerous variation on this basic pattern. It’s a form of functional sharding.

You’ll have noticed I didn’t mention any benchmarks when discussing all this (and most other topics). Many if not most benchmarks have artificial aspects to them, which makes them problematic when dealing with the real world. As shown above, applying background knowledge of the systems and structures, logic, and maths gets you a very long way (either independently or in consultation with us). It can get you through important decision processes quicker. Testing can still play an important part, but then it’s either part of or very close to your real world environment, not a lab activity. It will be specific to you. Don’t get trapped having to deliver on numbers from benchmarks.

Posted on 5 Comments

5 thoughts on “Storage caching options in Linux 3.9 kernel

  1. Hi Arjen,

    Is there any version of hdlatency available to get downloaded?

    Thanks,

  2. Do you know of any instructions for using the new 3.9 dm-cache feature? I am unable to find anything. There is another project calling itself dm-cache, but I’m not sure if it’s the same one, as it makes no mention of being included in the mainline kernel or what is different about using it if it is.

    Thanks!

    1. Good point. Here it is:

      https://lwn.net/Articles/540996/ (Documentation/device-mapper/cache.txt [dm-cache])

      See the examples near the bottom of the article.

  3. Thanks for the link. Is there any special utilities that need to be installed in order to get this to work? I tried the example command after upgrading to 3.9, but it didn’t like it.

Comments are closed.