Posted on

LKML: Live patching for 3.20

Building on the original kSplice idea and combining the efforts of the work done at Red Hat and SuSE, common infrastructure is now ready to be put into the Linux 3.20 mainline kernel – Red Hat and SuSE have already committed to using this.

I still reckon it’s freaky trickery, but heck – it works, and it’s great for server environments that have no redundancy (I prefer to fix that issue!) and can’t afford any downtime.

Posted on

Storage caching options in Linux 3.9 kernel

dm-cache is (albeit still classified “experimental”) is in the just released Linux 3.9 kernel. It deals with generic block devices and uses the device mapper framework. While there have been a few other similar tools flying around, since this one has been adopted into the kernel it looks like this will be the one that you’ll be seeing the most in to the future. It saves sysadmins the hassle of compiling extra stuff for a system.

A typical use is for an SSD to cache a HDD. Similar to a battery backed RAID controller, the objective is to insulate the application from latency caused by the mechanical device, the most laggy part of which is seek time (measured in milliseconds). Giventhe  relatively high storage capacity of an SSD (in the hundreds of GBs), this allows you to mostly disregard the mechanical latency for writes and that’s very useful for database systems such as MariaDB.

That covers writes (for the moment), but what about reads? Can MariaDB benefit from the read-caching? For the MyISAM storage engine, yes (as it relies on filesystem caching for speeding up row data access). For InnoDB, much less so. But let’s explore this, because it’s not quite a yes/no story – it depends. For typical systems with a correctly dimensioned system and InnoDB buffer pool, most of the active dataset will reside in RAM. For a system using a cached RAID controller that means that an actual disk read is not likely to be in the cache. With an SSD cache you might get lucky as it’s bigger – so stuff that has been read or written in some recent past may still be there. What we have found from testing with hdlatency (on actual client/hosting infra) is that SANs typically don’t have enough cache to pull that off – they too may have SSD caches now, but remember they get accessed by many more users with different data needs as well. The result of SSD filesystem caching for reads is actually similar to InnoDB tweaks that implement a secondary buffer pool on SSD storage, it creates a relatively large and cheap space for “lukewarm” pages (ones that haven’t been recently accessed).

So why does it depend? Because your active dataset might be too large, and/or your combined reads/writes are still more than the physical disks can handle. It’s very important to consider the latter: write caching insulates you from the seeks and allows an intermediate layer to re-order writes to optimise the head movement, but the writes still need to be done and thus ultimately you remain bound by an upper end physical limit. Insulation is not complete separation. If your active dataset is larger than RAM+SSD, then the reads also also need to be taken into account for seek capacity.

So right now you could say that at decent prices, if your active dataset is in the range of a few hundred GB to even a few TB, RAM with the optional addition of SSD caching can all work out nicely – what can still make it go sour is the rate of writes. Conclusion: this type of setup provides you with more headroom than a battery backed RAID controller, should you need that.

Separating reporting to distinct database servers (typically slaves, configured for relatively few connections and large queries) actually still helps quite a bit as it really changes what’s in the buffer pool and other caches. Or, differently put, looking at the access patterns of the different parts of your application is important – there are numerous variation on this basic pattern. It’s a form of functional sharding.

You’ll have noticed I didn’t mention any benchmarks when discussing all this (and most other topics). Many if not most benchmarks have artificial aspects to them, which makes them problematic when dealing with the real world. As shown above, applying background knowledge of the systems and structures, logic, and maths gets you a very long way (either independently or in consultation with us). It can get you through important decision processes quicker. Testing can still play an important part, but then it’s either part of or very close to your real world environment, not a lab activity. It will be specific to you. Don’t get trapped having to deliver on numbers from benchmarks.

Posted on

The 2012 Leap Second on Linux

Sheeri K. Cabral at the Mozilla Foundation wrote about an issue with the June 30th 2012 leap second affecting at least MySQL, Java and Minecraft servers. It now appears that the underlying cause is a Linux kernel bug, as noted by John Stultz (IBM) on the Linux Kernel mailing list, and the team Sheeri is part of deserves due credit for doing awesome pattern recognition and being the first to bring it to public attention, enabling people to quickly correlate their own experience with that of others and finding a practical solution as well as helping figure out the cause.

Sheeri’s original post MySQL and the Leap Second, High CPU and the Fix describes how MySQL servers would suddenly exhibit high CPU usage during a period of low load. From her analysis this happened from the exact time that in UTC the date would go from June 30th to July 1st, and it so happens that this year a leap second (23:59:60) is inserted.

A quick fix is

$ sudo date -s "`date`"

Obviously a system reboot works as well, but that’s rather crude. Some sysadmins roll out some form of quickfix to their servers via Puppet.

It’s important to note that merely restarting MySQL Server (or another affected service) does not resolve the problem – not surprising, since they’re all victims of the problem rather than the cause. There is a MySQL bug report for it, with the kernel list reference as its last comment.

(post updated with Sheeri’s feedback – see comment below)

Update 2012-07-04

Several Heise Online articles provide additional information on the issue.

The kernel bug means that the [high resolution timer] code fails to set the system time when the leap second is added. The result is that the hrtimer representation of the time taken from the kernel is a second ahead of the system time. If an application then calls a kernel function with a timeout of less than a second, the kernel assumes that the timeout has elapsed immediately after setting the timer, and so returns to the program code immediately. In the event of a timeout, many programs simply repeat the requested operation and immediately set a new timer. This results in an endless loop, leading to 100% CPU utilisation.

Other tidbits:

  • The issue is not related to the 2009 leap second problem, so it’s not a regression.
  • A number of kernel developers had been performing testing in recent months to see whether the 2012 leap second insertion was likely to cause problems, finding and fixing several bugs in the process.
  • The problem appears to affect all kernel versions from 2.6.26 up to and including 3.3.Google’s way of handling leap seconds by inserting fractions of the second during the day prior to the event is interesting, their method completely avoids the leap second insert. Since leap seconds (and days) always require special handling in software, code that is only required on those instances, it makes sense to avoid them altogether if that’s possible. Obviously the Google method cannot be applied to leap days, but the issues with those are of a different nature to leap second insertion. See Time, technology and leaping seconds
  • The report from the Hetzner hosting service about the issue causing a 1MW spike in electricity usage deserves consideration. With the proliferation of servers, desktop computers and embedded devices such as wireless routers, time-based bugs have the potential to cause major disruption, in this case to an electricity grid. If systems controlling the environment (like the grid) are affected also, the consequences can be even more significant.

From Open Query’s own explorations (this includes some conjecture):

  • From our own client realm it appears that many Red Hat and CentOS systems were not affected, whereas those running Debian or Ubuntu kernels were. Since distros roll their own kernels with numerous patches, this is entirely possible. As a software developer knows, even a patch serving a different purpose could somehow affect the timer behaviour, thus avoiding the problem. There’s also the real possibility that it’s a (partial) correlation not a causality.
  • Some people don’t run the NTP service. That’s not something I wouldn’t really like to recommend, as having a proper system time definitely prevents more issues than it causes, but in this particular case it may have “saved” some systems from experiencing the issue.
  • The NTP service has many settings, some of which can also affect the behaviour for this case.

In a nutshell… the real world is complex and an event involves a combination of different factors resulting in a certain behaviour. While it’s sometimes easy to identify a cause for a particular environment (one client, in our case), getting a complete picture across more clients is more than a tad harder. If you simply put the information from different clients together, the evidence can appear to be rather contradictory.