The 2012 Leap Second on Linux

Sheeri K. Cabral at the Mozilla Foundation wrote about an issue with the June 30th 2012 leap second affecting at least MySQL, Java and Minecraft servers. It now appears that the underlying cause is a Linux kernel bug, as noted by John Stultz (IBM) on the Linux Kernel mailing list. The team Sheeri is part of deserves due credit for doing awesome pattern recognition and being the first to bring the issue to public attention. That enabled people to quickly correlate their own experience with that of others, find a practical solution, and help figure out the cause.

Sheeri’s original post, MySQL and the Leap Second, High CPU and the Fix, describes how MySQL servers would suddenly exhibit high CPU usage during a period of low load. From her analysis, this started at the exact moment the UTC date rolled over from June 30th to July 1st, which is when this year’s leap second (23:59:60) was inserted.

A quick fix is to set the system clock to its own current value:

$ sudo date -s "`date`"

Obviously a system reboot works as well, but that’s rather crude. Some sysadmins roll out some form of quickfix to their servers via Puppet.

It’s important to note that merely restarting MySQL Server (or another affected service) does not resolve the problem – not surprising, since they’re all victims of the problem rather than the cause. There is a MySQL bug report for it, with the kernel list reference as its last comment.

(post updated with Sheeri’s feedback – see comment below)

Update 2012-07-04

Several Heise Online articles provide additional information on the issue.

The kernel bug means that the [high resolution timer] code fails to set the system time when the leap second is added. The result is that the hrtimer representation of the time taken from the kernel is a second ahead of the system time. If an application then calls a kernel function with a timeout of less than a second, the kernel assumes that the timeout has elapsed immediately after setting the timer, and so returns to the program code immediately. In the event of a timeout, many programs simply repeat the requested operation and immediately set a new timer. This results in an endless loop, leading to 100% CPU utilisation.
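The retry pattern described in that quote can be mimicked in user space. The following is only a simulation, not the kernel code path: `healthy_wait` and `broken_wait` are hypothetical stand-ins for a timed kernel call that either blocks for the full timeout or returns at once because the timeout is considered already expired. The 50 ms timeout and 0.2 s budget are arbitrary values chosen for illustration.

```python
import time


def run_retry_loop(timed_wait, timeout=0.05, budget=0.2):
    """Retry a timed wait until `budget` seconds have passed.

    Returns how many times the wait was issued. In this sketch the
    operation never succeeds, so every wait ends in a "timeout" and
    the program simply tries again, as many real programs do.
    """
    deadline = time.monotonic() + budget
    iterations = 0
    while time.monotonic() < deadline:
        timed_wait(timeout)
        iterations += 1
    return iterations


def healthy_wait(timeout):
    # Normal behaviour: the caller is blocked for the whole timeout,
    # so the process is idle between attempts.
    time.sleep(timeout)


def broken_wait(timeout):
    # Stand-in for the buggy behaviour: the timeout is treated as
    # already elapsed, so the call returns without blocking.
    return None


if __name__ == "__main__":
    print("healthy wait iterations:", run_retry_loop(healthy_wait))
    print("broken wait iterations: ", run_retry_loop(broken_wait))
```

With the healthy wait the loop runs a handful of times and spends nearly all of its time asleep. With the broken wait the same loop spins continuously for the whole budget, which is what shows up as 100% CPU utilisation.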

Other tidbits:

  • The issue is not related to the 2009 leap second problem, so it’s not a regression.
  • A number of kernel developers had been performing testing in recent months to see whether the 2012 leap second insertion was likely to cause problems, finding and fixing several bugs in the process.
  • The problem appears to affect all kernel versions from 2.6.26 up to and including 3.3.
  • Google’s way of handling leap seconds, adding fractions of the second over the course of the day prior to the event, is interesting: their method avoids the leap second insert completely. Leap seconds (and leap days) always require special handling in software, in code that only runs on those occasions, so it makes sense to avoid them altogether if that’s possible. Obviously the Google method cannot be applied to leap days, but the issues with those are of a different nature to leap second insertion. See Time, technology and leaping seconds.
  • The report from the Hetzner hosting service about the issue causing a 1MW spike in electricity usage deserves consideration. With the proliferation of servers, desktop computers and embedded devices such as wireless routers, time-based bugs have the potential to cause major disruption, in this case to an electricity grid. If systems controlling the environment (like the grid) are affected also, the consequences can be even more significant.
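To make the smearing idea from the list above more concrete, here is a minimal sketch of spreading one extra second over a window instead of inserting it in one step. The linear ramp, the 24-hour window and the function name `smeared_offset` are assumptions made for illustration; the formula Google actually uses is not given here and may well differ.

```python
def smeared_offset(seconds_into_window, window=86400.0, leap=1.0):
    """Portion of the leap second already absorbed by the clock.

    Assumes a linear ramp over `window` seconds (an illustrative
    choice, not necessarily what any real time service does).
    """
    if seconds_into_window <= 0:
        return 0.0            # smear has not started yet
    if seconds_into_window >= window:
        return leap           # the full second has been absorbed
    return leap * seconds_into_window / window


if __name__ == "__main__":
    for hours in (0, 6, 12, 18, 24):
        offset = smeared_offset(hours * 3600)
        print(f"{hours:2d}h into the window: {offset:.2f}s absorbed")
```

Under this scheme the clock drifts by a tiny amount each second, so clients never see a 23:59:60 and the special-case code path is never exercised.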

From Open Query’s own explorations (this includes some conjecture):

  • From our own client realm it appears that many Red Hat and CentOS systems were not affected, whereas those running Debian or Ubuntu kernels were. Since distros roll their own kernels with numerous patches, this is entirely possible. As any software developer knows, even a patch serving a different purpose could somehow affect the timer behaviour, thus avoiding the problem. There’s also the real possibility that it’s a (partial) correlation rather than causation.
  • Some people don’t run the NTP service. That’s not something I would recommend, as having a proper system time definitely prevents more issues than it causes, but in this particular case it may have “saved” some systems from experiencing the issue.
  • The NTP service has many settings, some of which can also affect the behaviour for this case.

In a nutshell… the real world is complex and an event involves a combination of different factors resulting in a certain behaviour. While it’s sometimes easy to identify a cause for a particular environment (one client, in our case), getting a complete picture across more clients is more than a tad harder. If you simply put the information from different clients together, the evidence can appear to be rather contradictory.