Posted on

LKML: Live patching for 3.20

Building on the original kSplice idea and combining the efforts of the work done at Red Hat and SuSE, common infrastructure is now ready to be put into the Linux 3.20 mainline kernel – Red Hat and SuSE have already committed to using this.

I still reckon it’s freaky trickery, but heck – it works, and it’s great for server environments that have no redundancy (I prefer to fix that issue!) and can’t afford any downtime.

Posted on

Storage caching options in Linux 3.9 kernel

dm-cache is (albeit still classified “experimental”) is in the just released Linux 3.9 kernel. It deals with generic block devices and uses the device mapper framework. While there have been a few other similar tools flying around, since this one has been adopted into the kernel it looks like this will be the one that you’ll be seeing the most in to the future. It saves sysadmins the hassle of compiling extra stuff for a system.

A typical use is for an SSD to cache a HDD. Similar to a battery backed RAID controller, the objective is to insulate the application from latency caused by the mechanical device, the most laggy part of which is seek time (measured in milliseconds). Giventhe  relatively high storage capacity of an SSD (in the hundreds of GBs), this allows you to mostly disregard the mechanical latency for writes and that’s very useful for database systems such as MariaDB.

That covers writes (for the moment), but what about reads? Can MariaDB benefit from the read-caching? For the MyISAM storage engine, yes (as it relies on filesystem caching for speeding up row data access). For InnoDB, much less so. But let’s explore this, because it’s not quite a yes/no story – it depends. For typical systems with a correctly dimensioned system and InnoDB buffer pool, most of the active dataset will reside in RAM. For a system using a cached RAID controller that means that an actual disk read is not likely to be in the cache. With an SSD cache you might get lucky as it’s bigger – so stuff that has been read or written in some recent past may still be there. What we have found from testing with hdlatency (on actual client/hosting infra) is that SANs typically don’t have enough cache to pull that off – they too may have SSD caches now, but remember they get accessed by many more users with different data needs as well. The result of SSD filesystem caching for reads is actually similar to InnoDB tweaks that implement a secondary buffer pool on SSD storage, it creates a relatively large and cheap space for “lukewarm” pages (ones that haven’t been recently accessed).

So why does it depend? Because your active dataset might be too large, and/or your combined reads/writes are still more than the physical disks can handle. It’s very important to consider the latter: write caching insulates you from the seeks and allows an intermediate layer to re-order writes to optimise the head movement, but the writes still need to be done and thus ultimately you remain bound by an upper end physical limit. Insulation is not complete separation. If your active dataset is larger than RAM+SSD, then the reads also also need to be taken into account for seek capacity.

So right now you could say that at decent prices, if your active dataset is in the range of a few hundred GB to even a few TB, RAM with the optional addition of SSD caching can all work out nicely – what can still make it go sour is the rate of writes. Conclusion: this type of setup provides you with more headroom than a battery backed RAID controller, should you need that.

Separating reporting to distinct database servers (typically slaves, configured for relatively few connections and large queries) actually still helps quite a bit as it really changes what’s in the buffer pool and other caches. Or, differently put, looking at the access patterns of the different parts of your application is important – there are numerous variation on this basic pattern. It’s a form of functional sharding.

You’ll have noticed I didn’t mention any benchmarks when discussing all this (and most other topics). Many if not most benchmarks have artificial aspects to them, which makes them problematic when dealing with the real world. As shown above, applying background knowledge of the systems and structures, logic, and maths gets you a very long way (either independently or in consultation with us). It can get you through important decision processes quicker. Testing can still play an important part, but then it’s either part of or very close to your real world environment, not a lab activity. It will be specific to you. Don’t get trapped having to deliver on numbers from benchmarks.

Posted on

The 2012 Leap Second on Linux

Sheeri K. Cabral at the Mozilla Foundation wrote about an issue with the June 30th 2012 leap second affecting at least MySQL, Java and Minecraft servers. It now appears that the underlying cause is a Linux kernel bug, as noted by John Stultz (IBM) on the Linux Kernel mailing list, and the team Sheeri is part of deserves due credit for doing awesome pattern recognition and being the first to bring it to public attention, enabling people to quickly correlate their own experience with that of others and finding a practical solution as well as helping figure out the cause.

Sheeri’s original post MySQL and the Leap Second, High CPU and the Fix describes how MySQL servers would suddenly exhibit high CPU usage during a period of low load. From her analysis this happened from the exact time that in UTC the date would go from June 30th to July 1st, and it so happens that this year a leap second (23:59:60) is inserted.

A quick fix is

$ sudo date -s "`date`"

Obviously a system reboot works as well, but that’s rather crude. Some sysadmins roll out some form of quickfix to their servers via Puppet.

It’s important to note that merely restarting MySQL Server (or another affected service) does not resolve the problem – not surprising, since they’re all victims of the problem rather than the cause. There is a MySQL bug report for it, with the kernel list reference as its last comment.

(post updated with Sheeri’s feedback – see comment below)

Update 2012-07-04

Several Heise Online articles provide additional information on the issue.

The kernel bug means that the [high resolution timer] code fails to set the system time when the leap second is added. The result is that the hrtimer representation of the time taken from the kernel is a second ahead of the system time. If an application then calls a kernel function with a timeout of less than a second, the kernel assumes that the timeout has elapsed immediately after setting the timer, and so returns to the program code immediately. In the event of a timeout, many programs simply repeat the requested operation and immediately set a new timer. This results in an endless loop, leading to 100% CPU utilisation.

Other tidbits:

  • The issue is not related to the 2009 leap second problem, so it’s not a regression.
  • A number of kernel developers had been performing testing in recent months to see whether the 2012 leap second insertion was likely to cause problems, finding and fixing several bugs in the process.
  • The problem appears to affect all kernel versions from 2.6.26 up to and including 3.3.Google’s way of handling leap seconds by inserting fractions of the second during the day prior to the event is interesting, their method completely avoids the leap second insert. Since leap seconds (and days) always require special handling in software, code that is only required on those instances, it makes sense to avoid them altogether if that’s possible. Obviously the Google method cannot be applied to leap days, but the issues with those are of a different nature to leap second insertion. See Time, technology and leaping seconds
  • The report from the Hetzner hosting service about the issue causing a 1MW spike in electricity usage deserves consideration. With the proliferation of servers, desktop computers and embedded devices such as wireless routers, time-based bugs have the potential to cause major disruption, in this case to an electricity grid. If systems controlling the environment (like the grid) are affected also, the consequences can be even more significant.

From Open Query’s own explorations (this includes some conjecture):

  • From our own client realm it appears that many Red Hat and CentOS systems were not affected, whereas those running Debian or Ubuntu kernels were. Since distros roll their own kernels with numerous patches, this is entirely possible. As a software developer knows, even a patch serving a different purpose could somehow affect the timer behaviour, thus avoiding the problem. There’s also the real possibility that it’s a (partial) correlation not a causality.
  • Some people don’t run the NTP service. That’s not something I wouldn’t really like to recommend, as having a proper system time definitely prevents more issues than it causes, but in this particular case it may have “saved” some systems from experiencing the issue.
  • The NTP service has many settings, some of which can also affect the behaviour for this case.

In a nutshell… the real world is complex and an event involves a combination of different factors resulting in a certain behaviour. While it’s sometimes easy to identify a cause for a particular environment (one client, in our case), getting a complete picture across more clients is more than a tad harder. If you simply put the information from different clients together, the evidence can appear to be rather contradictory.

Posted on

Storage Miniconf Deadline Extended!

The organisers have given all miniconfs an additional few weeks to spruik for more proposal submissions, huzzah!

So if you didn’t submit a proposal because you weren’t sure whether you’d be able to attend LCA2010, you now have until October 23 to convince your boss to send you and get your proposal in.

Posted on

Storage miniconf at 2010

Since you were going to 2010 in Wellington, NZ anyway in January of next year, you should submit a proposal to speak at the data storage and retrieval miniconf.

If you have something to say about storage hardware, file systems, raid, lvm, databases or anything else linux or open source and storage related, please submit early and submit often!

The call for proposals is open until September 28.

Posted on

Tool of the day: inotify

I was actually exploring inotify-tools for something else, but they can also be handy for seeing what goes on below a mysqld process. inotify hooks into the filesystem handlers, and sees which files are accessed. You can then set triggers, or just display a tally over a certain period.

It has been a standard Linux kernel module since 2.6.13 (2005, wow that’s a long time ago already) and can be used through calls or the inotify-tools (commandline). So with the instrumentation already in the kernel, apt-get install inotify-tools is all you need to get started.

 # inotifywatch -v -t 20 -r /var/lib/mysql/* /var/lib/mysql/zabbix/*
Establishing watches...
Setting up watch(es) on /var/lib/mysql/mysql/user.frm
OK, /var/lib/mysql/mysql/user.frm is now being watched.
Total of 212 watches.
Finished establishing watches, now collecting statistics.
Will listen for events for 60 seconds.
total  modify  filename
2371   2371    /var/lib/mysql/
2148   2148    /var/lib/mysql/
1157   1157    /var/lib/mysql/ib_logfile0
24     24      /var/lib/mysql/zabbix/
24     24      /var/lib/mysql/zabbix/history.ibd
8      8       /var/lib/mysql/zabbix/trends_uint.ibd
6      6       /var/lib/mysql/zabbix/items.ibd
5      5       /var/lib/mysql/ibdata1

This is just a limited example from a dev box, but you can see the benefit. You can see which files have been accessed, in what way, and how many times over the specified period. Consequently this provides the most insight if you’re using innodb-file-per-table (or MyISAM) rather than a single InnoDB tablespace. But of course it depends a bit on what you’re looking for.

Posted on

Book: Pro Linux System Administration

Peter Lieverdink (also known as cafuego on IRC/, engineer on OurDelta builds and for Open Query) has co-authored a book that’s available since Monday. The title is Pro Linux System Administration published by Apress.

These days some people don’t want to bother with system administration, and either hire or outsource. Others want to find out more and do things themselves (home and small office use), and that’s the intended audience for this book.

Posted on

Update: MySQL tmpdir on tmpfs

Followup on Experiment: MySQL tmpdir on tmpfs, about tmpdir=/dev/shm in my.cnf (it’s not a dynamic variable that can be set at runtime). It’s working well, also confirmed by comments from others that they’ve been using it for a while.

This particular setting is Linux specific. On Solaris, the default /tmp is already on a tmpfs so that’s fine too. Brian reminded me that this tweak is also useful if you’re stuck with a 32-bit OS as you can then utilise some more memory in a practical way.

Extra useful hint from Harrison: if you are using replication, you will also want slave_load_tmpdir=/tmp on your slave (real disk which survives a restart). The issue is that with statement based binary logging, there are many events which create a file for a replicated LOAD DATA INFILE. If you stop your server after some of these events have occurred, but not all, it will break after you restart.

Thanks everyone for their feedback!

Posted on

Experiment: MySQL tmpdir on tmpfs

In MySQL, the tmpdir path is mainly used for disk-based sorts (if the sort_buffer_size is not enough) and disk-based temp tables. The latter cannot always be avoided even if you made tmp_table_size and max_heap_table_size quite large, since MEMORY tables don’t support TEXT/BLOB type columns, and also since you just really don’t want to run the risk of exceeding available memory by setting these things too large.

You can see how your server is doing with temporary tables, how many of those become disk tables, and also other created temporary files, by looking at SHOW GLOBAL STATUS LIKE ‘Created_tmp%’;

So why might tmpfs be better? Well, not just any tmpfs, but specifically /dev/shm. Its behaviour (see the kernel docs) may just fit the need, provided you have sufficient “spare” RAM in the machine anyway. Essentially the files will live in kernel cache space, and swap is potentially possible. You can even tweak limits, live. Mind you, this is Linux specific. Any system with glibc 2.2 or higher will have a /dev/shm available.

We’re currently running a comparative test in a production environment with one slave on tmpdir=/dev/shm. It being significantly faster is no real surprise but we just want to make sure that it behaves well over time, in the real world. Naturally it is best to not need disk-based temp tables at all for most queries, but for existing apps you can’t always change problematic design issues “right now”. It’s not a solution, but extra “breathing space” is useful. We’ll report back later with more results and possible insights from this experiment.

2009-06-16: See followup at Update: MySQL tmpdir on tmpfs.