Posted on

LKML: Live patching for 3.20

Building on the original kSplice idea and combining the efforts of the work done at Red Hat and SuSE, common infrastructure is now ready to be put into the Linux 3.20 mainline kernel – Red Hat and SuSE have already committed to using this.

I still reckon it’s freaky trickery, but heck – it works, and it’s great for server environments that have no redundancy (I prefer to fix that issue!) and can’t afford any downtime.

Posted on

Improving InnoDB index statistics

The MySQL/MariaDB optimiser likes to know things like the cardinality of an index – that is, the number of distinct values the index holds. For a PRIMARY KEY, which only has unique values, the number is the same as the number of rows.  For an indexed column that is boolean (such as yes/no) the cardinality would be 2.

There’s more to it than that, but the point is that the optimiser needs some statistics from indexes in order to try and make somewhat sane decisions about which index to use for a particular query. The statistics also need to be updated when a significant number of rows have been added, deleted, or modified.

In MyISAM, ANALYZE TABLE does a tablescan where everything is tallied, and the index stats are updated. InnoDB, on the other hand, has always done “index dives”, looking at a small sample and deriving from that. That can be ok as a methodology, but unfortunately the history is awkward. The number used to be a constant in the code (4), and that is inadequate for larger tables. Later the number was made a server variable innodb_stats_sample_pages and its default is now 8 – but that’s still really not enough for big(ger) tables.

We recently encountered this issue again with a client, and this time it really needed addressing as no workarounds were effective across the number of servers and of course over time. Open Query engineer Daniel filed which was picked up by MariaDB developed Jan Lindström.

Why not just set the innodb_stats_sample_pages much higher? Well, every operation takes time, so setting the number appropriate for your biggest table means that the sampling would take unnecessarily long for all the other (smaller, or even much smaller) tables. And that’s just unfortunate.

So why doesn’t InnoDB just scale the sample size along with the table size? Because, historically, it didn’t know the table size: InnoDB does not maintain a row count (this has to do with its multi-versioned architecture and other practicalities – as with everything, it’s a trade-off). However, these days we have persistent stats tables – rather than redoing the stats the first time a table is opened after server restart, they’re stored in a table. Good improvement. As part of that information, InnoDB now also knows how many index pages (and leaf nodes in its B+Tree) it has for each table. And while that’s not the same as a row count (rows have a variable length so there’s no fixed number of rows per index page), at least it grows along with the table. So now we have something to work with! The historical situation is no longer a hindrance.

In order to scale the sample size sanely, that is not have either too large a number for small tables, or a number for big tables that’s over the top, we’ll want some kind of logarithmic scale. For instance, log2(16 thousand) = 14, and log2(1 billion) = 30. That’s small enough to be workable. The new code as I suggested:

n_sample_pages = max(min(srv_stats_sample_pages, index->stat_index_size), log2(index->stat_index_size) * srv_stats_sample_pages);

This is a shorter construct (using min/max instead of ifs) of what was already there, combined with the logarithmic sample basis. For very small tables, either the innodb_stats_sample_pages number if used or the actual number of pages, whichever is smaller – for bigger tables, the log2 of the #indexpages is used, multiplied by the dynamic system variable innodb_stats_sample_pages. So we can still scale and thus influence the system in case we want more samples. Simple, but it seems effective – and it any case we get decent results in many more cases than before, so it’s a worthwhile improvement. Obviously, since it’s a statistical sample, it could still be wrong for an individual case.

Jan reckons that just like MyISAM, InnoDB should do a table scan and work things out properly – I agree, this makes sense now that we have persistent stats. So the above is a good fix for 5.5 and 10.0, and the more significant change to comprehensive stats can be in a near future major release. So then we have done away with the sampling altogether, instead basing the info on the full dataset. Excellent.

Another issue that needed to be dealt with is when InnoDB recalculates the statistics. You don’t want to do it on every change, but regularly if there has been some change is good as it might affect which indexes should be chosen for optimal query execution. The hardcoded rule was 1/16th of the table or 2 billion rows, whichever comes first. Again that’s unfortunate, because for a bigger table 1/16th still amounts to a very significant number. I’d tend towards setting the upper bound to say 100,000. Jan put in a new dynamic system variable for this, stat_modified_counter. It essentially replaces the old static value of 2 billion, providing us with a configurable upper bound. That should do nicely!

Once again horay for open source, but in particular responsive and open development. If someone reports a bug and there is no interaction between developer and the outside world until something is released, the discussion and iterative improvement process to come to a good solution cannot occur. The original code definitely worked as designed, but it was no longer suitable in the context of today’s usage/needs. The user-dev interaction allowed for a much better conclusion.

Posted on

Optimising multi-threaded replication

Multi-threaded replication is a new feature introduced in MySQL 5.6 and MariaDB 10.0. In traditional single-threaded replication, the slaves have a disadvantage as they have to process in sequence what a master executed in parallel. This, plus the fact that slaves usually have a lot of read-only connections to deal with as well, can easily create performance problems. That is, a single-threaded slave needs to be set to allow fewer connections, otherwise there’s a higher risk of it not being able to keep up with the replication stream. There is no exact rule for this, as it relates to general I/O capacity and fsync latency, as well as general CPU and RAM considerations and query patterns.

Currently, it appears that the MariaDB implementation is a bit more mature in terms of design and effective implementation. For instance, MySQL 5.6 does not currently support retrying transactions while doing parallel replication. This can easily cause problems as commit conflicts are possible and obviously need to be handled. So for the purpose of this blog post, we’re going to focus on MariaDB 10.0, and it is what we currently use with some of our clients. MariaDB developer Kristian Nielsen has done awesome work and is very responsive to questions and bug reports. Rock on, Kristian!

The fundamental challenge for parallel replication is that some queries are safe to be executed in parallel, and some are not – and somehow, the server needs to know which is which. MariaDB employs two strategies to assist with this:

  1. Group commit. Since 5.5, transactions (remember, a standalone statement without START TRANSACTION/COMMIT is technically also a transaction) that happen around the same time in different connections are grouped in the binary log and effectively committed together. This is accomplished by the server trying to gather at least a certain number of transactions (binlog_commit_wait_count) and having individual connections wait just a fraction (binlog_commit_wait_usec) to increase the chances of gathering a nice number. This strategy reduces I/O and fsyncs, and thus helps quite a bit with write scaling. The miniscule delay that a transaction might incur because it has to wait is easily offset by the overall better performance. It’s good stuff. For the purpose of parallel replication, any transactions in the same group commit can in principle be executed in parallel on a slave – conflicts are possible, so deadlock handling and retries are essential.
  2. Global Transaction IDs (GTID) Domain IDs (gtid_domain_id) in MariaDB 10.0, which an application can set within a connection. Quite often, different applications and different components of applications use the same database server, but their actions are completely independent: no write operations will ever conflict between the different applications. GTID Domain IDs allows us to tell the server about this, allowing it to always run those transactions in parallel even if they weren’t part of the same group commit! Now that’s a real bonus!

Now, as a practicality, we’re not always able to modify applications to for instance set the GTID Domain ID. Plus, a magic (integer) number is required and so we need some planning/coordination between completely independent applications! Through database server consolidation, you may get applications on your server that were previously on a different one – strictly speaking having two applications use the same GTID Domain ID is harmless (after all, by default all transactions run in the same domain!) but obviously it doesn’t improve performance.

Open Query engineer Daniel Black and I came up with the following. It’s a combination of MySQL’s init_connect system variable (gets called when a user connects, except if they have SUPER privilege), a few stored procedures, and an event to keep the domain map reasonably up-to-date. The premise of this implementation is that each database username uniquely identifies an application, and that no two usernames refer to the same application. So, if you have for instance a general application user but also one for background scripts or one with special administrative privileges, then you need to modify the code in setdomain() a bit to take this into account. If you have transactions with a different GTID Domain ID execute in parallel on the same database, obviously this can cause conflicts. The MariaDB slave threads will retry, but in some cases conflicts cannot be resolved by retrying.

Obviously it’s not perfect, but it does resolve the issue for many situations. Feedback and improvements welcome!

# Automatic GTID Domain IDs for MariaDB 10.0
# Copyright (C) 2014 by Daniel Black & Arjen Lentz, Open Query Pty Ltd (
# Version 2014-11-18, initial publication via OQ blog (
# This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International

USE mysql

# modify this logic for your particular application/user naming convention
  SELECT domain INTO @l_gtid_domain_id
    FROM mysql.user_domain_map
   WHERE user=LEFT(cuser, LOCATE('@',cuser) -1 );

  SET SESSION gtid_domain_id=@l_gtid_domain_id;
END //

DROP PROCEDURE IF EXISTS create_user_domain_map //
  DECLARE userhostcur CURSOR FOR SELECT user,host FROM mysql.user;

  CREATE TABLE IF NOT EXISTS mysql.user_domain_map
    user CHAR(80) COLLATE utf8_bin NOT NULL UNIQUE

  INSERT IGNORE INTO mysql.user_domain_map(user)
         SELECT user FROM mysql.user;

  OPEN userhostcur;
  LOOP FETCH userhostcur INTO u,h;
    INSERT IGNORE INTO mysql.procs_priv(Host,Db,User, Routine_name, Routine_type, Grantor, Proc_priv)
           VALUES(h, 'mysql', u, 'setdomain', 'PROCEDURE', CURRENT_USER(), 'Execute');


# (re)create the user domain map
CALL create_user_domain_map(); 

# set up event schedule
CREATE EVENT update_user_domain_map ON SCHEDULE EVERY 1 DAY DO CALL create_user_domain_map(); 

# also set this in my.cnf so it's persistent
# init_connect='CALL mysql.setdomain(current_user());'
SET GLOBAL init_connect='CALL mysql.setdomain(current_user());';
Posted on

Database talks at OSDC 2014 Gold Coast

Open Query Engineer Daniel Black and Engineer/Trainer Peter Lock will be presenting sessions at the upcoming Open Source Developers’ Conference which is hosted at Griffith University Gold Coast Campus, 4-7 November 2014.

I also spotted

which should be very interesting as well. There many be more still, there are lots of sessions!

Full conference tickets cost less than $300 and include the lunches as well as the conference dinner, and all the tutorials/workshops in the main conference. Speaking from experience, OSDC is always great with good talks and excellent people to chat with.

With the conference over, the session videos are now online!

Posted on

GROUP BY fixed

Friend and former colleague Roland Bouwman has written an excellent update on the GROUP BY implementation in MySQL.

MySQL’s implementation of GROUP BY has historically been quirky. Sometimes that quirkiness has been useful, but often it causes grief as SQL authors can make mistakes that are executed but don’t produce the results they want (or expect).

Simple example:

SELECT cat, COUNT(val) as cnt, othercol FROM tbl GROUP BY cat

The ‘cat‘ column is in the GROUP BY clause, the COUNT(val) is an aggregate, but the ‘othercol‘ column is … well… neither. What used to effectively happen is that the server would pick one othercol value from within each group. As I noted before, sometimes useful but often a pest as the server wouldn’t know if you just made a mistake or whether you actually intended to use this ‘feature’.

The long existent sql_mode option ONLY_FULL_GROUP_BY mitigated some of that (but not fully in all cases – see Roland’s explanation).

With the sql_mode option enabled, the server chucks an error for faulty constructs, the most common one being:

Message: ‘col’ isn’t in GROUP BY

MySQL 5.7.5 and above have ONLY_FULL_GROUP_BY enabled by default, with an updated implementation, so GROUP BY will now generally behave properly.

We know that many existing applications often rely on the previously “standard” but technically quirky/faulty behaviour. So when upgrading to this version or beyond, it’s important to

  • use and test in development and staging environment first. This also means upgrading your dev environment first.
  • scan the code for GROUP BY use, and ensure that the queries are correct.
  • have proper error handling and reporting in your application.
  • in some cases disable the sql_mode ONLY_FULL_GROUP_BY option, but that’s really a last resort. It’s best to fix the application.


Posted on

Hard Drive Reliability

Posted on

Tracing down a problem, finding sloppy code

Daniel was tracking down what appeared to be a networking problem….

  • server reported 113 (No route to host)
  • However, an strace did not reveal the networking stack ever returning that.
  • On the other side, IP packets were actually received.
  • When confronted with mysteries like this, I get suspicious – mainly of (fellow) programmers.
  • I suggested a grep through the source code, which revealed  return -EHOSTUNREACH;
  • Mystery solved, which allowed us to find what was actually going on.


  1. Don’t just believe or presume the supposed origin of an error.
  2. Programmers often take shortcuts that cause grief later. I fully appreciate how the above code came about, but I still think it was wrong. Mapping a “similar” situation onto an existing error code is convenient. But when an error occurs, the most important thing is for people to be able to track down what the root cause is. Reporting this error outside of its original context (error code reported by network stack) is clearly unhelpful, it actually misdirects and requires people to essentially waste time to track it down (as above).
  3. Horay once again for Open Source, which makes it so much easier to figure these things out. While possibly briefly embarrassing for the programmer, more eyes allows code to improve better and faster – and, perhaps, also entices towards better coding practices from the outset (I can hope!).

What do you think?

Posted on

Munin graphing of MySQL

While there are many graphing tools out there and we’ve used Munin for a while now.

The MySQL plugin for Munin had fallen out of date and the show engine innodb status output changed in 5.5 making some bits of the plugin simply not work any more. Also the show global status has some extra variables so there was a need to create new graphs.

All of these are now in the 2.1.8+ development releases of Munin.

Here are samples of the new/updated graphs.


Munin table definations
Table Definitions


Innodb Buffer Pool Activity

Innodb Buffer Pool Internal Breakdown

Innodb Insert Buffer
Innodb Insert Buffer

Innodb Buffer Pool

Innodb Semaphores
Innodb Semaphores

Innodb Master Thread
Innodb Master Thread

Innodb Adaptive Hash Index

Innodb Queries and Transactions
Innodb Queries and Transactions

Innodb Read Views
Innodb Read Views

Innodb Descriptors
Innodb Descriptors

Performance Schema Losses
Performance Schema Losses

Query Cache
Query Cache

Maximum Memory of MySQL
Maximum Memory of MySQL


Handler Read
Handler Read

Handler Transaction
Handler Transaction

Handler Write
Handler Write

Handler Temporary Write/Updates
Handler Temporary Write/Updates

Execution (triggers and events)

Index Condition Pushdown

Multi Range Read Optimizations
Multi Range Read Optimizations

Some of these above graphs may miss a variable or two with MariaDB-10 because of variable name changes. These will be corrected when we get to those. In MariaDB-10 there is useful transition to information schema tables for status information which will make it significantly easier to parse.

Individual buffer pool information also has been parsed out however we haven’t worked out how to graphing this correctly. Also not yet merged is a bunch of Galera graphs which are currently waiting on some Galera provider changes.

We’ll continue to work with the Munin developers to keep this MySQL plugin up to date and useful.

There’s other graphs in the MySQL Munin plugins that we haven’t changed so aren’t included here.

Posted on

Password rules

The below comes from an Australian government site (formatting is mine, for readability):

“Your password must be a minimum length of nine characters, consisting of three of the following – lowercase (a-z) and uppercase (A-Z) alphabetic characters,
numeric characters (0-9) or
special characters (! $ # %).
It cannot contain any 2 consecutive characters that appear in your user ID, first name or last name.
It must not be one of your 8 previous passwords.”

That’s a serious looking ruleset. But does it actually make things safer?

I doubt it. What do you think?

Posted on

innodb_flush_logs_on_trx_commit and Galera Cluster

We deploy Galera Cluster (in MariaDB) for some clients, and innodb_flush_logs_on_trx_commit is one of the settings we’ve been playing with. The options according to the manual:

  • =0 don’t write or flush at commit, write and flush once per second
  • =1 write and flush at trx commit
  • =2 write log, but only flush once per second

The flush (fsync) refers to the mechanism the filesystem uses to try and guarantee that written data is actually on the physical medium/device and not just in a buffer (of course cached RAID controllers, SANs and other devices use some different logic there, but it’s definitely written beyond the OS space).

In a non-cluster setup, you’d always want it to be =1 in order to be ACID compliant and that’s also InnoDB’s default. So far so good. For cluster setups, you could be more lenient with this as you require ACID on the cluster as a whole, not each individual machine – after all, if one machine drops out at any point, you don’t lose any data.

Codership docu recommended =2, so that’s what Open Query engineer Peter Lock initially used for some tests that he was conducting. However, performance wasn’t particularly shiny – actually not much higher than =1. That in itself is interesting, because typically we regard the # of fsyncs/second a storage system can deal with as a key indicator of performance capacity. That is, as our HD Latency tool shows when you run it on a storage device (even your local laptop harddisk), the most prominent aspect of what limits the # of writes you can do per second appears to be the fsyncs.

I then happened to chat with Oli Sennhauser (former colleague from MySQL AB) who now runs the FromDual MySQL/MariaDB consulting firm in Switzerland, and he’s been working with Galera for quite a long time. He recognised the pattern and said that he too had that experience, and he thought =0 might be the better option.

I delved into the InnoDB source code to see what was actually happening, and the code indeed concurs with what’s described in the manual (that hasn’t always been the case ;-). I also verified this with Jeremy Cole whom we may happily regard as guru on “how InnoDB actually works”. The once-per-second flush (and optional preceding write) is performed by the InnoDB master thread. Take a peek in log/log0log.c and trx/trx0trx.c, specifically trx_commit_off_kernel() and srv_sync_log_buffer_in_background().

In conclusion:

  1. Even with =0, the log does get written and flushed once per second. This is done in the background so connection threads don’t have to wait for it.
  2. There is no setting where there is never a flush/fsync.
  3. With =2, the writing of the log takes place in the connection thread and this appears to incur a significant overhead, at least relative to =0. Aside from the writing of the log at transaction commit, there doesn’t appear to be a difference.
  4. Based on the preceding points, I would say that if you don’t want =1, you might as well set =0 in order to get the performance you’re after. There is of course a slight practical difference between =0 and =2. With =2 the log is immediately written. If the mysqld process were to crash within a second after that, the OS would close the file and have that log write stored. With =0 that log data wouldn’t have been written. If the OS or machine fails, that log write is lost either way.

In production environments, we tend to mainly want to mitigate trouble from system failures, so =0 appears to be a suitable/appropriate option – for a Galera cluster environment.

What remains is the question of why the log write operation appears to reduce transaction commit performance so much, in a way more so than the flush/fsync. Something to investigate further!
Your thoughts?