mysql | Open Query

Posted on November 18, 2014 by Arjen Lentz

Optimising multi-threaded replication

Multi-threaded replication is a new feature introduced in MySQL 5.6 and MariaDB 10.0. In traditional single-threaded replication, slaves are at a disadvantage because they have to process in sequence what a master executed in parallel. This, plus the fact that slaves usually also have a lot of read-only connections to deal with, can easily create performance problems. In practice, a single-threaded slave needs to be configured to allow fewer connections; otherwise there’s a higher risk of it not being able to keep up with the replication stream. There is no exact rule for this, as it relates to general I/O capacity and fsync latency, as well as CPU and RAM considerations and query patterns.

Currently, the MariaDB implementation appears to be a bit more mature in both design and implementation. For instance, MySQL 5.6 does not currently support retrying transactions while doing parallel replication. This can easily cause problems, as commit conflicts are possible and obviously need to be handled. So for the purpose of this blog post, we’re going to focus on MariaDB 10.0, which is also what we currently use with some of our clients. MariaDB developer Kristian Nielsen has done awesome work and is very responsive to questions and bug reports. Rock on, Kristian!

The fundamental challenge for parallel replication is that some queries are safe to be executed in parallel, and some are not – and somehow, the server needs to know which is which. MariaDB employs two strategies to assist with this:

Group commit. Since 5.5, transactions (remember, a standalone statement without START TRANSACTION/COMMIT is technically also a transaction) that happen around the same time in different connections are grouped in the binary log and effectively committed together. The server does this by trying to gather at least a certain number of transactions (binlog_commit_wait_count) and having individual connections wait just a fraction of a second (binlog_commit_wait_usec, in microseconds) to increase the chances of gathering a nice number. This strategy reduces I/O and fsyncs, and thus helps quite a bit with write scaling. The minuscule delay that a transaction might incur because it has to wait is easily offset by the overall better performance. It’s good stuff. For the purpose of parallel replication, any transactions in the same group commit can in principle be executed in parallel on a slave. Conflicts are still possible, so deadlock handling and retries are essential.
Global Transaction ID (GTID) Domain IDs (gtid_domain_id), new in MariaDB 10.0, which an application can set within a connection. Quite often, different applications and different components of applications use the same database server, but their actions are completely independent: no write operations will ever conflict between the different applications. GTID Domain IDs allow us to tell the server about this, so it can always run those transactions in parallel even if they weren’t part of the same group commit! Now that’s a real bonus!
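To make the combination of these two strategies more concrete, here is a simplified Python model of which replicated transactions a slave could apply concurrently. It is not MariaDB’s actual scheduler, and it ignores retries and thread limits. The Trx type and its fields are hypothetical. The model assumes each transaction carries its domain ID and the group commit it belonged to. Within a domain, transactions from the same group commit form one batch that can be applied in parallel, and batches follow binlog order. Separate domains proceed independently.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Trx:
    """Hypothetical summary of one replicated transaction."""
    name: str
    domain_id: int     # gtid_domain_id of the originating connection
    commit_group: int  # which group commit it was part of on the master


def plan_batches(binlog: list[Trx]) -> dict[int, list[list[str]]]:
    """Split a binlog stream into ordered batches per domain.

    Transactions inside a batch may be applied in parallel (conflicts remain
    possible, hence retries). Batches within one domain are applied in binlog
    order, while different domains never wait for each other.
    """
    plan: dict[int, list[list[str]]] = {}
    for domain in dict.fromkeys(t.domain_id for t in binlog):  # first-seen order
        stream = [t for t in binlog if t.domain_id == domain]
        # consecutive transactions from the same group commit share a batch
        plan[domain] = [
            [t.name for t in group]
            for _, group in groupby(stream, key=lambda t: t.commit_group)
        ]
    return plan


binlog = [
    Trx("a1", domain_id=1, commit_group=1),
    Trx("a2", domain_id=1, commit_group=1),
    Trx("b1", domain_id=2, commit_group=1),
    Trx("a3", domain_id=1, commit_group=2),
    Trx("b2", domain_id=2, commit_group=3),
]
print(plan_batches(binlog))
```

In this example b2 does not have to wait for a3, even though it was committed later on the master, because it belongs to another domain.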

In practice, however, we’re not always able to modify applications to, for instance, set the GTID Domain ID. Plus, a magic (integer) number is required, so we need some planning and coordination between otherwise completely independent applications! Through database server consolidation, you may also get applications on your server that were previously on a different one. Strictly speaking, having two applications use the same GTID Domain ID is harmless (after all, by default all transactions run in the same domain!), but obviously it doesn’t improve performance.

Open Query engineer Daniel Black and I came up with the following. It combines MySQL’s init_connect system variable (executed whenever a user connects, unless they have the SUPER privilege), a few stored procedures, and an event to keep the domain map reasonably up to date. The premise of this implementation is that each database username uniquely identifies an application, and that no two usernames refer to the same application. So if you have, for instance, a general application user but also one for background scripts or one with special administrative privileges, you need to modify the code in setdomain() a bit to take this into account. If transactions with different GTID Domain IDs modify the same data and execute in parallel, this can obviously cause conflicts. The MariaDB slave threads will retry, but in some cases conflicts cannot be resolved by retrying.

Obviously it’s not perfect, but it does resolve the issue for many situations. Feedback and improvements welcome!

# Automatic GTID Domain IDs for MariaDB 10.0
# Copyright (C) 2014 by Daniel Black & Arjen Lentz, Open Query Pty Ltd (http://openquery.com.au)
# Version 2014-11-18, initial publication via OQ blog (https://openquery.com.au/blog/)
#
# This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International
# http://creativecommons.org/licenses/by-sa/4.0/

USE mysql
DELIMITER //

DROP PROCEDURE IF EXISTS setdomain //
CREATE PROCEDURE setdomain(IN cuser varchar(140)) DETERMINISTIC READS SQL DATA SQL SECURITY DEFINER
BEGIN
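  DECLARE l_domain INT UNSIGNED;
  # users missing from the map fall back to domain 10
  DECLARE EXIT HANDLER FOR NOT FOUND SET SESSION gtid_domain_id=10;
# modify this logic for your particular application/user naming convention
  # CURRENT_USER() returns 'user@host'; keep only the part before the '@'
  SELECT domain INTO l_domain
    FROM mysql.user_domain_map
   WHERE user=LEFT(cuser, LOCATE('@',cuser) -1 );

  # runs as DEFINER, since setting gtid_domain_id requires the SUPER privilege
  SET SESSION gtid_domain_id=l_domain;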
END //

DROP PROCEDURE IF EXISTS create_user_domain_map //
CREATE PROCEDURE create_user_domain_map() MODIFIES SQL DATA
BEGIN
  DECLARE u CHAR(80);
  DECLARE h CHAR(60);
  DECLARE userhostcur CURSOR FOR SELECT user,host FROM mysql.user;
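  # when the cursor runs out of rows, reload the grant tables and exit
  DECLARE EXIT HANDLER FOR NOT FOUND FLUSH PRIVILEGES;

  # domain IDs are handed out from 10 upwards, one per distinct user name
  CREATE TABLE IF NOT EXISTS mysql.user_domain_map
  (
    domain INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user CHAR(80) COLLATE utf8_bin NOT NULL UNIQUE
  ) AUTO_INCREMENT=10, ENGINE=InnoDB;

  # new user names get the next ID; existing ones keep theirs (UNIQUE + IGNORE)
  INSERT IGNORE INTO mysql.user_domain_map(user)
         SELECT user FROM mysql.user;

  # grant EXECUTE on setdomain() to every user@host so init_connect can call it
  OPEN userhostcur;
  LOOP FETCH userhostcur INTO u,h;
    INSERT IGNORE INTO mysql.procs_priv(Host,Db,User, Routine_name, Routine_type, Grantor, Proc_priv)
           VALUES(h, 'mysql', u, 'setdomain', 'PROCEDURE', CURRENT_USER(), 'Execute');
  END LOOP;
END //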

DELIMITER ; 

# (re)create the user domain map
CALL create_user_domain_map(); 

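# set up event schedule (the event only runs if event_scheduler=ON)
CREATE EVENT update_user_domain_map ON SCHEDULE EVERY 1 DAY DO CALL create_user_domain_map(); 

# also set this in my.cnf so it's persistent
# init_connect='CALL mysql.setdomain(current_user());'
# note: an error in init_connect makes the client connection fail, so users created
# since the last map update can't connect until create_user_domain_map() runs again
SET GLOBAL init_connect='CALL mysql.setdomain(current_user());';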
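If you need to adapt setdomain() to your own naming convention, it can help to model the lookup outside the server first. This Python sketch mimics the two procedures. The dict stands in for mysql.user_domain_map, and IDs are assigned sequentially from 10, as AUTO_INCREMENT=10 would do on a fresh table. Real InnoDB auto-increment values can have gaps, and rows from mysql.user come back in no guaranteed order. One consequence is worth noticing. Users missing from the map fall back to domain 10, which is also the first ID handed out, so they share a domain with one mapped user. As mentioned above, that is harmless; it just doesn’t gain any parallelism.

```python
DEFAULT_DOMAIN = 10  # value used by setdomain()'s NOT FOUND handler


def build_domain_map(users: list[str], first_id: int = 10) -> dict[str, int]:
    """Mimic create_user_domain_map(): one domain ID per distinct user name."""
    domain_map: dict[str, int] = {}
    for user in users:
        if user not in domain_map:  # the UNIQUE key makes INSERT IGNORE skip repeats
            domain_map[user] = first_id + len(domain_map)
    return domain_map


def setdomain(current_user: str, domain_map: dict[str, int]) -> int:
    """Mimic setdomain(): strip '@host' from CURRENT_USER() and look up the name."""
    at = current_user.find("@")
    # LEFT(cuser, LOCATE('@', cuser) - 1) yields '' when there is no '@'
    user = current_user[:at] if at >= 0 else ""
    return domain_map.get(user, DEFAULT_DOMAIN)


# mysql.user has one row per user@host, so names can repeat
domain_map = build_domain_map(["app", "app", "cron", "reports"])
print(domain_map)                               # {'app': 10, 'cron': 11, 'reports': 12}
print(setdomain("cron@localhost", domain_map))  # 11
print(setdomain("newuser@%", domain_map))       # 10, the same domain as 'app'
```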


Posted on October 7, 2014 (updated November 18, 2014) by Arjen Lentz

Database talks at OSDC 2014 Gold Coast

Open Query Engineer Daniel Black and Engineer/Trainer Peter Lock will be presenting sessions at the upcoming Open Source Developers’ Conference which is hosted at Griffith University Gold Coast Campus, 4-7 November 2014.

I also spotted these sessions:

Postgres & JSON: NoSQL in SQL? (Nick Moore)
Databases and Groovy (Paul King)

which should be very interesting as well. There may be more still; there are lots of sessions!

Full conference tickets cost less than $300 and include the lunches as well as the conference dinner, and all the tutorials/workshops in the main conference. Speaking from experience, OSDC is always great with good talks and excellent people to chat with.

With the conference over, the session videos are now online!

The main place to view them is YouTube, on the OSDC14 channel: https://www.youtube.com/user/osdc14
If you want to download a good quality video, grab the WEBM format files from the Linux Australia Mirror: http://mirror.linux.org.au/osdc/osdc2014/


Posted on September 5, 2014 by Arjen Lentz — 1 Comment

Tracing down a problem, finding sloppy code

Daniel was tracking down what appeared to be a networking problem:

The server reported error 113 (No route to host).
However, strace did not show the networking stack ever returning that error.
On the other side, IP packets were actually received.
When confronted with mysteries like this, I get suspicious, mainly of (fellow) programmers.
I suggested a grep through the source code, which turned up return -EHOSTUNREACH; in the program’s own code.
Mystery solved, which allowed us to find what was actually going on.

Lessons:

Don’t just believe or presume the supposed origin of an error.
Programmers often take shortcuts that cause grief later. I fully appreciate how the above code came about, but I still think it was wrong. Mapping a “similar” situation onto an existing error code is convenient. But when an error occurs, the most important thing is for people to be able to track down the root cause. Reporting this error outside of its original context (an error code reported by the network stack) is clearly unhelpful; it actually misdirects, and essentially forces people to waste time tracking it down (as above).
Hooray once again for Open Source, which makes it so much easier to figure these things out. While possibly briefly embarrassing for the programmer, more eyes allow code to improve better and faster, and perhaps also encourage better coding practices from the outset (I can hope!).
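Here is a tiny illustration of why this misdirects, written in Python rather than the C we were actually reading. The helper below is hypothetical. It maps a “similar” situation (an unknown peer) onto EHOSTUNREACH without any network call being made. Whoever reads the resulting message sees exactly what a genuine routing failure would produce. strace has nothing to show, because no system call failed.

```python
import errno
import os


def connect_to_peer(peer_known: bool) -> int:
    """Hypothetical helper that reuses an existing errno for a 'similar' case."""
    if not peer_known:
        # No socket or syscall involved: this is purely an application decision.
        return -errno.EHOSTUNREACH
    return 0


def describe(rc: int) -> str:
    """What ends up in the log: only the number and its standard text survive."""
    return f"server reported {-rc} ({os.strerror(-rc)})"


print(describe(connect_to_peer(peer_known=False)))
```

On Linux with glibc this prints server reported 113 (No route to host), which is indistinguishable from a real network error. A dedicated error code or message for the “unknown peer” case would have pointed straight at the cause.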

What do you think?


Posted on July 8, 2014 by danblack — 2 Comments

Munin graphing of MySQL

There are many graphing tools out there, and we’ve used Munin for a while now.

The MySQL plugin for Munin had fallen out of date: the SHOW ENGINE INNODB STATUS output changed in 5.5, so some parts of the plugin simply didn’t work any more. SHOW GLOBAL STATUS also has some extra variables, so there was a need to create new graphs.

All of these are now in the 2.1.8+ development releases of Munin.

Here are samples of the new/updated graphs.

[Graph: Table Definitions]
[Graph: InnoDB Buffer Pool Activity]
[Graph: InnoDB Buffer Pool Internal Breakdown]
[Graph: InnoDB Buffer Pool]
[Graph: InnoDB Adaptive Hash Index]
[Graph: Execution (triggers and events)]
[Graph: Index Condition Pushdown]

Some of the graphs above may miss a variable or two with MariaDB 10 because of variable name changes. These will be corrected when we get to them. MariaDB 10 also makes a useful transition to information schema tables for status information, which will make it significantly easier to parse.
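If you haven’t looked inside a Munin plugin before, it is simply a program with two modes. When run with the argument config, it prints its graph definition. Otherwise, it prints the current values as field.value lines. The Python sketch below renders one illustrative graph from a dict of SHOW GLOBAL STATUS values. The graph title, field names and sample numbers are made up for the example and are not what the Munin MySQL plugin itself uses. A status variable that doesn’t exist on the server (for instance after a rename) is reported as U, Munin’s marker for an unknown value.

```python
import sys

# Hypothetical sample of SHOW GLOBAL STATUS output, already parsed into a dict.
STATUS = {
    "Innodb_buffer_pool_read_requests": "1843209",
    "Innodb_buffer_pool_reads": "5120",
}

# Illustrative graph definition: Munin field name -> status variable.
FIELDS = {
    "read_requests": "Innodb_buffer_pool_read_requests",
    "disk_reads": "Innodb_buffer_pool_reads",
}


def munin_output(mode: str, status: dict[str, str]) -> str:
    """Return what the plugin would print for 'config' or a normal fetch."""
    lines = []
    if mode == "config":
        lines += [
            "graph_title InnoDB buffer pool reads (example)",
            "graph_vlabel reads per ${graph_period}",
            "graph_category mysql",
        ]
        for field, var in FIELDS.items():
            lines += [
                f"{field}.label {var}",
                f"{field}.type DERIVE",  # counters only grow; graph the rate
                f"{field}.min 0",
            ]
    else:
        for field, var in FIELDS.items():
            # A variable missing on this server version is reported as unknown.
            lines.append(f"{field}.value {status.get(var, 'U')}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(munin_output(sys.argv[1] if len(sys.argv) > 1 else "fetch", STATUS))
```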

Individual buffer pool information has also been parsed out, but we haven’t yet worked out how to graph it correctly. Also not yet merged are a number of Galera graphs, which are currently waiting on some Galera provider changes.

We’ll continue to work with the Munin developers to keep this MySQL plugin up to date and useful.

There are other graphs in the MySQL Munin plugin that we haven’t changed, so they aren’t included here.


Posted on March 25, 2014 by Arjen Lentz

innodb_flush_log_at_trx_commit and Galera Cluster

We deploy Galera Cluster (in MariaDB) for some clients, and innodb_flush_log_at_trx_commit is one of the settings we’ve been playing with. The options according to the manual:

=0 don’t write or flush at commit, write and flush once per second
=1 write and flush at trx commit
=2 write log, but only flush once per second

The flush (fsync) refers to the mechanism the filesystem uses to try and guarantee that written data is actually on the physical medium/device and not just in a buffer (of course cached RAID controllers, SANs and other devices use some different logic there, but it’s definitely written beyond the OS space).

In a non-cluster setup, you’d always want it to be =1 in order to be ACID compliant and that’s also InnoDB’s default. So far so good. For cluster setups, you could be more lenient with this as you require ACID on the cluster as a whole, not each individual machine – after all, if one machine drops out at any point, you don’t lose any data.

The Codership documentation recommended =2, so that’s what Open Query engineer Peter Lock initially used for some tests he was conducting. However, performance wasn’t particularly shiny; in fact it was not much higher than with =1. That in itself is interesting, because we typically regard the number of fsyncs per second a storage system can deal with as a key indicator of performance capacity. That is, as our HD Latency tool shows when you run it on a storage device (even your local laptop hard disk), the most prominent factor limiting the number of writes you can do per second appears to be the fsyncs.

I then happened to chat with Oli Sennhauser (former colleague from MySQL AB) who now runs the FromDual MySQL/MariaDB consulting firm in Switzerland, and he’s been working with Galera for quite a long time. He recognised the pattern and said that he too had that experience, and he thought =0 might be the better option.

I delved into the InnoDB source code to see what was actually happening, and the code indeed concurs with what’s described in the manual (that hasn’t always been the case ;-). I also verified this with Jeremy Cole, whom we may happily regard as a guru on “how InnoDB actually works”. The once-per-second flush (and optional preceding write) is performed by the InnoDB master thread. Take a peek in log/log0log.c and trx/trx0trx.c, specifically trx_commit_off_kernel() and srv_sync_log_buffer_in_background().

In conclusion:

Even with =0, the log does get written and flushed once per second. This is done in the background so connection threads don’t have to wait for it.
There is no setting where there is never a flush/fsync.
With =2, the writing of the log takes place in the connection thread and this appears to incur a significant overhead, at least relative to =0. Aside from the writing of the log at transaction commit, there doesn’t appear to be a difference.
Based on the preceding points, I would say that if you don’t want =1, you might as well set =0 in order to get the performance you’re after. There is of course a slight practical difference between =0 and =2. With =2 the log is immediately written. If the mysqld process were to crash within a second after that, the OS would close the file and have that log write stored. With =0 that log data wouldn’t have been written. If the OS or machine fails, that log write is lost either way.
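The practical differences in the last point can be summarised as a small truth table. The Python sketch below models the behaviour described in the manual and above; it is not InnoDB code. It asks whether a transaction committed just before a failure survives. It assumes the failure strikes before the next once-per-second background write and flush, and it ignores write caches in RAID controllers, SANs and similar devices.

```python
from enum import Enum


class Failure(Enum):
    MYSQLD_CRASH = "mysqld process crash"
    OS_CRASH = "OS or machine failure"


# innodb_flush_log_at_trx_commit -> (log written to the OS at commit, fsynced at commit)
AT_COMMIT = {
    0: (False, False),  # both left to the master thread, once per second
    1: (True, True),    # write and flush at commit (the default)
    2: (True, False),   # write at commit, flush once per second
}


def commit_survives(setting: int, failure: Failure) -> bool:
    """Does a just-committed transaction survive, before the next 1s flush?"""
    written, flushed = AT_COMMIT[setting]
    if flushed:
        return True
    if failure is Failure.MYSQLD_CRASH:
        # The OS still holds whatever was written and will get it to disk.
        return written
    # Unflushed log data is lost when the OS or machine goes down.
    return False


for setting in sorted(AT_COMMIT):
    row = ", ".join(
        f"{failure.value}: {'kept' if commit_survives(setting, failure) else 'lost'}"
        for failure in Failure
    )
    print(f"={setting} -> {row}")
```

Only =1 keeps the transaction through an OS or machine failure, and =2 differs from =0 only for a mysqld crash. In a Galera cluster the transaction also exists on the other nodes, which is why losing it locally is acceptable there.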

In production environments, we tend to mainly want to mitigate trouble from system failures, so =0 appears to be a suitable/appropriate option – for a Galera cluster environment.

What remains is the question of why the log write operation appears to reduce transaction commit performance so much, seemingly more so than the flush/fsync itself. Something to investigate further!
Your thoughts?


Posted on November 12, 2013 by Arjen Lentz

Mixing databases usually not optimal

Dan McKinley (Etsy) wrote an (IMHO) insightful article, “Why MongoDB Never Worked at Etsy”.

First off, it’s important to realise that it’s not a snipe at MongoDB – it’s a fine tool.

The lessons relate to mixing multiple databases in a deployment (administration and monitoring overhead) and to the acknowledgement that issues of schema design, scalability and maintenance need attention regardless of which brand or technology you pick for your database. That comes back to the old insight that migrations are rarely worth it (regardless of what you migrate from or to).

I think these are indeed important considerations as they have a major impact on the ongoing costs of your entire environment (production as well as development and testing) – these days we encounter the “we’re doing this part of our application using MongoDB” approach quite often, so it’s useful to read about and learn from other people’s experience.

With MongoDB there is a particular extra issue to consider, and Dan McKinley also mentions it in his post. NoSQL databases are often also schema-less. However, to keep your data manageable when it grows to significance, you do need to structure it somehow – that is, you need to make sure that (and I’ll just use generic terminology here) in a specific set of records each record contains the required fields. If you don’t, at some point things become unmanageable (or your data ends up as a pile of unusable bits).

Thus, you’re dealing with some form of schema, whether you call it that or not. And you might deal with it in application logic or through some toolkit, rather than in the database itself, but it can’t just be ignored or disregarded. And that’s critical, as often going to a schema-less database is presented as a “then you don’t need to worry about that” change. You do need to “worry” about it: you can pick where the most suitable place is for your needs. If you look at it in that way, you can make an appropriate choice for the particular application at hand.
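As a minimal sketch of what “dealing with it in application logic” can look like, assume records arrive as Python dicts. The REQUIRED_FIELDS mapping and the field names are hypothetical, and a real application might use a validation toolkit instead. The point is that the schema still exists; it has just moved out of the database.

```python
# Hypothetical application-side schema for one set of records.
REQUIRED_FIELDS = {"id": int, "email": str, "created_at": str}


def schema_errors(record: dict) -> list[str]:
    """List what is wrong with one record; an empty list means it conforms."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field {field!r}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field!r} should be {expected.__name__}")
    return errors


print(schema_errors({"id": "1", "email": "a@example.com"}))
```

Every write path has to call a check like this (or go through a layer that does), which is exactly the kind of work a database schema would otherwise do for you.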


Posted on November 6, 2013 (updated April 22, 2014) by Arjen Lentz

Luxbet, MariaDB and Melbourne Cup

Yesterday was Melbourne Cup day in Australia – the biggest annual horse race event in the country, and in the state of Victoria it’s even a public holiday.

Open Query does work for Luxbet (part of Tabcorp), and Melbourne Cup day is by far their biggest day of the year in terms of traffic. It’s not just a big spike: there’s an orders-of-magnitude difference, so you can really say that the rest of the year is downright quiet (in relative terms). So, a very interesting load pattern.

Since last year Luxbet has upgraded from stock MySQL to MariaDB and, with our input, made some other infrastructure modifications. These include moving to a pure solid state storage (FusionIO) solution, because a SAN just won’t deliver the resilience and performance required. This may seem odd, but remember that a) a SAN is also a single point of failure (so when the SAN fails, multiple db servers will be “out”, which is not desirable even though a failover to another datacenter is possible), and b) MariaDB/XtraDB (InnoDB) already keeps all recent data and indexes in RAM, so whatever I/O is required won’t benefit from a SAN cache. Thus, the SAN will have to actually do a physical disk seek and read to get what is needed, and we all know seeks are slow. A write or fsync also incurs some latency, regardless of the storage array speed.

So those are the reasons for the local storage solution. While there are aspects of RAID and other redundancy in that setup, the main resilience in the infrastructure comes from having more machines, rather than necessarily having more redundancy in each machine.

Grant is working on a more comprehensive version of this story.


Posted on October 30, 2013 by Arjen Lentz — 2 Comments

MySQL Connector/Arduino

Chuck Bell, one of my former colleagues from MySQL AB, has created a connector for Arduino to MySQL. This allows Arduino code to be a direct client of a MySQL or MariaDB server, with Ethernet and WiFi shields supported.

With Arduino boards being used more and more, this can come in really handy – not only for retrieving (for instance) centralised configuration data, but also for logging. Useful stuff. Thanks Chuck!

Links

Introducing MySQL Connector/Arduino 1.0.0 beta (original article)
MySQL Connector/Arduino on Launchpad
Freetronics (Arduino gear)



Posted on June 6, 2013 by Arjen Lentz — 11 Comments

Hint of the day: noatime and relatime in fstab

It’s been written about everywhere, but since we keep spotting installations in the wild where people don’t know about it, it probably deserves another mention.

Traditionally, Linux used the atime option on a disk mount, which means it writes a timestamp (i.e. a write to the drive) every time it reads anything. (Kernels since 2.6.30 default to relatime, discussed below, which reduces these writes but doesn’t eliminate them.) So in this case, reads cause writes, and also disk seeks, because reading a file then triggers a write to that file’s inode, where the access time is stored. This even occurs if a file is read from the file system’s page cache (reading from the machine’s memory rather than the drive).

Unless you require an audit trail of users reading files, you generally don’t want this. Thus, you want to add the noatime option to the disk mount in /etc/fstab. If you have just defaults there, you make it defaults,noatime. It doesn’t necessarily require a reboot, as you can use umount/mount, but that gets tricky with the root filesystem, so a reboot is generally easier. Setting these options is one of the first things we do when configuring a server.
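The edit itself is just an extra entry in the options (fourth) column of an fstab line. The Python sketch below shows that transformation on a single line. The device and mount point in the example are hypothetical, and in practice you would simply edit the file by hand.

```python
ATIME_POLICIES = ("atime", "relatime", "strictatime", "noatime")


def add_noatime(fstab_line: str) -> str:
    """Return the fstab line with noatime in its options field (fields re-joined with tabs)."""
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line  # leave blank lines and comments alone
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line  # no options column to modify
    # drop any other access-time policy so noatime is the only one left
    options = [o for o in fields[3].split(",") if o not in ATIME_POLICIES]
    options.append("noatime")
    fields[3] = ",".join(options)
    return "\t".join(fields)


print(add_noatime("/dev/sdb1 /var/lib/mysql ext4 defaults 0 2"))
```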

Some user applications, such as Mutt (mail reader), do use the read access time. In that case, you can use the relatime option instead. On a read, relatime only updates the access time if the file has been modified since it was last accessed (and, on newer kernels, also if the stored access time is more than a day old). This is just for completeness of this story, as it’s still sub-optimal for a database server.
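The relatime rule is easiest to see as a predicate. This sketch follows the behaviour of Linux kernels since 2.6.30 as we understand it. On a read, the access time is updated only if it is not newer than the modification or change time, or if it is at least a day old. Timestamps are plain seconds here, purely for illustration.

```python
DAY = 24 * 60 * 60


def relatime_updates_atime(atime: float, mtime: float, ctime: float, now: float) -> bool:
    """Would a read update the access time on a relatime mount?"""
    if mtime >= atime or ctime >= atime:
        # modified or changed since the last access (what Mutt's new-mail check relies on)
        return True
    # otherwise the access time is refreshed at most about once a day
    return now - atime >= DAY
```

With noatime, by contrast, a read never updates the access time, which is why noatime remains the better choice for a database server.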

If you require read details for auditing (security) of the operating system, make sure all database-related files (database directories, InnoDB log files, binary logs, etc) are on a separate mount where you can use noatime.

Using noatime also makes a lot of sense on a web server, as it does a lot of reads. Remember, the fact that most files are in the filesystem cache doesn’t make a difference. As a general guide, it makes sense to set it on most server installations. Quick win.


Posted on May 13, 2013 by Arjen Lentz — 5 Comments

LEVENSHTEIN MySQL stored function

At Open Query we steer clear of code development for clients. We sometimes advise on code, but as a company we don’t want to be in the programmer role. Naturally we do write scripts and other necessities to do our job.

Assisting with an Open Source project, I encountered three old UDFs. User Defined Functions are native functions that are compiled and then loaded by the server, similar to a plugin. As with plugins, compiling can be a pest, as it requires some of the MySQL server header files and build switches matching the server the UDF is going to be loaded into. Consequently, binaries cannot be considered safely portable, which means you don’t really want a project to rely on UDFs, as it can hinder adoption quite severely.

Since MySQL 5.0 we can also use SQL stored functions and procedures. Slower, of course, but functional and portable. By the way, there’s one thing you can do with UDFs that you (at least currently) can’t do with stored functions, and that’s create a new aggregate function (like SUM or COUNT).

Two of the functions were very specific to the app, but the third was a basic Levenshtein implementation. A quick Google search showed that there were existing SQL and even MySQL stored function implementations, most derived from a single origin which was actually broken (and the link is now dead, as well). I grabbed one that appeared functional, reformatted it for readability, and then cleaned it up a bit, as it was doing some things in a convoluted way. Given that the stored function is going to be much slower than a native function anyway, doing things inefficiently inside loops can really hurt.

The result is below. Feel free to use, and if you spot a bug or can improve the code further, please let me know!
Given the speed issue, I’m actually thinking this should perhaps be added as a native function in MariaDB. What do you think?

-- core levenshtein function adapted from
-- function by Jason Rust (http://sushiduy.plesk3.freepgs.com/levenshtein.sql)
-- originally from http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/
-- rewritten by Arjen Lentz for utf8, code/logic cleanup and removing HEX()/UNHEX() in favour of ORD()/CHAR()
-- Levenshtein reference: http://en.wikipedia.org/wiki/Levenshtein_distance

-- Arjen note: because the levenshtein value is encoded in a byte array, distance cannot exceed 255;
-- thus the maximum string length this implementation can handle is also limited to 255 characters.

DELIMITER $$
DROP FUNCTION IF EXISTS LEVENSHTEIN $$
CREATE FUNCTION LEVENSHTEIN(s1 VARCHAR(255) CHARACTER SET utf8, s2 VARCHAR(255) CHARACTER SET utf8)
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR CHARACTER SET utf8;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);

    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;

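    -- comparisons follow the utf8 collation (case-insensitive by default) and ignore
    -- trailing spaces, so also check the lengths before taking this shortcut
    IF (s1 = s2 AND s1_len = s2_len) THEN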
      RETURN (0);
    ELSEIF (s1_len = 0) THEN
      RETURN (s2_len);
    ELSEIF (s2_len = 0) THEN
      RETURN (s1_len);
    END IF;

    WHILE (j <= s2_len) DO
      SET cv1 = CONCAT(cv1, CHAR(j)),
          j = j + 1;
    END WHILE;

    WHILE (i <= s1_len) DO
      SET s1_char = SUBSTRING(s1, i, 1),
          c = i,
          cv0 = CHAR(i),
          j = 1;

      WHILE (j <= s2_len) DO
        SET c = c + 1,
            cost = IF(s1_char = SUBSTRING(s2, j, 1), 0, 1);

        SET c_temp = ORD(SUBSTRING(cv1, j, 1)) + cost;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET cv0 = CONCAT(cv0, CHAR(c)),
            j = j + 1;
      END WHILE;

      SET cv1 = cv0,
          i = i + 1;
    END WHILE;

    RETURN (c);
  END $$

DELIMITER ;
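When testing or modifying the stored function, it helps to have a reference to compare against. The Python sketch below implements the same two-row dynamic programming approach: prev plays the role of cv1 (the previous row) and cur the role of cv0 (the row being built). It uses lists instead of a byte string, so it has no 255 limit. Unlike the SQL version, it compares characters exactly (case-sensitive). The stored function’s utf8 parameters use the character set’s default collation, normally the case-insensitive utf8_general_ci.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance using two rows, mirroring the structure of the SQL function."""
    if s1 == s2:
        return 0
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, ch1 in enumerate(s1, start=1):
        cur = [i]  # first column: i deletions
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            cur.append(min(cur[j - 1] + 1,      # insertion
                           prev[j - 1] + cost,  # substitution (or match)
                           prev[j] + 1))        # deletion
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

For example, levenshtein('kitten', 'sitting') returns 3, and SELECT LEVENSHTEIN('kitten', 'sitting'); against the stored function should give the same result.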
