Posted on

MariaDB Galera cluster and GTID

In MariaDB 10.2.12, these two don’t yet work together. GTID = Global Transaction ID.  In the master-slave asynchronous replication realm, this means that you can reconnect a slave to another server (change its master) and it’ll happily continue replicating from the correct point.  No more fussing with filenames and offsets (which of course will both differ on different machines).

So in concept the GTIID is “globally” unique – that means it’s consistent across an entire infra: a binlogged write transaction will have the same GTID no matter on which machine you look at it.

  • OK: if you are transitioning from async replication to Galera cluster, and have a cluster as slave of the old infra, then GTID will work fine.
  • PROBLEM: if you want to run an async slave in a Galera cluster, GTID will currently not work. At least not reliably.

The overview issue is MDEV-10715, the specific problem is documented in MDEV-14153 with some comments from me from late last week. MDEV-14153 documents cases where the GTID is not in fact consistent – and the way in which it isn’t is most disturbing.

The issue appears as “drift”. A GTID is made up of R-S-# where R is replication domain (0 unless set by an app), S for server-id where the write was originally done, and # which is just a number. The only required rule for the # is that that each next event has to have a higher number than the previous.  In principle there could be #s missing, that’s ok.

In certain scenarios, the # part of the GTID falls behind on the “other nodes” in the Galera cluster. There was the node where the statement was first issued, and then there are the other nodes which pick up the change through the Galera (wsrep) cluster mechanism. Those other nodes.  So at that point, different nodes in the cluster have different GTIDs for the same query. Not so great.

To me, this looked like a giant red flag though: if a GTID is assigned on a commit, and then replicated through the cluster as part of that commit, it can’t change. Not drift, or any other change. So the only possible conclusion must be that it is in fact not passed through the cluster, but “reinvented” by a receiving cluster node, which simply assumes that the current event from a particular server-id is previous-event id + 1.  That assumption is false, because as I mentioned above it’s ok for gaps to exist.  As long as the number keeps going up, it’s fine.

Here is one of the simplest examples of breakage (extract from a binlog, with obfuscated table names):

# at 12533795
#180704 5:00:22 server id 1717 end_log_pos 12533837 CRC32 0x878fe96e GTID 0-1717-1672559880 ddl
/*!100001 SET @@session.gtid_seq_no=1672559880*//*!*/;
# at 12533837
#180704 5:00:22 server id 1717 end_log_pos 12534024 CRC32 0xc6f21314 Query thread_id=4468 exec_time=0 error_code=0
SET TIMESTAMP=1530644422/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
DROP TEMPORARY TABLE IF EXISTS `qqq`.`tmp_foobar` /* generated by server */
/*!*/;

Fact: temporary tables are not replicated (imagine restarting a slave, it wouldn’t have whatever temporary tables were supposed to exist). So, while this event is stored in the binary log (which it is to ensure that if you replay the binlog on a machine, it correctly drops the temporary table after creating and using it), it won’t go through a cluster.  Remember that Galera cluster is essentially a ROW-based replication scheme; if there are changes in non-temporary tables, of course they get replicated just fine.  So if an app creates a temporary table, does some calculations, and then inserts the result of that into a regular table, the data of that last bit will get replicated. As it should. In a nutshell, as far as data consistency goes, we’re all fine.

But the fact that we have an event that doesn’t really get replicated creates the “fun” in the “let’s assume the next event is just the previous + 1” logic. This is where the drift comes in. Sigh.

In any case, this issue needs to be fixed by let’s say “being re-implemented”: the MariaDB GTID needs to be propagated through the Galera cluster, so it’s the same on every server, as it should be. Doing anything else is always going to go wrong somewhere, so trying to catch more cases like the above example is not really the correct way to go.

If you are affected by this or related problems, please do vote on the relevant MDEV issues. That is important!  If you need help tracking down problems, feel free to ask.  If you have more information on the matter, please comment too!  I’m sure this and related bugs will be fixed, there are very capable developers at MariaDB Corp and Codership Oy (the Galera company). But the more information we can provide, the better. It often helps with tracking down problems and creating reproducible test cases.

Posted on
Posted on

Optimising multi-threaded replication

Multi-threaded replication is a new feature introduced in MySQL 5.6 and MariaDB 10.0. In traditional single-threaded replication, the slaves have a disadvantage as they have to process in sequence what a master executed in parallel. This, plus the fact that slaves usually have a lot of read-only connections to deal with as well, can easily create performance problems. That is, a single-threaded slave needs to be set to allow fewer connections, otherwise there’s a higher risk of it not being able to keep up with the replication stream. There is no exact rule for this, as it relates to general I/O capacity and fsync latency, as well as general CPU and RAM considerations and query patterns.

Currently, it appears that the MariaDB implementation is a bit more mature in terms of design and effective implementation. For instance, MySQL 5.6 does not currently support retrying transactions while doing parallel replication. This can easily cause problems as commit conflicts are possible and obviously need to be handled. So for the purpose of this blog post, we’re going to focus on MariaDB 10.0, and it is what we currently use with some of our clients. MariaDB developer Kristian Nielsen has done awesome work and is very responsive to questions and bug reports. Rock on, Kristian!

The fundamental challenge for parallel replication is that some queries are safe to be executed in parallel, and some are not – and somehow, the server needs to know which is which. MariaDB employs two strategies to assist with this:

  1. Group commit. Since 5.5, transactions (remember, a standalone statement without START TRANSACTION/COMMIT is technically also a transaction) that happen around the same time in different connections are grouped in the binary log and effectively committed together. This is accomplished by the server trying to gather at least a certain number of transactions (binlog_commit_wait_count) and having individual connections wait just a fraction (binlog_commit_wait_usec) to increase the chances of gathering a nice number. This strategy reduces I/O and fsyncs, and thus helps quite a bit with write scaling. The miniscule delay that a transaction might incur because it has to wait is easily offset by the overall better performance. It’s good stuff. For the purpose of parallel replication, any transactions in the same group commit can in principle be executed in parallel on a slave – conflicts are possible, so deadlock handling and retries are essential.
  2. Global Transaction IDs (GTID) Domain IDs (gtid_domain_id) in MariaDB 10.0, which an application can set within a connection. Quite often, different applications and different components of applications use the same database server, but their actions are completely independent: no write operations will ever conflict between the different applications. GTID Domain IDs allows us to tell the server about this, allowing it to always run those transactions in parallel even if they weren’t part of the same group commit! Now that’s a real bonus!

Now, as a practicality, we’re not always able to modify applications to for instance set the GTID Domain ID. Plus, a magic (integer) number is required and so we need some planning/coordination between completely independent applications! Through database server consolidation, you may get applications on your server that were previously on a different one – strictly speaking having two applications use the same GTID Domain ID is harmless (after all, by default all transactions run in the same domain!) but obviously it doesn’t improve performance.

Open Query engineer Daniel Black and I came up with the following. It’s a combination of MySQL’s init_connect system variable (gets called when a user connects, except if they have SUPER privilege), a few stored procedures, and an event to keep the domain map reasonably up-to-date. The premise of this implementation is that each database username uniquely identifies an application, and that no two usernames refer to the same application. So, if you have for instance a general application user but also one for background scripts or one with special administrative privileges, then you need to modify the code in setdomain() a bit to take this into account. If you have transactions with a different GTID Domain ID execute in parallel on the same database, obviously this can cause conflicts. The MariaDB slave threads will retry, but in some cases conflicts cannot be resolved by retrying.

Obviously it’s not perfect, but it does resolve the issue for many situations. Feedback and improvements welcome!

# Automatic GTID Domain IDs for MariaDB 10.0
# Copyright (C) 2014 by Daniel Black & Arjen Lentz, Open Query Pty Ltd (http://openquery.com.au)
# Version 2014-11-18, initial publication via OQ blog (https://openquery.com.au/blog/)
#
# This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International
# http://creativecommons.org/licenses/by-sa/4.0/

USE mysql
DELIMITER //

DROP PROCEDURE IF EXISTS setdomain //
CREATE PROCEDURE setdomain(IN cuser varchar(140)) DETERMINISTIC READS SQL DATA SQL SECURITY DEFINER
BEGIN
  DECLARE EXIT HANDLER FOR NOT FOUND SET SESSION gtid_domain_id=10;
# modify this logic for your particular application/user naming convention
  SELECT domain INTO @l_gtid_domain_id
    FROM mysql.user_domain_map
   WHERE user=LEFT(cuser, LOCATE('@',cuser) -1 );

  SET SESSION gtid_domain_id=@l_gtid_domain_id;
END //

DROP PROCEDURE IF EXISTS create_user_domain_map //
CREATE PROCEDURE create_user_domain_map() MODIFIES SQL DATA
BEGIN
  DECLARE u CHAR(80);
  DECLARE h CHAR(60);
  DECLARE userhostcur CURSOR FOR SELECT user,host FROM mysql.user;
  DECLARE EXIT HANDLER FOR NOT FOUND FLUSH PRIVILEGES;

  CREATE TABLE IF NOT EXISTS mysql.user_domain_map
  (
    domain INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user CHAR(80) COLLATE utf8_bin NOT NULL UNIQUE
  ) AUTO_INCREMENT=10, ENGINE=InnoDB;

  INSERT IGNORE INTO mysql.user_domain_map(user)
         SELECT user FROM mysql.user;

  OPEN userhostcur;
  LOOP FETCH userhostcur INTO u,h;
    INSERT IGNORE INTO mysql.procs_priv(Host,Db,User, Routine_name, Routine_type, Grantor, Proc_priv)
           VALUES(h, 'mysql', u, 'setdomain', 'PROCEDURE', CURRENT_USER(), 'Execute');
  END LOOP;
END;//

DELIMITER ; 

# (re)create the user domain map
CALL create_user_domain_map(); 

# set up event schedule
CREATE EVENT update_user_domain_map ON SCHEDULE EVERY 1 DAY DO CALL create_user_domain_map(); 

# also set this in my.cnf so it's persistent
# init_connect='CALL mysql.setdomain(current_user());'
SET GLOBAL init_connect='CALL mysql.setdomain(current_user());';
Posted on