Multi-threaded replication is a new feature introduced in MySQL 5.6 and MariaDB 10.0. In traditional single-threaded replication, the slaves have a disadvantage as they have to process in sequence what a master executed in parallel. This, plus the fact that slaves usually have a lot of read-only connections to deal with as well, can easily create performance problems. That is, a single-threaded slave needs to be set to allow fewer connections, otherwise there’s a higher risk of it not being able to keep up with the replication stream. There is no exact rule for this, as it relates to general I/O capacity and fsync latency, as well as general CPU and RAM considerations and query patterns.
Currently, the MariaDB implementation appears to be a bit more mature in both design and effectiveness. For instance, MySQL 5.6 does not currently support retrying transactions while doing parallel replication. This can easily cause problems, as commit conflicts are possible and obviously need to be handled. So for the purpose of this blog post we’re going to focus on MariaDB 10.0, which is what we currently use with some of our clients. MariaDB developer Kristian Nielsen has done awesome work and is very responsive to questions and bug reports. Rock on, Kristian!
The fundamental challenge for parallel replication is that some queries are safe to be executed in parallel, and some are not – and somehow, the server needs to know which is which. MariaDB employs two strategies to assist with this:
- Group commit. Since 5.5, transactions (remember, a standalone statement without START TRANSACTION/COMMIT is technically also a transaction) that happen around the same time in different connections are grouped in the binary log and effectively committed together. This is accomplished by the server trying to gather at least a certain number of transactions (binlog_commit_wait_count), having individual connections wait a short while (binlog_commit_wait_usec, in microseconds) to increase the chances of gathering a nice number. This strategy reduces I/O and fsyncs, and thus helps quite a bit with write scaling. The minuscule delay that a transaction might incur because it has to wait is easily offset by the overall better performance. It’s good stuff. For the purpose of parallel replication, any transactions in the same group commit can in principle be executed in parallel on a slave – conflicts are possible, so deadlock handling and retries are essential.
- GTID Domain IDs (gtid_domain_id), part of the Global Transaction ID (GTID) implementation in MariaDB 10.0, which an application can set within a connection. Quite often, different applications and different components of applications use the same database server, but their actions are completely independent: no write operations will ever conflict between the different applications. GTID Domain IDs allow us to tell the server about this, allowing it to always run those transactions in parallel even if they weren’t part of the same group commit! Now that’s a real bonus! (See the configuration sketch right after this list.)
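To make this concrete, here is a minimal sketch of the settings involved. The values are purely illustrative, not recommendations, and you’d normally also put the GLOBAL ones in my.cnf so they survive a restart:

-- On the master: encourage group commit so more transactions land in the same group
SET GLOBAL binlog_commit_wait_count = 10;    -- illustrative value
SET GLOBAL binlog_commit_wait_usec  = 5000;  -- illustrative value, in microseconds

-- On the slave: enable the parallel applier (MariaDB 10.0); the slave threads must be stopped
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;       -- illustrative value
START SLAVE;

-- In an application's connection: declare its writes independent of other applications
SET SESSION gtid_domain_id = 3;              -- illustrative value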
Now, as a practicality, we’re not always able to modify applications to, for instance, set the GTID Domain ID. Plus, a magic (integer) number is required, so we need some planning/coordination between completely independent applications! Through database server consolidation, you may get applications on your server that were previously on a different one. Strictly speaking, having two applications use the same GTID Domain ID is harmless (after all, by default all transactions run in the same domain!), but obviously it doesn’t improve performance.
Open Query engineer Daniel Black and I came up with the following. It’s a combination of MySQL’s init_connect system variable (which is executed whenever a user connects, unless they have the SUPER privilege), a few stored procedures, and an event to keep the domain map reasonably up-to-date. The premise of this implementation is that each database username uniquely identifies an application, and that no two usernames refer to the same application. So, if you have for instance a general application user but also one for background scripts or one with special administrative privileges, you will need to modify the logic in setdomain() a bit to take this into account. If transactions with different GTID Domain IDs execute in parallel on the same database, conflicts can obviously occur. The MariaDB slave threads will retry, but in some cases a conflict cannot be resolved by retrying.
Obviously it’s not perfect, but it does resolve the issue for many situations. Feedback and improvements welcome!
# Automatic GTID Domain IDs for MariaDB 10.0
# Copyright (C) 2014 by Daniel Black & Arjen Lentz, Open Query Pty Ltd (http://openquery.com.au)
# Version 2014-11-18, initial publication via OQ blog (https://openquery.com.au/blog/)
#
# This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International
# http://creativecommons.org/licenses/by-sa/4.0/

USE mysql

DELIMITER //

DROP PROCEDURE IF EXISTS setdomain //
CREATE PROCEDURE setdomain(IN cuser varchar(140))
  DETERMINISTIC
  READS SQL DATA
  SQL SECURITY DEFINER
BEGIN
  DECLARE EXIT HANDLER FOR NOT FOUND SET SESSION gtid_domain_id=10;

  # modify this logic for your particular application/user naming convention
  SELECT domain INTO @l_gtid_domain_id
    FROM mysql.user_domain_map
   WHERE user=LEFT(cuser, LOCATE('@',cuser) -1 );
  SET SESSION gtid_domain_id=@l_gtid_domain_id;
END //

DROP PROCEDURE IF EXISTS create_user_domain_map //
CREATE PROCEDURE create_user_domain_map()
  MODIFIES SQL DATA
BEGIN
  DECLARE u CHAR(80);
  DECLARE h CHAR(60);
  DECLARE userhostcur CURSOR FOR SELECT user,host FROM mysql.user;
  DECLARE EXIT HANDLER FOR NOT FOUND FLUSH PRIVILEGES;

  CREATE TABLE IF NOT EXISTS mysql.user_domain_map (
    domain INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user CHAR(80) COLLATE utf8_bin NOT NULL UNIQUE
  ) AUTO_INCREMENT=10, ENGINE=InnoDB;

  INSERT IGNORE INTO mysql.user_domain_map(user)
    SELECT user FROM mysql.user;

  OPEN userhostcur;
  LOOP
    FETCH userhostcur INTO u,h;
    INSERT IGNORE INTO mysql.procs_priv(Host,Db,User, Routine_name, Routine_type, Grantor, Proc_priv)
      VALUES(h, 'mysql', u, 'setdomain', 'PROCEDURE', CURRENT_USER(), 'Execute');
  END LOOP;
END;//

DELIMITER ;

# (re)create the user domain map
CALL create_user_domain_map();

# set up event schedule
CREATE EVENT update_user_domain_map ON SCHEDULE EVERY 1 DAY DO CALL create_user_domain_map();

# also set this in my.cnf so it's persistent
# init_connect='CALL mysql.setdomain(current_user());'
SET GLOBAL init_connect='CALL mysql.setdomain(current_user());';
I recently wrote about non-deterministic queries in the replication stream. That’s resolved by using either MIXED or ROW based replication rather than STATEMENT based.
Another thing that’s not fully handled by STATEMENT based replication is temporary tables. Imagine the following:
- Master: CREATE TEMPORARY TABLE rpltmpbreak (i INT);
- Wait for slave to replicate this statement, then stop and start mysqld (not just STOP/START SLAVE)
- Master: INSERT INTO rpltmpbreak VALUES (1);
- Slave: SHOW SLAVE STATUS \G
If for any reason a slave server shuts down and restarts after the temp table creation, replication will break because the temporary table will no longer exist on the restarted slave server. It’s obvious when you think about it, but nevertheless it’s quite annoying.
A long time ago (early 2007, when I was still working at MySQL AB) I filed a bug report on this. It’s important to realise that back then, row based replication did exist but was so buggy that you wouldn’t recommend it, so the topic was quite relevant. For some reason the bug has remained open for over 6 years until some recent activity.
This is not an issue of determinism: most temporary table constructs are technically regarded as “safe” to replicate via statement based replication, so even with MIXED you will still find replication broken by the above scenario. Important to realise!
http://dev.mysql.com/doc/refman/5.5/en/replication-features-temptables.html (the obvious place to look) doesn’t really explain this well, but http://dev.mysql.com/doc/refman/5.5/en/replication-rbr-usage.html correctly states that ROW based replication doesn’t suffer from this problem as it replicates the values from the temporary table on the master rather than the statement, thus the slave doesn’t have to deal with the temporary table at all. I’ve suggested that the bug be changed to a documentation issue, updating the page on replication and temporary tables to properly explain the issue and point clearly and explicitly to the solution.
So, why would you ever use STATEMENT or MIXED rather than ROW based replication?
- Well, as I mentioned, row based replication wasn’t particularly reliable in earlier versions. At that time, for non-deterministic scenarios we recommended MIXED as a compromise (it only uses row based information in the replication stream when necessary, and statements the rest of the time). Many issues have been fixed over time and we can now generally say that row based replication is ok in recent versions of MySQL and MariaDB (5.5 or above, just to be sure). So if you’re replicating from an older master, STATEMENT or MIXED might still be preferable, as long as you know what the limitations are.
- Non-local replication (outside the datacenter) is vastly more efficient with STATEMENT based replication: if you’re updating 100,000 rows, it’s a single statement, whereas with ROW it’s 100,000 row events. So depending on bandwidth/cost and such, that might also be a relevant consideration.
If none of those considerations apply, ROW based replication might be the way to go now. But the really important thing to realise is that for each of the choices of STATEMENT, MIXED and ROW, there are advantages and consequences.
Do you have any other reasons for using STATEMENT or MIXED in your environment?
You might find a warning like the below in your error log:
130522 17:54:18 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-increment column after selecting from another table are unsafe because the order in which rows are retrieved determines what (if any) rows will be written. This order cannot be predicted and may differ on master and the slave.
Statement: INSERT INTO tbl2 SELECT * FROM tbl1 WHERE col IN (417,523)
What do MariaDB and MySQL mean with this warning? The server can’t guarantee that this exact query, with STATEMENT based replication, will always yield identical results on the slave.
Does that mean that you have to use ROW based (or MIXED) replication? Possibly, but not necessarily.
For this type of query, it primarily refers to the fact that without ORDER BY, rows have no order and thus a result set may show up in any order the server decides. Sometimes it’s predictable (depending on storage engine and index use), but that’s not something you want to rely on. You don’t have to ponder that, as an ORDER BY is never harmful.
Would ORDER BY col solve the problem? That depends!
If col is unique, yes. If col is not unique, then multiple rows could result and they’d still have a non-deterministic order. So in that case you’d need to ORDER BY col,anothercol to make it absolutely deterministic. The same of course applies if the WHERE clause only referred to a single col value: if multiple rows can match, then it’s not unique and it will require an additional column for the sort.
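As a sketch (assuming tbl1 has a unique column named id – a hypothetical name – alongside the non-unique col), the statement from the warning above could be made fully deterministic like this:

INSERT INTO tbl2
  SELECT * FROM tbl1
   WHERE col IN (417,523)
   ORDER BY col, id;  -- col alone is not unique, so id is added to fix the row order

Note that the server’s unsafe-statement check doesn’t look at the ORDER BY, so you may still see the warning – but the row order, and therefore any auto-increment values assigned in tbl2, will now be the same on master and slave.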
There are other query constructs where going to row based or mixed replication is the only way. But, just because the server tells you it can’t safely replicate a query with statement based replication, that doesn’t mean you can’t use statement based replication at all… there might be another way.
One of the first things we do when preparing a client’s infrastructure for Galera deployment is see whether their schema is suitable.
- Avoiding quirks and edge cases, we can say that Galera simply requires all tables to be InnoDB and also have a PRIMARY KEY (obviously having a PK in InnoDB is important anyway, for InnoDB-internal reasons).
- We want to know about FULLTEXT indexes. With recent InnoDB versions also supporting FULLTEXT we need to check not just whether a table has such an index, but actually which engine it is.
- Spatial indexes. While both InnoDB and MyISAM can deal with spatial datatypes (POINT, GEOMETRY, etc), only MyISAM has the spatial indexes.
Naturally, checking a schema in the server is more effective than going through other sources and possibly missing bits. On the downside, the only viable way to get this info out of MariaDB is INFORMATION_SCHEMA, but because of the way it’s implemented queries tend to be slow and resource intensive. So essentially we do need to ask I_S, but do it as efficiently as possible (we’re dealing with production systems). We have multiple separate questions to ask, which normally we’d ask in separate queries, but in case of I_S that’s really something to avoid. So that’s why it’s all integrated into the single query below, catching every permutation of “not InnoDB”, “lacks primary key”, “has fulltext or spatial index”. We skip the system databases and any VIEWs.
We use the lesser known mysql client command ‘tee’ to output the data into a file, and close it after the query.
We publish the query not as a work of art – I don’t think it’s that pretty! We’d like you to see it because we don’t care for secrets, and also because if there is any way you can reach the same objective using a less resource intensive approach, we’d love to hear about it! This is one of the very few cases where we care only about efficiency, not how pretty the query looks. That said, of course I’d prefer it to be easily readable.
If you regard it purely as a query to be used for Galera, then you can presume it’ll be run on MariaDB 5.5 or later – since 5.3 and above have optimised subqueries, perhaps you can do something with that.
If you spot any other flaw or missing bit, please comment on that too. Thanks!
-- snip
tee galeracheck.txt

SELECT DISTINCT CONCAT(t.table_schema,'.',t.table_name) AS tbl,
       t.engine,
       IF(ISNULL(c.constraint_name),'NOPK','') AS nopk,
       IF(s.index_type = 'FULLTEXT','FULLTEXT','') AS ftidx,
       IF(s.index_type = 'SPATIAL','SPATIAL','') AS gisidx
  FROM information_schema.tables AS t
  LEFT JOIN information_schema.key_column_usage AS c
    ON (t.table_schema = c.constraint_schema AND t.table_name = c.table_name
        AND c.constraint_name = 'PRIMARY')
  LEFT JOIN information_schema.statistics AS s
    ON (t.table_schema = s.table_schema AND t.table_name = s.table_name
        AND s.index_type IN ('FULLTEXT','SPATIAL'))
 WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
   AND t.table_type = 'BASE TABLE'
   AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL
        OR s.index_type IN ('FULLTEXT','SPATIAL'))
 ORDER BY t.table_schema,t.table_name;

notee
-- snap
Credit: the “find tables without a PK” part is based on SQL by Sheeri Cabral and Giuseppe Maxia.
From a Tumblr engineering blog post:
Tumblr is one of the largest users of MySQL on the web. At present, our data set consists of over 60 billion relational rows, adding up to 21 terabytes of unique relational data. Managing over 200 dedicated database servers can be a bit of a handful, so naturally we engineered some creative solutions to help automate our common processes.
Today, we’re happy to announce the open source release of Jetpants, Tumblr’s in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.
Dynamically resizable range-based sharding allows you to scale MySQL horizontally in a robust manner, without any need for a central lookup service or massive pre-allocation of tiny shards. Jetpants supports this range-based model by providing a fast way to split shards that are approaching capacity or I/O limitations. On our hardware, we can split a 750GB, billion-row pool in half in under six hours.
Good work Tumblr, excellent move to open up your tools: you’re bound to get good feedback and bug catches/fixes from users in other environments now, making your toolset even better!
A user on a linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.
> […] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
> Has your backup plan ever been put to the test?
You should put it to the test regularly, not just when it’s needed.
An untested backup is not really a backup, I think.
At Open Query we tend to use dual master setups with MMM, other replication slaves, mysqldump, and XtraBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:
- XtraBackup and LVM give you physical backups. That’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong, it’ll be all stuffed (that is, you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot across multiple volumes consistently, so if you have your InnoDB ibdata/IBD files and iblog files on separate spindles, using LVM is not suitable.
- mysqldump for logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load so having *only* an SQL dump of a larger dataset is not a good idea.
- some of the above backups can and should *also* be copied off-site. that’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
- having dual masters is for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
- slaves. You can even delay a slave (Maatkit has a tool for this), so that would give you a live correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one. Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.
A key issue to consider is this… if the dataset is sufficiently large, and the online requirements high enough, you can’t afford to just have backups. Why? Because, how quickly can you deploy new suitable hardware, install OS, do restore, validate, put back online?
In many cases one or more aspects of the above list simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.
So, we use a combination of tools and approaches depending on needs, but in general terms we aim for keeping the overall environment online (individual machines can and will fail! relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required.
Into that picture also comes using separate test/staging servers to not have developers stuff around on live servers (human error is an important cause of hassles).
In our training modules, we’ve combined the backups, recovery and replication topics as it’s clearly all intertwined and overlapping. Discussing backup techniques separate from replication and dual master setups makes no sense to us. It needs to be put in place with an overall vision.
Note that a SAN is not a backup strategy. And neither is replication on its own.
The following quirky dynamic SQL will scan each index of each table so that they’re loaded into the key_buffer (MyISAM) or innodb_buffer_pool (InnoDB). If you also use the PBXT engine which does have a row cache but no clustered primary key, you could also incorporate some full table scans.
To make mysqld execute this on startup, create /var/lib/mysql/initfile.sql and make it owned by mysql:mysql
SET SESSION group_concat_max_len=100*1024*1024;
SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,'` FORCE INDEX (`',index_name,'`)') SEPARATOR ' UNION ALL ') INTO @sql FROM information_schema.statistics WHERE table_schema NOT IN ('information_schema','mysql') AND seq_in_index = 1;
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
SET SESSION group_concat_max_len=@@GLOBAL.group_concat_max_len;
and in my.cnf add a line in the [mysqld] block
init-file = /var/lib/mysql/initfile.sql
That’s all. mysqld reads that file on startup and executes each statement (one per line). Since we can do the whole select in a single (admittedly quirky) query and then use dynamic SQL to execute the result, we don’t need to create a stored procedure.
Of course this kind of simplistic “get everything” only really makes sense if the entire dataset+indexes fit in memory, otherwise you’ll want to be more selective. Still, you could use the above as a basis, perhaps using another table to provide a list of tables/indexes to be excluded – or if the schema is really stable, simply have a list of tables/indexes to be included instead of dynamically using information_schema.
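As a sketch of that exclusion-table idea (the mysql.preload_exclude table and its columns are hypothetical names, not an existing feature), the information_schema part could be filtered like this:

CREATE TABLE IF NOT EXISTS mysql.preload_exclude (
  table_schema VARCHAR(64) NOT NULL,
  table_name   VARCHAR(64) NOT NULL,
  index_name   VARCHAR(64) NOT NULL,
  PRIMARY KEY (table_schema, table_name, index_name)
) ENGINE=InnoDB;

SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',s.column_name,'`) FROM `',s.table_schema,'`.`',s.table_name,'` FORCE INDEX (`',s.index_name,'`)') SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics AS s
  LEFT JOIN mysql.preload_exclude AS x
    ON (x.table_schema = s.table_schema AND x.table_name = s.table_name AND x.index_name = s.index_name)
 WHERE s.table_schema NOT IN ('information_schema','mysql')
   AND s.seq_in_index = 1
   AND x.table_schema IS NULL;  -- keep only indexes not listed in the exclusion table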
Practical (albeit niche) application:
In a system with multiple slaves, adding a new slave means it starts with cold caches, but since load balancing will send it only some of the load, that often works out ok. However, some environments have dual masters but the application is not able to do read/write splitting to utilise slaves, so all the reads also go to the active master. Consequently, the passive master will have relatively cold caches (only rows/indexes that have been updated will be in memory), so in case of a failover the amount of disk reads for the many concurrent SELECT queries will go through the roof – temporarily slowing the effective performance to a dismal crawl: each query takes longer because of the additional disk access, so depending on the setup the server may even run out of connections, which in turn upsets the application servers. It would sort itself out, but a) it looks very bad on the frontend and b) it may take a number of minutes.
The above construct prevents that scenario, and as mentioned it can be used as a basis to deal with other situations. Not many people know about the init-file option, so this is a nice example.
If you want to know how the SQL works, read on. The original line is very long so I’ll reprint it below with some reformatting:
SELECT GROUP_CONCAT(CONCAT(
         'SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,
         '` FORCE INDEX (`',index_name,'`)'
       ) SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics
 WHERE table_schema NOT IN ('information_schema','mysql')
   AND seq_in_index = 1;
The outer query grabs each regular db/table/index/firstcol name that exists in the server, writing out a SELECT query that counts all not-NULL values of the indexed column (so it must scan the index), forcing that specific index. We then abuse the versatile and flexible GROUP_CONCAT() function to glue all those SELECTs together, with “UNION ALL” inbetween. The result is a single very long string, so we need to tweak the maximum allowed group_concat output beforehand to prevent truncation.
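For example, for a hypothetical shop.orders table with a primary key on id and a secondary index idx_cust on customer_id, the generated @sql would look roughly like this (in reality it is one long line; it is wrapped here for readability):

SELECT COUNT(`id`) FROM `shop`.`orders` FORCE INDEX (`PRIMARY`)
UNION ALL
SELECT COUNT(`customer_id`) FROM `shop`.`orders` FORCE INDEX (`idx_cust`)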
Open Query currently hosts a large part of our infrastructure at Linode. We are extremely happy with their performance, stability and support. Unfortunately, any chain is only as strong as its weakest link. This week, there was a major thunderstorm near the Hurricane Electric datacenter in Fremont (anyone else think that name is funny in combination with the event in question?) and through a massive power surge, most of HE’s datacenter lost power. Among the Linodes affected in our infrastructure were all of the machines involved in our MMM setup.
The masters came back up before the monitor, which is around the time I was alerted. Logging in, I noticed replication was broken on one of the masters, but the other master seemed healthy. Since the monitor was not up and it seemed like it could potentially be hours before it would, I decided it was time for manual action. Since our MMM setup doesn’t have slaves currently, I decided a good option would be to mimic MMM and move the virtual IP to the healthy server.
I executed the following manual commands to make the desired changes:
$ ip addr add <virtip> dev eth0
$ /usr/sbin/arping -I eth0 -c 5 <virtip>
That brought all our applications back online, which was the desired effect. I manually fixed replication by repositioning the masters. A while later, the monitor came up and automatically took over, bringing everything back to normal.
Everything went well, but it wasn’t until the next morning that I realised there was a possible flaw in my logic (it didn’t affect us, but I wanted to blog about it to make others aware): when replication stopped, master A was active. My commands above made master B the active master. Now, in theory it is possible that writes were sent to master A after replication broke, and commands later sent to master B would presume those writes had been executed there, which they were not, since replication didn’t apply them. This is one of those niche occasions where data drift can occur without you noticing it.
My recommendation is to not do what I did unless you are very certain your setup doesn’t suffer from this potential problem. If you do decide to use this trick, however, make sure to use the Maatkit mk-table-checksum and mk-table-sync tools when all is well again to check for (and correct!) data drift.
This is a Request for Input. Dual MySQL masters with MMM in a single datacentre are in common use, and other setups like DRBD and of course VM/SAN based failover solutions are conceptually straightforward also. Thus, achieving various forms of resilience within a single datacentre is doable and not costly.
Doing the same across multiple datacentres (let’s, for simplicity’s sake, limit it to two) is another matter. MySQL replication works well across longer links, and it can use MySQL’s in-built SSL or tools like stunnel. Of course it needs to be monitored, as usual, but since it’s asynchronous, the latency between the datacentres is not a big issue (apart from the fact that the second server gets up-to-date a little bit later).
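For reference, here is a minimal sketch of pointing a slave at a remote master over MySQL’s in-built SSL – the host, user, password and certificate paths are placeholders:

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_SSL = 1,
  MASTER_SSL_CA   = '/etc/mysql/ssl/ca-cert.pem',
  MASTER_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  MASTER_SSL_KEY  = '/etc/mysql/ssl/client-key.pem';
START SLAVE;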
But as those who have tried will know, having a client (application server) connect to a MySQL instance in a remote datacentre is a whole other matter: latency becomes a big issue and is generally very noticeable on the front-end. One solution for that is to have application servers only connect to their “local” MySQL server.
So the question to you is: do you now have (or have you had in the past) a setup with MySQL masters in different datacentres? What did that setup look like (which additional tools and infra did you use for it), and what were your experiences (good and bad, solutions to issues, etc)? I’m trying to gather additional expertise that might already be out there, which can help us all. Please add your input! Thanks.