Multi-threaded replication is a new feature introduced in MySQL 5.6 and MariaDB 10.0. In traditional single-threaded replication, the slaves have a disadvantage as they have to process in sequence what a master executed in parallel. This, plus the fact that slaves usually have a lot of read-only connections to deal with as well, can easily create performance problems. That is, a single-threaded slave needs to be set to allow fewer connections, otherwise there’s a higher risk of it not being able to keep up with the replication stream. There is no exact rule for this, as it relates to general I/O capacity and fsync latency, as well as general CPU and RAM considerations and query patterns.
Currently, the MariaDB implementation appears to be a bit more mature in both design and effectiveness. For instance, MySQL 5.6 does not currently support retrying transactions while doing parallel replication. This can easily cause problems, as commit conflicts are possible and obviously need to be handled. So for the purpose of this blog post we’re going to focus on MariaDB 10.0, which is what we currently use with some of our clients. MariaDB developer Kristian Nielsen has done awesome work and is very responsive to questions and bug reports. Rock on, Kristian!
The fundamental challenge for parallel replication is that some queries are safe to be executed in parallel, and some are not – and somehow, the server needs to know which is which. MariaDB employs two strategies to assist with this:
- Group commit. Since 5.5, transactions (remember, a standalone statement without START TRANSACTION/COMMIT is technically also a transaction) that happen around the same time in different connections are grouped in the binary log and effectively committed together. This is accomplished by the server trying to gather at least a certain number of transactions (binlog_commit_wait_count), having individual connections wait a short while (binlog_commit_wait_usec, in microseconds) to increase the chances of gathering a nice number. This strategy reduces I/O and fsyncs, and thus helps quite a bit with write scaling. The minuscule delay that a transaction might incur because it has to wait is easily offset by the overall better performance. It’s good stuff. For the purpose of parallel replication, any transactions in the same group commit can in principle be executed in parallel on a slave – conflicts are possible, so deadlock handling and retries are essential.
- GTID Domain IDs (gtid_domain_id), part of the Global Transaction ID (GTID) implementation in MariaDB 10.0, which an application can set within a connection. Quite often, different applications and different components of applications use the same database server, but their actions are completely independent: no write operations will ever conflict between the different applications. GTID Domain IDs allow us to tell the server about this, allowing it to always run those transactions in parallel even if they weren’t part of the same group commit! Now that’s a real bonus! (See the configuration sketch right after this list.)
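To make this concrete, here is a minimal sketch of the settings involved. The values are purely illustrative, not recommendations, and you’d normally also put the GLOBAL ones in my.cnf so they survive a restart:

-- On the master: encourage group commit so more transactions land in the same group
SET GLOBAL binlog_commit_wait_count = 10;    -- illustrative value
SET GLOBAL binlog_commit_wait_usec  = 5000;  -- illustrative value, in microseconds

-- On the slave: enable the parallel applier (MariaDB 10.0); the slave threads must be stopped
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;       -- illustrative value
START SLAVE;

-- In an application's connection: declare its writes independent of other applications
SET SESSION gtid_domain_id = 3;              -- illustrative value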
Now, as a practicality, we’re not always able to modify applications to, for instance, set the GTID Domain ID. Plus, a magic (integer) number is required, so we need some planning/coordination between completely independent applications! Through database server consolidation, you may get applications on your server that were previously on a different one. Strictly speaking, having two applications use the same GTID Domain ID is harmless (after all, by default all transactions run in the same domain!), but obviously it doesn’t improve performance.
Open Query engineer Daniel Black and I came up with the following. It’s a combination of MySQL’s init_connect system variable (which is executed whenever a user connects, unless they have the SUPER privilege), a few stored procedures, and an event to keep the domain map reasonably up-to-date. The premise of this implementation is that each database username uniquely identifies an application, and that no two usernames refer to the same application. So, if you have for instance a general application user but also one for background scripts or one with special administrative privileges, you will need to modify the logic in setdomain() a bit to take this into account. If transactions with different GTID Domain IDs execute in parallel on the same database, conflicts can obviously occur. The MariaDB slave threads will retry, but in some cases a conflict cannot be resolved by retrying.
Obviously it’s not perfect, but it does resolve the issue for many situations. Feedback and improvements welcome!
# Automatic GTID Domain IDs for MariaDB 10.0
# Copyright (C) 2014 by Daniel Black & Arjen Lentz, Open Query Pty Ltd (http://openquery.com.au)
# Version 2014-11-18, initial publication via OQ blog (https://openquery.com.au/blog/)
#
# This work is licensed under Creative Commons Attribution-ShareAlike 4.0 International
# http://creativecommons.org/licenses/by-sa/4.0/

USE mysql

DELIMITER //

DROP PROCEDURE IF EXISTS setdomain //
CREATE PROCEDURE setdomain(IN cuser varchar(140))
  DETERMINISTIC
  READS SQL DATA
  SQL SECURITY DEFINER
BEGIN
  DECLARE EXIT HANDLER FOR NOT FOUND SET SESSION gtid_domain_id=10;

  # modify this logic for your particular application/user naming convention
  SELECT domain INTO @l_gtid_domain_id
    FROM mysql.user_domain_map
   WHERE user=LEFT(cuser, LOCATE('@',cuser) -1 );
  SET SESSION gtid_domain_id=@l_gtid_domain_id;
END //

DROP PROCEDURE IF EXISTS create_user_domain_map //
CREATE PROCEDURE create_user_domain_map()
  MODIFIES SQL DATA
BEGIN
  DECLARE u CHAR(80);
  DECLARE h CHAR(60);
  DECLARE userhostcur CURSOR FOR SELECT user,host FROM mysql.user;
  DECLARE EXIT HANDLER FOR NOT FOUND FLUSH PRIVILEGES;

  CREATE TABLE IF NOT EXISTS mysql.user_domain_map (
    domain INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user CHAR(80) COLLATE utf8_bin NOT NULL UNIQUE
  ) AUTO_INCREMENT=10, ENGINE=InnoDB;

  INSERT IGNORE INTO mysql.user_domain_map(user)
    SELECT user FROM mysql.user;

  OPEN userhostcur;
  LOOP
    FETCH userhostcur INTO u,h;
    INSERT IGNORE INTO mysql.procs_priv(Host,Db,User, Routine_name, Routine_type, Grantor, Proc_priv)
      VALUES(h, 'mysql', u, 'setdomain', 'PROCEDURE', CURRENT_USER(), 'Execute');
  END LOOP;
END;//

DELIMITER ;

# (re)create the user domain map
CALL create_user_domain_map();

# set up event schedule
CREATE EVENT update_user_domain_map ON SCHEDULE EVERY 1 DAY DO CALL create_user_domain_map();

# also set this in my.cnf so it's persistent
# init_connect='CALL mysql.setdomain(current_user());'
SET GLOBAL init_connect='CALL mysql.setdomain(current_user());';
I recently wrote about non-deterministic queries in the replication stream. That’s resolved by using either MIXED or ROW based replication rather than STATEMENT based.
Another thing that’s not fully handled by STATEMENT based replication is temporary tables. Imagine the following:
- Master: CREATE TEMPORARY TABLE rpltmpbreak (i INT);
- Wait for slave to replicate this statement, then stop and start mysqld (not just STOP/START SLAVE)
- Master: INSERT INTO rpltmpbreak VALUES (1);
- Slave: SHOW SLAVE STATUS \G
If for any reason a slave server shuts down and restarts after the temp table creation, replication will break because the temporary table will no longer exist on the restarted slave server. It’s obvious when you think about it, but nevertheless it’s quite annoying.
A long time ago (early 2007, when I was still working at MySQL AB) I filed a bug report on this. It’s important to realise that back then, row based replication did exist but was so buggy that you wouldn’t recommend it, so the topic was quite relevant. For some reason the bug has remained open for over 6 years until some recent activity.
This is not an issue of determinism: most temporary table constructs are technically regarded as “safe” to replicate via statement based replication, so even with MIXED you will still find replication broken by the above scenario. Important to realise!
http://dev.mysql.com/doc/refman/5.5/en/replication-features-temptables.html (the obvious place to look) doesn’t really explain this well, but http://dev.mysql.com/doc/refman/5.5/en/replication-rbr-usage.html correctly states that ROW based replication doesn’t suffer from this problem as it replicates the values from the temporary table on the master rather than the statement, thus the slave doesn’t have to deal with the temporary table at all. I’ve suggested that the bug be changed to a documentation issue, updating the page on replication and temporary tables to properly explain the issue and point clearly and explicitly to the solution.
So, why would you ever use STATEMENT or MIXED rather than ROW based replication?
- Well, as I mentioned, row based replication wasn’t particularly reliable in earlier versions. At that time, for non-deterministic scenarios we recommended MIXED as a compromise (it only uses row based information in the replication stream when necessary, and statements the rest of the time). Many issues have been fixed over time and we can now generally say that row based replication is ok in recent versions of MySQL and MariaDB (5.5 or above, just to be sure). So if you’re replicating from an older master, STATEMENT or MIXED might still be preferable, as long as you know what the limitations are.
- Non-local replication (outside the datacenter) is vastly more efficient with STATEMENT based replication: if you’re updating 100,000 rows, it’s a single statement, whereas with ROW it’s 100,000 row events. So depending on bandwidth/cost and such, that might also be a relevant consideration.
If none of those considerations apply, ROW based replication might be the way to go now. But the really important thing to realise is that for each of the choices of STATEMENT, MIXED and ROW, there are advantages and consequences.
Do you have any other reasons for using STATEMENT or MIXED in your environment?
You might find a warning like the below in your error log:
130522 17:54:18 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-increment column after selecting from another table are unsafe because the order in which rows are retrieved determines what (if any) rows will be written. This order cannot be predicted and may differ on master and the slave.
Statement: INSERT INTO tbl2 SELECT * FROM tbl1 WHERE col IN (417,523)
What do MariaDB and MySQL mean with this warning? The server can’t guarantee that this exact query, with STATEMENT based replication, will always yield identical results on the slave.
Does that mean that you have to use ROW based (or MIXED) replication? Possibly, but not necessarily.
For this type of query, it primarily refers to the fact that without ORDER BY, rows have no order and thus a result set may show up in any order the server decides. Sometimes it’s predictable (depending on storage engine and index use), but that’s not something you want to rely on. You don’t have to ponder that, as an ORDER BY is never harmful.
Would ORDER BY col solve the problem? That depends!
If col is unique, yes. If col is not unique, then multiple rows could result and they’d still have a non-deterministic order. So in that case you’d need to ORDER BY col,anothercol to make it absolutely deterministic. The same of course applies if the WHERE clause only referred to a single col value: if multiple rows can match, then it’s not unique and it will require an additional column for the sort.
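As a sketch (assuming tbl1 has a unique column named id – a hypothetical name – alongside the non-unique col), the statement from the warning above could be made fully deterministic like this:

INSERT INTO tbl2
  SELECT * FROM tbl1
   WHERE col IN (417,523)
   ORDER BY col, id;  -- col alone is not unique, so id is added to fix the row order

Note that the server’s unsafe-statement check doesn’t look at the ORDER BY, so you may still see the warning – but the row order, and therefore any auto-increment values assigned in tbl2, will now be the same on master and slave.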
There are other query constructs where going to row based or mixed replication is the only way. But, just because the server tells you it can’t safely replicate a query with statement based replication, that doesn’t mean you can’t use statement based replication at all… there might be another way.
One of the first things we do when preparing a client’s infrastructure for Galera deployment is see whether their schema is suitable.
- Avoiding quirks and edge cases, we can say that Galera simply requires all tables to be InnoDB and also have a PRIMARY KEY (obviously having a PK in InnoDB is important anyway, for InnoDB-internal reasons).
- We want to know about FULLTEXT indexes. With recent InnoDB versions also supporting FULLTEXT we need to check not just whether a table has such an index, but actually which engine it is.
- Spatial indexes. While both InnoDB and MyISAM can deal with spatial datatypes (POINT, GEOMETRY, etc), only MyISAM has the spatial indexes.
Naturally, checking a schema in the server is more effective than going through other sources and possibly missing bits. On the downside, the only viable way to get this info out of MariaDB is INFORMATION_SCHEMA, but because of the way it’s implemented queries tend to be slow and resource intensive. So essentially we do need to ask I_S, but do it as efficiently as possible (we’re dealing with production systems). We have multiple separate questions to ask, which normally we’d ask in separate queries, but in case of I_S that’s really something to avoid. So that’s why it’s all integrated into the single query below, catching every permutation of “not InnoDB”, “lacks primary key”, “has fulltext or spatial index”. We skip the system databases and any VIEWs.
We use the lesser known mysql client command ‘tee’ to output the data into a file, and close it after the query.
We publish the query not as a work of art – I don’t think it’s that pretty! We’d like you to see it because we don’t care for secrets, and also because if there is any way you can reach the same objective using a less resource intensive approach, we’d love to hear about it! This is one of the very few cases where we care only about efficiency, not how pretty the query looks. That said, of course I’d prefer it to be easily readable.
If you regard it purely as a query to be used for Galera, then you can presume it’ll be run on MariaDB 5.5 or later – since 5.3 and above have optimised subqueries, perhaps you can do something with that.
If you spot any other flaw or missing bit, please comment on that too. Thanks!
-- snip
tee galeracheck.txt

SELECT DISTINCT CONCAT(t.table_schema,'.',t.table_name) AS tbl,
       t.engine,
       IF(ISNULL(c.constraint_name),'NOPK','') AS nopk,
       IF(s.index_type = 'FULLTEXT','FULLTEXT','') AS ftidx,
       IF(s.index_type = 'SPATIAL','SPATIAL','') AS gisidx
  FROM information_schema.tables AS t
  LEFT JOIN information_schema.key_column_usage AS c
    ON (t.table_schema = c.constraint_schema AND t.table_name = c.table_name
        AND c.constraint_name = 'PRIMARY')
  LEFT JOIN information_schema.statistics AS s
    ON (t.table_schema = s.table_schema AND t.table_name = s.table_name
        AND s.index_type IN ('FULLTEXT','SPATIAL'))
 WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
   AND t.table_type = 'BASE TABLE'
   AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL
        OR s.index_type IN ('FULLTEXT','SPATIAL'))
 ORDER BY t.table_schema,t.table_name;

notee
-- snap
Credit: the “find tables without a PK” part is based on SQL by Sheeri Cabral and Giuseppe Maxia.
From a Tumblr engineering blog post:
Tumblr is one of the largest users of MySQL on the web. At present, our data set consists of over 60 billion relational rows, adding up to 21 terabytes of unique relational data. Managing over 200 dedicated database servers can be a bit of a handful, so naturally we engineered some creative solutions to help automate our common processes.
Today, we’re happy to announce the open source release of Jetpants, Tumblr’s in-house toolchain for managing huge MySQL database topologies. Jetpants offers a command suite for easily cloning replicas, rebalancing shards, and performing master promotions. It’s also a full Ruby library for use in developing custom billion-row migration scripts, automating database manipulations, and copying huge files quickly to multiple remote destinations.
Dynamically resizable range-based sharding allows you to scale MySQL horizontally in a robust manner, without any need for a central lookup service or massive pre-allocation of tiny shards. Jetpants supports this range-based model by providing a fast way to split shards that are approaching capacity or I/O limitations. On our hardware, we can split a 750GB, billion-row pool in half in under six hours.
Good work Tumblr, excellent move to open up your tools: you’re bound to get good feedback and bug catches/fixes from users in other environments now, making your toolset even better!
A user on a linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.
> […] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
> Has your backup plan ever been put to the test?
You should put it to the test regularly, not just when it’s needed.
An untested backup is not really a backup, I think.
At Open Query we tend to use dual master setups with MMM, other replication slaves, mysqldump, and XtraBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:
- XtraBackup and LVM give you physical backups. That’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong, it’ll be all stuffed (that is, you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot across multiple volumes consistently, so if you have your InnoDB ibdata/IBD files and iblog files on separate spindles, using LVM is not suitable.
- mysqldump for logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load so having *only* an SQL dump of a larger dataset is not a good idea.
- some of the above backups can and should *also* be copied off-site. that’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
- having dual masters is for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
- slaves. You can even delay a slave (Maatkit has a tool for this), so that would give you a live correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one. Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.
A key issue to consider is this… if the dataset is sufficiently large, and the online requirements high enough, you can’t afford to just have backups. Why? Because, how quickly can you deploy new suitable hardware, install OS, do restore, validate, put back online?
In many cases one or more aspects of the above list simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.
So, we use a combination of tools and approaches depending on needs, but in general terms we aim for keeping the overall environment online (individual machines can and will fail! relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required.
Into that picture also comes using separate test/staging servers to not have developers stuff around on live servers (human error is an important cause of hassles).
In our training modules, we’ve combined the backups, recovery and replication topics as it’s clearly all intertwined and overlapping. Discussing backup techniques separate from replication and dual master setups makes no sense to us. It needs to be put in place with an overall vision.
Note that a SAN is not a backup strategy. And neither is replication on its own.
The following quirky dynamic SQL will scan each index of each table so that they’re loaded into the key_buffer (MyISAM) or innodb_buffer_pool (InnoDB). If you also use the PBXT engine which does have a row cache but no clustered primary key, you could also incorporate some full table scans.
To make mysqld execute this on startup, create /var/lib/mysql/initfile.sql and make it owned by mysql:mysql
SET SESSION group_concat_max_len=100*1024*1024;
SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,'` FORCE INDEX (`',index_name,'`)') SEPARATOR ' UNION ALL ') INTO @sql FROM information_schema.statistics WHERE table_schema NOT IN ('information_schema','mysql') AND seq_in_index = 1;
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
SET SESSION group_concat_max_len=@@GLOBAL.group_concat_max_len;
and in my.cnf add a line in the [mysqld] block
init-file = /var/lib/mysql/initfile.sql
That’s all. mysqld reads that file on startup and executes each statement (one per line). Since we can do the whole select in a single (admittedly quirky) query and then use dynamic SQL to execute the result, we don’t need to create a stored procedure.
Of course this kind of simplistic “get everything” only really makes sense if the entire dataset+indexes fit in memory, otherwise you’ll want to be more selective. Still, you could use the above as a basis, perhaps using another table to provide a list of tables/indexes to be excluded – or if the schema is really stable, simply have a list of tables/indexes to be included instead of dynamically using information_schema.
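As a sketch of that exclusion-table idea (the mysql.preload_exclude table and its columns are hypothetical names, not an existing feature), the information_schema part could be filtered like this:

CREATE TABLE IF NOT EXISTS mysql.preload_exclude (
  table_schema VARCHAR(64) NOT NULL,
  table_name   VARCHAR(64) NOT NULL,
  index_name   VARCHAR(64) NOT NULL,
  PRIMARY KEY (table_schema, table_name, index_name)
) ENGINE=InnoDB;

SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',s.column_name,'`) FROM `',s.table_schema,'`.`',s.table_name,'` FORCE INDEX (`',s.index_name,'`)') SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics AS s
  LEFT JOIN mysql.preload_exclude AS x
    ON (x.table_schema = s.table_schema AND x.table_name = s.table_name AND x.index_name = s.index_name)
 WHERE s.table_schema NOT IN ('information_schema','mysql')
   AND s.seq_in_index = 1
   AND x.table_schema IS NULL;  -- keep only indexes not listed in the exclusion table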
Practical (albeit niche) application:
In a system with multiple slaves, adding a new slave means it starts with cold caches, but since load balancing will send it only some of the load, that often works out ok. However, some environments have dual masters but the application is not able to do read/write splitting to utilise slaves, so all the reads also go to the active master. Consequently, the passive master will have relatively cold caches (only rows/indexes that have been updated will be in memory), so in case of a failover the amount of disk reads for the many concurrent SELECT queries will go through the roof – temporarily slowing the effective performance to a dismal crawl: each query takes longer because of the additional disk access, so depending on the setup the server may even run out of connections, which in turn upsets the application servers. It would sort itself out, but a) it looks very bad on the frontend and b) it may take a number of minutes.
The above construct prevents that scenario, and as mentioned it can be used as a basis to deal with other situations. Not many people know about the init-file option, so this is a nice example.
If you want to know how the SQL works, read on. The original line is very long so I’ll reprint it below with some reformatting:
SELECT GROUP_CONCAT(CONCAT(
         'SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,
         '` FORCE INDEX (`',index_name,'`)'
       ) SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics
 WHERE table_schema NOT IN ('information_schema','mysql')
   AND seq_in_index = 1;
The outer query grabs each regular db/table/index/firstcol name that exists in the server, writing out a SELECT query that counts all not-NULL values of the indexed column (so it must scan the index), forcing that specific index. We then abuse the versatile and flexible GROUP_CONCAT() function to glue all those SELECTs together, with “UNION ALL” inbetween. The result is a single very long string, so we need to tweak the maximum allowed group_concat output beforehand to prevent truncation.
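For example, for a hypothetical shop.orders table with a primary key on id and a secondary index idx_cust on customer_id, the generated @sql would look roughly like this (in reality it is one long line; it is wrapped here for readability):

SELECT COUNT(`id`) FROM `shop`.`orders` FORCE INDEX (`PRIMARY`)
UNION ALL
SELECT COUNT(`customer_id`) FROM `shop`.`orders` FORCE INDEX (`idx_cust`)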
Open Query currently hosts a large part of our infrastructure at Linode. We are extremely happy with their performance, stability and support. Unfortunately, any chain is only as strong as its weakest link. This week, there was a major thunderstorm near the Hurricane Electric datacenter in Fremont (anyone else think that name is funny in combination with the event in question?) and through a massive power surge, most of HE’s datacenter lost power. Among the Linodes affected in our infrastructure were all of the machines involved in our MMM setup.
The masters came back up before the monitor, which is around the time I was alerted. Logging in, I noticed replication was broken on one of the masters, but the other master seemed healthy. Since the monitor was not up and it seemed like it could potentially be hours before it would, I decided it was time for manual action. Since our MMM setup doesn’t have slaves currently, I decided a good option would be to mimic MMM and move the virtual IP to the healthy server.
I executed the following manual commands to make the desired changes:
$ ip addr add <virtip> dev eth0
$ /usr/sbin/arping -I eth0 -c 5 <virtip>
That brought all our applications back online, which was the desired effect. I manually fixed replication by repositioning the masters. A while later, the monitor came up and automatically took over, bringing everything back to normal.
Everything went well, but it wasn’t until the next morning that I realised there was a possible flaw in my logic (it didn’t affect us, but I wanted to blog about it to make others aware): when replication stopped, master A was active. My commands above made master B the active master. Now, in theory it is possible that writes were sent to master A after replication broke, and commands later sent to master B would presume those writes had been executed there, which they were not, since replication didn’t apply them. This is one of those niche occasions where data drift can occur without you noticing it.
My recommendation is to not do what I did unless you are very certain your setup doesn’t suffer from this potential problem. If you do decide to use this trick, however, make sure to use the Maatkit mk-table-checksum and mk-table-sync tools when all is well again to check for (and correct!) data drift.
This is a Request for Input. Dual MySQL masters with MMM in a single datacentre are in common use, and other setups like DRBD and of course VM/SAN based failover solutions are conceptually straightforward also. Thus, achieving various forms of resilience within a single datacentre is doable and not costly.
Doing the same across multiple datacentres (let’s, for simplicity’s sake, limit it to two) is another matter. MySQL replication works well across longer links, and it can use MySQL’s in-built SSL or tools like stunnel. Of course it needs to be monitored, as usual, but since it’s asynchronous, the latency between the datacentres is not a big issue (apart from the fact that the second server gets up-to-date a little bit later).
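For reference, here is a minimal sketch of pointing a slave at a remote master over MySQL’s in-built SSL – the host, user, password and certificate paths are placeholders:

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_SSL = 1,
  MASTER_SSL_CA   = '/etc/mysql/ssl/ca-cert.pem',
  MASTER_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  MASTER_SSL_KEY  = '/etc/mysql/ssl/client-key.pem';
START SLAVE;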
But as those who have tried will know, having a client (application server) connect to a MySQL instance in a remote datacentre is a whole other matter: latency becomes a big issue and is generally very noticeable on the front-end. One solution for that is to have application servers only connect to their “local” MySQL server.
So the question to you is: do you now have (or have you had in the past) a setup with MySQL masters in different datacentres? What did that setup look like (which additional tools and infra did you use for it), and what were your experiences (good and bad, solutions to issues, etc)? I’m trying to gather additional expertise that might already be out there, which can help us all. Please add your input! Thanks.