MySQL upgrades are necessary tasks and we field a variety of questions here at Percona Support regarding MySQL upgrade best practices. This post highlights recommended ways to upgrade MySQL in different scenarios.
Why are MySQL upgrades needed? The reasons are many and include: Access to new features, performance benefits, bug fixes…. However, MySQL upgrades can be risky if not tested extensively beforehand with your application because the process might break it, prevent the application from functioning properly – or performance issues could arise following the upgrade. Moreover, I suggest keeping an eye on new releases of MySQL and Percona Server – check what has changed in the most recent version. Perhaps the latest release has a fix for an issue that you have been experiencing.
Upgrading one major version via SQL Dump:
Upgrading between one major version covers upgrading from Percona Server 5.1 to 5.5 or Percona Server 5.5 to 5.6 and the same implies to Oracle MySQL.
First of all, upgrading between one major version is neither straightforward nor risk-free. Initially you should read “Upgrading from Previous Series” documentation here and here. In that documentation, please place special attention to all of the sections marked “Incompatible Change” and check whether you may be affected by those changes. There might be configuration changes as well as variables renamed, a few older variables obsoleted and new variables introduced – so make sure that you adjust your my.cnf accordingly. For Percona Server specific changes please refer here and here for Percona Server 5.5 & Percona Server 5.6, respectively.
Now there are several possible approaches you may take, where one may be more feasible than the other depending on the current replication topology and total data size – and one might also be safer than another. Let me show you an upgrade procedure… an example upgrading from Percona Server 5.5 to Percona Server 5.6.
In general, there are two types of MySQL upgrades:
- In place, where you use existing datadir against the new MySQL major version, with just running mysql_upgrade after binaries are upgraded,
- SQL dump on an old version and then restore it on a new version (using mysqldump utility or alternatives, like mydumper).
Also in general the second type is safer, but as you may expect a much slower MySQL upgrade process.
Theoretically, the safest scenario is:
- Dump all user grants (using http://www.percona.com/doc/percona-toolkit/2.2/pt-show-grants.html)
- Dump all data (except the mysql database) from MySQL 5.5 into SQL dump and restore on MySQL 5.6
- Restore the user grants
Here’s a basic procedure (you should stop application writes before starting).
1) Capture users and permissions information. This will backup all your existing user privileges.$ wget percona.com/get/pt-show-grants; $ perl pt-show-grants --user=root --ask-pass --flush > /root/grants.sql
2) Produce a logical dump of the 5.5 instance, excluding the mysql, information_schema and performance_schema databases:$ mysql -BNe "SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME NOT IN ('mysql', 'performance_schema', 'information_schema')" | tr 'n' ' ' > /root/dbs-to-dump.sql $ mysqldump --routines --events --single-transaction --databases $(cat /root/dbs-to-dump.sql) > /root/full-data-dump.sql
3) Stop 5.5 instance.$ service mysql stop or $ /etc/init.d/mysql stop
4) Move old datadir (assuming /var/lib/mysql; edit accordingly to your setup):$ mv /var/lib/mysql/ /var/lib/mysql-55
5) Install 5.6 (simply as you would do when not upgrading). If you don’t use a package manager (yum/apt-get) then is likely that you need to run mysql_install_db and mysql_upgrade.
6) Load the users back to new upgraded version of MySQL.$ mysql -uroot < /root/grants.sql
7) Load the data back to new version of MySQL.$ mysql -e "SET GLOBAL max_allowed_packet=1024*1024*1024"; $ mysql -uroot -p --max-allowed-packet=1G < /root/full-data-dump.sql;
At this point all tables have been re-created and loaded in MySQL 5.6 so every binary representation is native to MySQL 5.6. You’ve also completed the cleanest/most-stable upgrade path and your application can resume service – and for that reason it’s worth mentioning that this upgrade path is the same with either upgrading vanila MySQL or Percona Server. Further, you may upgrade from Oracle MySQL to Percona Server, for example, upgrading Oracle MySQL 5.5 to Percona Server 5.6. Again, the MySQL upgrade path as described would be the same as Percona Server, which is a drop-in replacement of Oracle MySQL.
“SQL dump” is also known as a logical backup. It is safer in the sense that when restoring, all tables will be created using the format of the new MySQL binaries you’re using, which bypasses compatibility issues in general. Still for large data like data in terabytes, gigabytes… this may be a very time-consuming approach. On the other hand, by dumping/reloading such large data sets, it is possible that you will be able to recover a lot of free space on the disk as the InnoDB table spaces will be re-created from scratch, thus optimized and defragmented. If the data was often updated/deleted, the benefits may be significant.
Minor version MySQL upgrade within the same major version via In-Place Upgrade:
This implies to upgrading within the same series e.g. MySQL 5.5.35 to MySQL 5.5.38 or Percona Server 5.6.14 to latest Percona Server 5.6.20.
This is known as an in-place upgrade, where you just install a newer binary package and then run mysql_upgrade script, which checks and updates system tables if necessary. Still, with the in-place upgrade we highly recommend checking release notes for new features, bug fixes, etc. For Percona Server 5.5 and Percona Server 5.6, release notes can be found here and here respectively.
For Percona Server we have additional documents describing some details when it comes to upgrading Percona Server with a focus on Percona-specific features that can be found here and here. This also covers complete In-Place Upgrade procedure with the yum/apt package manager.
Also, to be on safe side you can do the upgrade with a logical dump using the earlier described procedure via mysqldump or mydumper program – where the former does parallel backups and restore and logical backup – and is the safest approach for the upgrade.
MySQL Upgrade directly to the latest version by skipping one major version in between:
This includes upgrading from MySQL 5.0 to MySQL 5.5 by skipping version 5.1 in between or upgrading MySQL 5.1 to MySQL 5.6 by skipping version 5.5 in between. Further, this also includes upgrading to MySQL 5.6 directly from MySQL 5.0 although there should be very few users still using MySQL version 5.0. This also implies to Percona Server.
For the topic, we would assume upgrading from Oracle MySQL or Percona Server 5.1 directly to version 5.6 by skipping one major version 5.5 in between.
Before anything, this is a serious upgrade, and a huge step over one major MySQL version. That is, it’s risky. Upgrading by using just binaries update is not supported and it’s not safe skipping major versions in between, so you should never do this from 5.0->5.5, 5.1->5.6, and surely not for 5.0->5.6. One problem is that not all changes in MySQL versions are backwards compatible. Some differences were introduced that may affect both how the data is handled, but also how the server behaves including both SQL language and MySQL server and storage engines internals. Another thing is that between MySQL 5.0 and 5.6 versions, a number of default setting variables were changed, which may result in completely different, unexpected behavior. For example since MySQL 5.5 the default storage engine is InnoDB and since MySQL 5.6 by default InnoDB will use a separate tablespace for each table and GTID replication was also introduced. But there are many more details which I won’t list here. All of those changes are described in “Upgrading from Previous Series” documentation as described above.
It’s worth mentioning that upgrading by skipping one major version is highly not recommended. Upgrading from MySQL 5.1 to 5.6 shouldn’t be done in one shot. Instead, I would suggest upgrading from version 5.1 to 5.5 and then from version 5.5 to 5.6 and running mysql_upgrade at each step. That will cope with the changes in formats as explained in the manual.
MySQL Upgrade Precautions:
MySQL upgrade precautions are an essential part of the upgrade itself. Before you upgrade make sure you have thoroughly tested all application parts with the desired version of MySQL. This is especially needed for an upgrade between major versions or if you are upgrading by skipping one major version in-between (e.g. upgrade from MySQL 5.1 to MySQL 5.6).
Make sure you read release notes carefully and that you are aware of all the changes. You can find Oracle MySQL 5.5 and 5.6 release notes as follows:
While Percona Server specific release notes can be found below for same versions as described above.
If you are planning to upgrade to Oracle MySQL 5.6 or Percona Server 5.6 I would recommend first checking for existing critical bugs. Bugs you should aware of:
This is yet another important aspect of any MySQL upgrade. You should plan your upgrade along with an upgrade hierarchy. This is always recommend: upgrade your dev/QA servers first, then staging server’s before moving to production. In fact, you can spare upgraded instances where you have desired upgraded versions of MySQL and then test your application extensively.
Once you are happy with the MySQL upgrade on your test servers, staging servers, etc., then you can begin the MySQL upgrade on your production servers. In replication environments we highly recommend upgrading the MySQL slaves first (one by one) and then finally upgrading the MySQL master. In reality, you can upgrade one of the slaves first and run it for few days to be on safe side – all the while closely monitoring its performance. If you don’t have a replication setup it may be worth creating a replica to test the newer version of MySQL on it first. Once you are happy with the results you can upgrade remaining the slaves and finally the master.
How Percona software helps you in a MySQL upgrade:
In any MySQL upgrade, Percona Toolkit comes to the rescue. Percona Tookit contains a number of tools that help a great deal.
pt-upgrade is one of such tool. It allows you to test whether the new MySQL instance handles some specific queries at least as fast as old version. There may be some substantial differences as the MySQL query optimizer has changed a lot between versions 5.1 and 5.6 and also data statistics may be refreshed, hence the query plan may change. You can check further in the manual about optimizer changes.
pt-query-digest is another great tool that might help you in the upgrade. You can replay your slow query log against existing and new desired MySQL versions for before and after query performance validation.
You can also benefit from Percona Cloud Tools for MySQL which is a hosted service providing access to query performance insights for all MySQL uses. You can signup for free now because this service is in public beta. Percona Cloud Tools, among other things, allows you to visually check your queries performance after a MySQL upgrade.
It’s highly recommended to backup your data before your MySQL upgrade. Percona XtraBackup is free and open source (like all Percona software). It’s a hot backup tool which backs-up your data online without scarifying read/write ability from the database and it will backup your data with minor impact.
Last but not least, You will find this post pretty useful, too: “Upgrading MySQL.” It’s a few years old but still very relevant. And also take a look at this informative webinar, “Upgrading to MySQL 5.6: Best Practices.” Both are from Percona CEO Peter Zaitsev.
A MySQL upgrade might look like a simple task – but actually it’s not. I’ve tried to cover most of the MySQL upgrade scenarios in this post that you will encounter. Again, I recommend to briefly test your application parts before pushing it “live,” otherwise it may break your application or part of it – or may minimize performance instead of a performance gain. Finally, I recommend having a downgrade plan in place before the MySQL upgrade just in case something goes wrong. Planning a proper downgrade procedure will minimize your app downtime when things go wrong. I’m looking forward to your comments and questions below.
So, I’ve been looking around for a while (and a few times now) for any good resources that cover a bunch of MySQL architecture and technical details aimed towards the technically proficient but not MySQL literate audience. I haven’t really found anything. I mean, there’s the (huge and very detailed) MySQL manual, there’s the MySQL Internals manual (which is sometimes only 10 years out of date) and there’s various blog entries around the place. So I thought I’d write something explaining roughly how it all fits together and what it does to your system (processes, threads, IO etc).(Basically, I’ve found myself explaining this enough times in the past few years that I should really write it down and just point people to my blog). I’ve linked to things for more reading. You should probably read them at some point.
Years ago, there were many presentations on MySQL Architecture. I went to try and find some on YouTube and couldn’t. We were probably not cool enough for YouTube and the conferences mostly weren’t recorded. So, instead, I’ll just link to Brian on NoSQL – because it’s important to cover NoSQL as well.
So, here is a quick overview of executing a query inside a MySQL Server and all the things that can affect it. This isn’t meant to be complete… just a “brief” overview (of a few thousand words).
MySQL is an open source relational database server, the origins of which date back to 1979 with MySQL 1.0 coming into existence in 1995. It’s code that has some history and sometimes this really does show. For a more complete history, see my talk from linux.conf.au 2014: Past, Present and Future of MySQL (YouTube, Download).
The MySQL Server runs as a daemon (mysqld). Users typically interact with it over a TCP or UNIX domain socket through the MySQL network protocol (of which multiple implementations exist under various licenses). Each connection causes the MySQL Server (mysqld) to spawn a thread to handle the client connection.
There are now several different thread-pool plugins that instead of using one thread per connection, multiplex connections over a set of threads. However, these plugins are not commonly deployed and we can largely ignore them. For all intents and purposes, the MySQL Server spawns one thread per connection and that thread alone performs everything needed to service that connection. Thus, parallelism in the MySQL Server is gotten from executing many concurrent queries rather than one query concurrently.
The MySQL Server will cache threads (the amount is configurable) so that it doesn’t have to have the overhead of pthread_create() for each new connection. This is controlled by the thread_cache_size configuration option. It turns out that although creating threads may be a relatively cheap operation, it’s actually quite time consuming in the scope of many typical MySQL Server connections.
Because the MySQL Server is a collection of threads, there’s going to be thread local data (e.g. connection specific) and shared data (e.g. cache of on disk data). This means mutexes and atomic variables. Most of the more advanced ways of doing concurrency haven’t yet made it into MySQL (e.g. RCU hasn’t yet and is pretty much needed to get 1 million TPS), so you’re most likely going to see mutex contention and contention on cache lines for atomic operations and shared data structures.
There are also various worker threads inside the MySQL Server that perform various functions (e.g. replication).
Until sometime in the 2000s, more than one CPU core was really uncommon, so the fact that there were many global mutexes in MySQL wasn’t really an issue. These days, now that we have more reliable async networking and disk IO system calls but MySQL has a long history, there’s global mutexes still and there’s no hard and fast rule about how it does IO.
Over the past 10 years of MySQL development, it’s been a fight to remove the reliance on global mutexes and data structures controlled by them to attempt to increase the number of CPU cores a single mysqld could realistically use. The good news is that it’s no longer going to max out on the number of CPU cores you have in your phone.
So, you have a MySQL Client (e.g. the mysql client or something) connecting to the MySQL Server. Now, you want to enter a query. So you do that, say “SELECT 1;”. The query is sent to the server where it is parsed, optimized, executed and the result returns to the client.
Now, you’d expect this whole process to be incredibly clean and modular, like you were taught things happened back in university with several black boxes that take clean input and produce clean output that’s all independent data structures. At least in the case of MySQL, this isn’t really the case. For over a decade there’s been lovely architecture diagrams with clean boxes – the code is not like this at all. But this probably only worries you once you’re delving into the source.
The parser is a standard yacc one – there’s been attempts to replace it over the years, none of which have stuck – so we have the butchered yacc one still. With MySQL 5.0, it exploded in size due to the addition of things like SQL2003 stored procedures and it is of common opinion that it’s rather bloated and was better in 4.1 and before for the majority of queries that large scale web peeps execute.
There is also this thing called the Query Cache – protected by a single global mutex. It made sense in 2001 for a single benchmark. It is a simple hash of the SQL statement coming over the wire to the exact result to send(2) over the socket back to a client. On a single CPU system where you ask the exact same query again and again without modifying the data it’s the best thing ever. If this is your production setup, you probably want to think about where you went wrong in your life. On modern systems, enabling the query cache can drop server performance by an order of magnitude. A single global lock is a really bad idea. The query cache should be killed with fire – but at least in the mean time, it can be disabled.
Normally, you just have the SQL progress through the whole process of parse, optimize, execute, results but the MySQL Server also supports prepared statements. A prepared statement is simply this: “Hi server, please prepare this statement for execution leaving the following values blank: X, Y and Z” followed by “now execute that query with X=foo, Y=bar and Z=42″. You can call execute many times with different values. Due to the previously mentioned not-quite-well-separated parse, optimize, execute steps, prepared statements in MySQL aren’t as awesome as in other relational databases. You basically end up saving parsing the query again when you execute it with new parameters. More on prepared statements (from 2006) here. Unless you’re executing the same query many times in a single connection, server side prepared statements aren’t worth the network round trips.
The absolute worst thing in the entire world is MySQL server side prepared statements. It moves server memory allocation to be the responsibility of the clients. This is just brain dead stupid and a reason enough to disable prepared statements. In fact, just about every MySQL client library for every programming language ever actually fakes prepared statements in the client rather than trust every $language programmer to remember to close their prepared statements. Open many client connections to a MySQL Server and prepare a lot of statements and watch the OOM killer help you with your DoS attack.
So now that we’ve connected to the server, parsed the query (or done a prepared statement), we’re into the optimizer. The optimizer looks at a data structure describing the query and works out how to execute it. Remember: SQL is declarative, not procedural. The optimizer will access various table and index statistics in order to work out an execution plan. It may not be the best execution plan, but it’s one that can be found within reasonable time. You can find out the query plan for a SELECT statement by prepending it with EXPLAIN.
The MySQL optimizer is not the be all and end all of SQL optimizers (far from it). A lot of MySQL performance problems are due to complex SQL queries that don’t play well with the optimizer, and there’s been various tricks over the years to work around deficiencies in it. If there’s one thing the MySQL optimizer does well it’s making quick, pretty good decisions about simple queries. This is why MySQL is so popular – fast execution of simple queries.
To get table and index statistics, the optimizer has to ask the Storage Engine(s). In MySQL, the actual storage of tables (and thus the rows in tables) is (mostly) abstracted away from the upper layers. Much like a VFS layer in an OS kernel, there is (for some definition) an API abstracting away the specifics of storage from the rest of the server. The API is not clean and there are a million and one layering violations and exceptions to every rule. Sorry, not my fault.
Table definitions are in FRM files on disk, entirely managed by MySQL (not the storage engines) and for your own sanity you should not ever look into the actual file format. Table definitions are also cached by MySQL to save having to open and parse a file.
Originally, there was MyISAM (well, and ISAM before it, but that’s irrelevant now). MyISAM was non-transactional but relatively fast, especially for read heavy workloads. It only allowed one writer although there could be many concurrent readers. MyISAM is still there and used for system tables. The current default storage engine is called InnoDB. It’s all the buzzwords like ACID and MVCC. Just about every production environment is going to be using InnoDB. MyISAM is effectively deprecated.
InnoDB originally was its own independent thing and has (to some degree) been maintained as if it kind of was. It is, however, not buildable outside a MySQL Server anymore. It also has its own scalability issues. A recent victory was splitting the kernel_mutex, which was a mutex that protected far too much internal InnoDB state and could be a real bottleneck where NRCPUs > 4.
So, back to query execution. Once the optimizer has worked out how to execute the query, MySQL will start executing it. This probably involves accessing some database tables. These are probably going to be InnoDB tables. So, MySQL (server side) will open the tables, looking up the MySQL Server table definition cache and creating a MySQL Server side table share object which is shared amongst the open table instances for that table. See here for scalability hints on these (from 2009). The opened table objects are also cached – table_open_cache. In MySQL 5.6, there is table_open_cache_instances, which splits the table_open_cache mutex into table_open_cache_instances mutexes to help reduce lock contention on machines with many CPU cores (> 8 or >16 cores, depending on workload).
Once tables are opened, there are various access methods that can be employed. Table scans are the worst (start at the start and examine every row). There’s also index scans (often seeking to part of the index first) and key lookups. If your query involves multiple tables, the server (not the storage engine) will have to do a join. Typically, in MySQL, this is a nested loop join. In an ideal world, this would all be really easy to spot when profiling the MySQL server, but in reality, everything has funny names like rnd_next.
As an aside, any memory allocated during query execution is likely done as part of a MEM_ROOT – essentially a pool allocator, likely optimized for some ancient libc on some ancient linux/Solaris and it just so happens to still kinda work okay. There’s some odd server configuration options for (some of the) MEM_ROOTs that get exactly no airtime on what they mean or what changing them will do.
InnoDB has its own data dictionary (separate to FRM files) which can also be limited in current MySQL (important when you have tens of thousands of tables) – which is separate to the MySQL Server table definitions and table definition cache.
But anyway, you have a number of shared data structures about tables and then a data structure for each open table. To actually read/write things to/from tables, you’re going to have to get some data to/from disk.
InnoDB tables can be stored either in one giant table space or file-per-table. (Even though it’s now configurable), InnoDB database pages are 16kb. Database pages are cached in the InnoDB Buffer Pool, and the buffer-pool-size should typically be about 80% of system memory. InnoDB will use a (configurable) method to flush. Typically, it will all be O_DIRECT (it’s configurable) – which is why “just use XFS” is step 1 in IO optimization – the per inode mutex in ext3/ext4 just doesn’t make IO scale.
InnoDB will do some of its IO in the thread that is performing the query and some of it in helper threads using native linux async IO (again, that’s configurable). With luck, all of the data you need to access is in the InnoDB buffer pool – where database pages are cached. There exists innodb_buffer_pool_instances configuration option which will split the buffer pool into several instances to help reduce lock contention on the InnoDB buffer pool mutex.
All InnoDB tables have a clustered index. This is the index by which the rows are physically sorted by. If you have an INT PRIMARY KEY on your InnoDB table, then a row with that primary key value of 1 will be physically close to the row with primary key value 2 (and so on). Due to the intricacies of InnoDB page allocation, there may still be disk seeks involved in scanning a table in primary key order.
Every page in InnoDB has a checksum. There was an original algorithm, then there was a “fast” algorithm in some forks and now we’re converging on crc32, mainly because Intel implemented CPU instructions to make that fast. In write heavy workloads, this used to show up pretty heavily in profiles.
InnoDB has both REDO and UNDO logging to keep both crash consistency and provide consistent read views to transactions. These are also stored on disk, the redo logs being in their own files (size and number are configurable). The larger the redo logs, the longer it may take to run recovery after a crash. The smaller the redo logs, the more trouble you’re likely to run into with large or many concurrent transactions.
If your query performs writes to database tables, those changes are written to the REDO log and then, in the background, written back into the table space files. There exists configuration parameters for how much of the InnoDB buffer pool can be filled with dirty pages before they have to be flushed out to the table space files.
In order to maintain Isolation (I in ACID), InnoDB needs to assign a consistent read view to a new transaction. Transactions are either started explicitly (e.g. with BEGIN) or implicitly (e.g. just running a SELECT statement). There has been a lot of work recently in improving the scalability of creating read views inside InnoDB. A bit further in the past there was a lot of work in scaling InnoDB for greater than 1024 concurrent transactions (limitations in UNDO logging).
Fancy things that make InnoDB generally faster than you’d expect are the Adaptive Hash Index and change buffering. There are, of course, scalability challenges with these too. It’s good to understand the basics of them however and (of course), they are configurable.
If you end up reading or writing rows (quite likely) there will also be a translation between the InnoDB row format(s) and the MySQL Server row format(s). The details of which are not particularly interesting unless you’re delving deep into code or wish to buy me beer to hear about them.
Query execution may need to get many rows from many tables, join them together, sum things together or even sort things. If there’s an index with the sort order, it’s better to use that. MySQL may also need to do a filesort (sort rows, possibly using files on disk) or construct a temporary table in order to execute the query. Temporary tables are either using the MEMORY (formerly HEAP) storage engine or the MyISAM storage engine. Generally, you want to avoid having to use temporary tables – doing IO is never good.
Once you have the results of a query coming through, you may think that’s it. However, you may also be part of a replication hierarchy. If so, any changes made as part of that transaction will be written to the binary log. This is a log file maintained by the MySQL Server (not the storage engines) of all the changes to tables that have occured. This log can then be pulled by other MySQL servers and applied, making them replication slaves of the master MySQL Server.
We’ll ignore the differences between statement based replication and row based replication as they’re better discussed elsewhere. Being part of replication means you get a few extra locks and an additional file or two being written. The binary log (binlog for short) is a file on disk that is appended to until it reaches a certain size and is then rotated. Writes to this file vary in size (along with the size of transactions being executed). The writes to the binlog occur as part of committing the transaction (and the crash safety between writing to the binlog and writing to the storage engine are covered elsewhere – basically: you don’t want to know).
If your MySQL Server is a replication slave, then you have a thread reading the binary log files from another MySQL Server and then another thread (or, in newer versions, threads) applying the changes.
If the slow query log or general query log is enabled, they’ll also be written to at various points – and the current code for this is not optimal, there be (yes, you guess it) global mutexes.
Once the results of a query have been sent back to the client, the MySQL Server cleans things up (frees some memory) and prepares itself for the next query. You probably have many queries being executed simultaneously, and this is (naturally) a good thing.
There… I think that’s a mostly complete overview of all the things that can go on during query execution inside MySQL.
It’s been a little while since I blogged on MySQL on POWER (last time was thinking that new releases would be much better for running on POWER). Well, I recently grabbed the MySQL 5.6.20 source tarball and had a go with it on a POWER8 system in the lab. There is good news: I now only need one patch to have it function pretty flawlessly (no crashes). Unfortunately, there’s still a bit of an odd thing with some of the InnoDB mutex code (bug filed at some point soon).
But, with this one patch applied, I was getting okay sysbench results and things are looking good.
Now just to hope the MySQL team applies my other patches that improve things on POWER. To be honest, I’m a bit disappointed many of them have sat there for this long… it doesn’t help build a development community when patches can sit for months without either “applied” or “fix these things first”. That being said, just as I write this, Bug 72809 which I filed has been closed as the fix has been merged into 5.6.22 and 5.7.6, so there is hope… it’s just that things can just be silent for a while. It seems I go back and forth on how various MySQL variants are going with fostering an external development community.
I need to store exponentially increasing amounts of data and analyze all of it in real-time.
This is also known simply as: “We have big data.” Typically, this data is used for user interaction analysis, ad tracking, or other common click stream applications. However, it can also be seen in threat assessment (ddos mitigation, etc), financial forecasting, and other applications as well. While MySQL (and other OLTP systems) can handle this to a degree, it is by no means a forte. Some of the pain points include:
- Cost of rapidly increasing, expensive disk storage (OLTP disks need to be fast == $$)
- Performance decrease as the data size increases
- Wasted hardware resources (excess I/O, etc)
- Impact against other time-sensitive transactions (i.e. OLTP workload)
While there are many approaches to this problem – and often times, the solution is actually a hybrid of many individually tailored components – a solution that I have seen more frequently in recent work is HP Vertica.
At the 30,000 foot overview, Vertica is built around the following principles:
- Columnar data store
- Highly compressed data
- Clustered solution for both availability and scalability
Over the next few weeks, I’ll discuss several aspects of Vertica including:
- Underlying architecture and concepts
- Basic installation and use
- Different data loading techniques
- Some various maintenance/operational procedures
- Some comparisons vs. traditional OLTP (MySQL) performance
- Some potential use-cases
- Integration with other tools (such as Hadoop)
While Vertica is by no means the silver bullet that will solve all of your needs, it may prove to be a very valuable tool in your overall approach to managing big data.
I recently encountered a situation in which after running Percona Toolkit’s pt-table-checksum on a customer system, 95% of the table on the MySQL master was different on the MySQL slave. Although this table was not a critical part of the infrastructure, from time to time, writes to the table from the master would break replication. Additionally, this table has about 6 million rows, and running pt-table-sync would take sometime. Rebuilding the slave from backup of the master would not be an easy option as well since the slave acts as an archive where it has a lot more data than the master.
So how did we solve it? With pt-online-schema-change and a NOOP ALTER.pt-online-schema-change --alter 'ENGINE=INNODB' D=dbname,t=tblname
How is it possible? pt-online-schema-change works by creating a shadow copy of the original table and start copying the rows from the latter to the former. An additional set of TRIGGERs also ensures that any additional changes to existing rows after they have been copied to the shadow version will propagate.
There is little one caveat though, the binlog_format on the master would have to be ROW so the actual ROW images from the master would be copied to the slave. If your binlog_format is set to STATEMENT, you’d still end up with the same inconsistency. Since change statements will be logged as STATEMENT, the NOOP ALTER will operate on the slave table instead of copying from the master. You can configure the binlog_format for the alter process via the –set-vars option i.e. --set-vars 'binlog_format=ROW' to make it more easier!
The post Syncing MySQL slave table with pt-online-schema-change appeared first on MySQL Performance Blog.
Percona Live London 2014 is fast approaching – November is just around the corner. This year’s conference, November 3-4, will be even bigger and better than last year thanks to the participation of leading MySQL experts the world over (including you!).
The Percona Live London MySQL Conference is a great event for users of any level using any of the major MySQL branches: MySQL, MariaDB or Percona Server. And this year we once again host a star-studded group of keynote speakers from industry-leading companies in the MySQL space.
We’ll also be welcoming leading MySQL practitioners from across the industry (and from all corners of the world) who will speak on topics that matter to you now – see the full conference schedule here:
Monday starts early with a full day of tutorials and a fun evening at the community dinner. Attendees will be arriving in true London style on a double-decker bus! Tuesday morning will kick-off with a series of keynotes followed by interactive breakout sessions – wrapping things up at the end of the day with a fun post-conference reception (a great chance to make new friends and reconnect with old ones).
Here’s a sneak peek at some of the must-see events this year:
- Oracle’s Tomas Ulin will speak on “MySQL: It Just Keeps Getting Better”
- Facebook’s Shlomo Priymak will explain “MySQL Automation at Facebook Scale”
- I’ll update everyone on the “Ever Evolving MySQL Landscape
- Booking.com’s Jean-François Gagné will discuss “High Availability, Disaster Recovery and Extreme Read Scaling using Binlog Servers”
- Continuent’s Robert Hodges will talk about how “MySQL Comes of Age”
- Codership’s Alexey Yurchenko will make a “Synchronous Revelation”
- Tim Callaghan of Tokutek will share “Performance Bechmarking: Tips, Tricks, and Lessons Learned”
- Baron Schwartz of VividCortex will layout a blueprint for “Building a Time-Series Database on MySQL”
- OlinData’s Walter Heck will walk you through “Moving a MySQL infrastructure with 130k QPS to Galera”
To recap, here are the Top Ten reasons to attend Percona Live London this November 3-4:
10. Advanced Rate Pricing ends October 5th
9. Hear about the hottest current topics and trends.
8. Network! Meet face-to-face in the “hallway track” and make lasting connections.
7. Learn how to make MySQL work better for you – regardless of your expertise.
6. Have a blast at the community dinner!
5. Discuss your unique challenges with experts and discover options for solving them.
4. Engage with the sponsors at their tabletop exhibits.
3. Listen to top industry leaders describe the future of the MySQL ecosystem
2. Learn what works, and what doesn’t, from leading companies using MySQL
1. And the Number 1 reason to attend Percona Live London 2014: ALL of the above!
I look forward to seeing you in London this November and don’t forgot that Advanced Rate Pricing pricing ends October 5 so be sure to register now!
The post Percona Live London: Top Ten reasons to attend Nov. 3-4 appeared first on MySQL Performance Blog.
Today I managed to finally find a way to express what I’ve been thinking for a while: “Python is the new BASIC”. Think about it: it’s easy to get started in, there’s books and tutorials on it everywhere, a bunch of real world software is actually written in it and with all the different versions and modules (and versions of modules) there’s a billion subtle differences to trip you up.
There’s also the group of people (like me) who don’t particularly like it, for a bunch of quite valid reasons. The lack of being strongly typed is a huge barrier for me.
I am of the opinion that the ideal language with the ideal compiler would not let buggy code compile. It may not be as easy to program in this hypothetical language, but seeing as code has to exist and be debugged for order of magnitudes more time than it takes to write it, making it harder to write bugs is a good thing. After all, my experience with Python apps is that bugs manifest themselves at run time, to the user, rather than to the developer at the time of writing. Also, compiler error is better than unit test failure.
I am proud to announce OpenStack Live, a new annual conference that will run in parallel with the Percona Live MySQL Conference & Expo at the Santa Clara Convention Center in Silicon Valley. The inaugural event, OpenStack Live 2015, is April 13-14, 2015. We are lining up a strong Conference Committee and are now accepting tutorial and breakout session speaking proposals through November 9.
OpenStack Live will emphasize the essential elements of making OpenStack work better with emphasis on the critical role of MySQL and the value of Trove. You’ll hear about the hottest current topics, learn about operating a high-performing OpenStack deployment, and listen to top industry leaders describe the future of the OpenStack ecosystem. We are seeking speaking proposals on the following topics:
- Performance Optimization of OpenStack
- OpenStack Operations
- OpenStack Trove
- Replication and Backup for OpenStack
- High Availability for OpenStack
- OpenStack User Stories
- Monitoring and Tools for OpenStack
The conference features a full day of keynotes, breakout sessions, and Birds of a Feather sessions on April 14 preceded by an optional day of tutorials on April 13. A Monday reception will be held on the exhibit floor and joint lunches with both conferences offer you the opportunity to network with both the OpenStack and MySQL communities from both conferences. The OpenStack Live conference is a great event for users of any level.
As a bonus, OpenStack Live attendees may attend any Percona Live MySQL Conference session during the days of the OpenStack event. Conference only passes are available for April 14 and conference and tutorial passes are available for both April 13 and 14.
If you are using OpenStack and have a story to share – or a skill to teach – then now is the time to put pen to paper (or fingers to keyboard) and write your speaking proposal for either breakout or tutorial sessions (or both). Submissions will be reviewed by the OpenStack Live Conference Committee, which includes:
- Mark Atwood: Director – Open Source Evangelism for HP Cloud Services
- Rich Bowen: OpenStack Community Liaison at Red Hat
- Jason Rouault: Senior Director OpenStack Cloud at Time Warner Cable
- Peter Boros: Principal Architect at Percona
Presenting at OpenStack Live 2015 is your chance to put your ideas, case studies, best practices and technical knowledge in front of an intelligent, engaged audience of OpenStack users. If selected as a speaker by our Conference Committee, you will receive a complimentary full conference pass.
Public speaking not your thing or just want to learn about the latest and greatest OpenStack technologies, deployments and projects? Then register now and save big with our early bird discount. OpenStack Live 2015 is an ideal opportunity for organizations to connect with the community of OpenStack enthusiasts from Silicon Valley and around the world. The Percona Live MySQL Conference this past April had over 1,100 registered attendees from 40 countries and the OpenStack Open Source Appreciation Day on the Monday before the conference was fully booked so don’t delay, register today to save your seat!
We are currently accepting sponsors. You can learn more about sponsorship opportunities here.
I hope to see you at OpenStack Live 2015 next April! And speakers, remember the deadline to submit your proposals is November 9. In the meantime you can learn more by visiting the official OpenStack Live 2015 website.
The post OpenStack Live 2015: Call for speakers open through November 9 appeared first on MySQL Performance Blog.
In an interesting move that seems to be predominantly an acquihire, HP has bought Eucalyptus for an undisclosed sum, though speculation is that the deal’s under $100m, less than a 2x multiple on what Eucalyptus has raised in funding (although that would still be a huge multiple on revenue).
Out of this, HP gets Eucalyptus’s CEO, Marten Mickos, who will be installed as the head of HP’s cloud business, reporting to Meg Whitman. It also gets Eucalyptus’s people, including its engineering staff, whom they believe to really have expertise in what HP termed (in a discussion with myself and a number of other Gartner colleagues) the “Amazon architectural pattern”. Finally, it gets Eucalyptus’s software, although this seems to have been very secondary to the people and their know-how — unsurprising given HP’s commitment to OpenStack at the core of HP Helion.
Eucalyptus will apparently be continuing onward within HP. Mickos had indicated something of a change in direction previously, when he explained in a blog post why he would be keynoting an OpenStack conference. It seems like Eucalyptus had been headed in the direction of being an open-source cloud management platform (CMP) that provides an AWS API-compatible framework over a choice of underlying components, including OpenStack component options. In this context, it makes sense to have a standalone Eucalyptus product / add-on, providing an AWS-compatible private cloud software option to customers for whom this is important — and it sidesteps the OpenStack community debate on whether or not AWS compatibility should be important within OpenStack itself.
HP did not answer my direct question if Eucalyptus’s agreement with Amazon includes a change-of-control clause, but they did say that partnerships require ongoing collaboration between the two parties. I interpreted that to mean that AWS has some latitude to determine what they do here. The existing partnership has been an API licensing deal — specifically, AWS has provided Eucalyptus with engineering communications around their API specifications, without any technology transfer or documentation. The partnership been important to ensuring that Eucalyptus replicates AWS behavior as closely as possible, so the question of whether AWS continues to partner going forward is likely important to the fidelity of future Eucalyptus work.
It’s important to note that Eucalyptus is by no means a full AWS clone. It offers the EC2, S3, and IAM APIs, including relatively full support for EC2 features such as EBS. However, it does not support the VPC networking features. And of course, it’s missing the huge array of value-added capabilities that surround the basic compute and storage resources. It’s not as if HP or anyone else is going to take Eucalyptus and build a service that is seriously competitive to AWS. Eucalyptus had mostly found its niche serving SMBs who wanted to run a CMP that would support the most common AWS compute capabilities, either in a hybrid cloud mode (i.e., for organizations still doing substantial things in AWS) or as an on-prem alternative to public cloud IaaS.
Probably importantly to the future success of HP Helion and OpenStack, though, Mickos’s management tenure at Eucalyptus included turning the product from its roots as a research project, into much slicker commercial software that was relatively easy to install and run, without requiring professional services for implementation. He also turned its sales efforts to focus on SMBs with a genuine cloud agility desire, rather than chasing IT operations organizations looking for a better virtualization mousetrap (another example of bimodal IT thinking). Eucalyptus met with limited commercial success — but thus far, CloudStack and OpenStack haven’t fared much better. This has been, at least in part, a broader issue with the private cloud market and the scope of capabilities of the open-source products.
Of the many leaders that HP could have chosen for its cloud division, the choice of Mickos is an interesting one; he’s best known for being CEO of MySQL and eventually selling it to Sun, and thus he makes most sense as a leader in the context of open-source-oriented thinking. I’m not inclined to call the HP-Eucalyptus acquisition a game-changer, but I do think it’s an interesting indicator of HP’s thinking — although it perhaps further muddies waters that are already pretty muddy. The cloud strategies of IBM, Microsoft, Oracle, and VMware, for instance, are all very clear to me. HP hasn’t reached that level of crispness, even if they insist that they’ve got a plan and are executing on it.
Edit: Marten Mickos contacted in me in email to clarify the Amazon/Eucalyptus partnership, and to remind me that MySQL was sold to Sun, not Oracle. I’ve made the corrections.
Percona XtraBackup enables backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backup.
- Percona XtraBackup has implemented support for Galera GTID auto-recovery. Percona XtraBackup retrieves the GTID information, after backing up a server with backup locks support, from the InnoDB trx header on recovery and creates the xtrabackup_galera_info during that stage.
- Percona XtraBackup is now built with system zlib library instead of the older bundled one. Bug fixed #1108016.
- apt-get source was downloading older version of Percona XtraBackup. Bug fixed #1363259.
- innobackupex would ignore the innobackupex --databases without innobackupex --stream option and back up all the databases. Bug fixed #569387.
- rsync package wasn’t a dependency although it is required for the innobackupex --rsync option. Bug fixed #1259436.
- innobackupex --galera-info was checking only for non-capitalized wsrep_* status variables which was incompatible with MariaDB Galera Cluster 10.0. Bug fixed #1306875.
- Percona XtraBackup would crash trying to remove absent table from InnoDB data dictionary while preparing a partial backup. Bug fixed #1340717.
- Percona XtraBackup now supports MariaDB GTID. Bugs fixed #1329539 and #1326967 (Nirbhay Choubey).
- MariaDB 10.1 is now added to the list of supported servers. Bug fixed #1364398.
- Percona XtraBackup would fail to restore (copy-back) tables that have partitions with their own tablespace location. Bug fixed #1322658.
I’ll have the pleasure to present, next Wednesday, September 17 at 10 a.m. PDT (1 p.m. EDT) a webinar titled “OpenStack: A MySQL DBA Perspective.” Everyone is invited.
The webinar will be divided into two parts. The first part will cover how MySQL can be used by the OpenStack infrastructure including the expected load, high-availability solutions and geo-DR.
The second part will focus on the use of MySQL within an OpenStack cloud. We’ll look into the various options that are available, the traditional ones and Trove. We’ll also discuss the block device options in regards with MySQL performance and, finally, we’ll review the high-availability implications of running MySQL in an OpenStack cloud.
Register here. I look forward to your questions, and if you have any related to OpenStack that I can help with in advance of the webinar please feel free to post those in the comments section below. I’ll write a followup post after the webinar to recap all related questions and answers. I’ll also provide the slides.
See you next Wednesday!
The post OpenStack: A MySQL DBA Perspective – Sept. 17 webinar appeared first on MySQL Performance Blog.
I was fortunate to attend an Ops discussion about databases at the OpenStack Summit Atlanta this past May as one of the panelists. The discussion was about deadlock issues OpenStack operators see with Percona XtraDB Cluster (of course this is applicable to any Galera-based solution). I asked to describe what they are seeing, and as it turned out, nova and neutron uses the SELECT … FOR UPDATE SQL construct quite heavily. This is a topic I thought was worth writing about.Write set replication in a nutshell (with oversimplification)
Any node is writable, and replication happens in write sets. A write set is practically a row based binary log event or events and “some additional stuff.” The “some additional stuff” is good for 2 things.
- Two write sets can be compared and told if they are conflicting or not.
- A write set can be checked against a database if it’s applicable.
Before committing on the originating node, the write set is transferred to all other nodes in the cluster. The originating node checks that the transaction is not conflicting with any of the transactions in the receive queue and checks if it’s applicable to the database. This process is called certification. After the write set is certified the transaction is committed. The remote nodes will do certification asynchronously compared to the local node. Since the certification is deterministic, they will get the same result. Also the write set on the remote nodes can be applied later because of this reason. This kind of replication is called virtually synchronous, which means that the data transfer is synchronous, but the actual apply is not.
We have a nice flowchat about this.
Since the write set is only transferred before commit, InnoDB row level locks, which are held locally, are not held on remote nodes (if these were escalated, each row lock would take a network round trip to acquire). This also means that by default if multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node can still sit in the receive queue of the node the application is reading from, waiting to be applied.SELECT … FOR UPDATE
The SELECT … FOR UPDATE construct reads the given records in InnoDB, and locks the rows that are read from the index the query used, not only the rows that it returns. Given how write set replication works, the row locks of SELECT … FOR UPDATE are not replicated.Putting it together
Let’s create a test table.CREATE TABLE `t` ( `id` int(11) NOT NULL AUTO_INCREMENT, `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1
And some records we can lock.pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.00 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec)pxc1> select * from t; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:01 | | 4 | 2014-06-26 21:37:02 | | 7 | 2014-06-26 21:37:02 | | 10 | 2014-06-26 21:37:03 | | 13 | 2014-06-26 21:37:03 | +----+---------------------+ 5 rows in set (0.00 sec)
On the first node, lock the record.pxc1> start transaction; Query OK, 0 rows affected (0.00 sec) pxc1> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:01 | +----+---------------------+ 1 row in set (0.00 sec)
On the second, update it with an autocommit transaction.pxc2> update t set ts=now() where id=1; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 pxc1> select * from t; ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
Let’s examine what happened here. The local record lock held by the started transation on pxc1 didn’t play any part in replication or certification (replication happens at commit time, there was no commit there yet). Once the node received the write set from pxc2, that write set had a conflict with a transaction still in-flight locally. In this case, our transaction on pxc1 has to be rolled back. This is a type of conflict as well, but here the conflict is not caught on certification time. This is called a brute force abort. This happens when a transaction done by a slave thread conflict with a transaction that’s in-flight on the node. In this case the first commit wins (which is the already replicated one) and the original transaction is aborted. Jay Janssen discusses multi-node writing conflicts in detail in this post.
The same thing happens when 2 of the nodes are holding record locks via select for update. Whichever node commits first will win, the other transaction will hit the deadlock error and will be rolled back. The behavior is correct.
Here is the same SELECT … FOR UPDATE transaction overlapping on the 2 nodes.pxc1> start transaction; Query OK, 0 rows affected (0.00 sec) pxc2> start transaction; Query OK, 0 rows affected (0.00 sec)pxc1> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:48 | +----+---------------------+ 1 row in set (0.00 sec) pxc2> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:48 | +----+---------------------+ 1 row in set (0.00 sec)pxc1> update t set ts=now() where id=1; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 pxc2> update t set ts=now() where id=1; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0pxc1> commit; Query OK, 0 rows affected (0.00 sec) pxc2> commit; ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transactionWhere does this happen in OpenStack?
For example in OpenStack Nova (the compute project in OpenStack), tracking the quota usage uses the SELECT…FOR UPDATE construct.# User@Host: nova[nova] @ [10.10.10.11] Id: 147 # Schema: nova Last_errno: 0 Killed: 0 # Query_time: 0.001712 Lock_time: 0.000000 Rows_sent: 4 Rows_examined: 4 Rows_affected: 0 # Bytes_sent: 1461 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # InnoDB_trx_id: C698 # QC_Hit: No Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: No Filesort_on_disk: No Merge_passes: 0 # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000 # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000 # InnoDB_pages_distinct: 2 SET timestamp=1409074305; SELECT quota_usages.created_at AS quota_usages_created_at, quota_usages.updated_at AS quota_usages_updated_at, quota_usages.deleted_at AS quota_usages_deleted_at, quota_usages.deleted AS quota_usages_deleted, quota_usages.id AS quota_usages_id, quota_usages.project_id AS quota_usages_project_id, quota_usages.user_id AS quota_usages_user_id, quota_usages.resource AS quota_usages_resource, quota_usages.in_use AS quota_usages_in_use, quota_usages.reserved AS quota_usages_reserved, quota_usages.until_refresh AS quota_usages_until_refresh FROM quota_usages WHERE quota_usages.deleted = 0 AND quota_usages.project_id = '12ce401aa7e14446a9f0c996240fd8cb' FOR UPDATE;So where does it come from?
These constructs are generated by SQLAlchemy using with_lockmode(‘update’). Even in nova’s pydoc, it’s recommended to avoid with_lockmode(‘update’) whenever possible. Galera replication is not mentioned among the reasons to avoid this construct, but knowing how many OpenStack deployments are using Galera for high availability (either Percona XtraDB Cluster, MariaDB Galera Cluster, or Codership’s own mysql-wsrep), it can be a very good reason to avoid it. The solution proposed in the linked pydoc above is also a good one, using an INSERT INTO … ON DUPLICATE KEY UPDATE is a single atomic write, which will be replicated as expected, it will also keep correct track of quota usage.
The simplest way to overcome this issue from the operator’s point of view is to use only one writer node for these types of transactions. This usually involves configuration change at the load-balancer level. See this post for possible load-balancer configurations.
The post OpenStack users shed light on Percona XtraDB Cluster deadlock issues appeared first on MySQL Performance Blog.
There are a lot of tools that generate test data. Many of them have complex XML scripts or GUI interfaces that let you identify characteristics about the data. For testing query performance and many other applications, however, a simple quick and dirty data generator which can be constructed at the MySQL command line is useful.
First, let’s talk about what kind of data you can easily create with MySQL function calls:
You can generate a decimal number between zero and another number using the MySQL RAND() function like the following query (here between 0 and 10000):SELECT RAND() * 10000;
Similarly, you can generate a random integer by adding the FLOOR() function:SELECT FLOOR(RAND() * 10000)
You can generate a random string of 32 characters using MD5():SELECT MD5(RAND() * 10000)
You can return a random integer between 500 and 1000 with the following:SELECT FLOOR( 500 + RAND() * (1000 - 500))
You can return a random string from a list of strings by using a table to hold the list. A subselect can select a random name from the list of names.create table names(id int auto_increment primary key, name varchar(20)); insert into names (name) values ('Justin','Jerry','James','Josh','Julien'); select (select name from names where id = 1 + rand() * 4);
Now we can generate a “fact” table with many rows using fairly simple SQL statements.
First create a table to generate data into:CREATE TABLE fact ( dim1 int, dim2 int, name varchar(20), hash varchar(32), measure1 double );
Seed the table with one initial row:INSERT INTO fact VALUES (1,1,'Justin',md5(''), .1);
Now grow the table by selecting from the table but providing new random values for the inserted rows:INSERT INTO fact SELECT FLOOR(1+ rand()*9999), FLOOR(1 + rand()*499), (select name from names where id = 1 + rand() * 4), MD5(1+rand()*9999), rand() FROM fact;
As you repeat the INSERT … SELECT, the table will grow exponentially. You may want to add a LIMIT clause to the INSERT … SELECT to reduce the amount of data generated as the table grows.
You will create a table with an even data distribution for each column. You can then add some queries to add skew, either using INSERT … SELECT or UPDATE, for example:INSERT INTO fact SELECT 1,1,'Justin',md5(''), .1 FROM fact LIMIT 10000;
That will skew the values by creating many rows with the same data as our initial row.
Using these simple tools, you can generate a data set that is great for testing purposes. For example, dim1 might be a customer_id and dim2 a product_id, and you would populate those tables with 10000 and 500 rows, respectively.
A common migration path from standalone MySQL/Percona Server to a Percona XtraDB Cluster (PXC) environment involves some measure of time where one node in the new cluster has been configured as a slave of the production master that the cluster is slated to replace. In this way, the new cluster acts as a slave of the production environment – traditional replication takes care of getting the data into the cluster, and then Galera replication handles the intra-cluster traffic. This often works without issue, although there is one case that I’ve encountered recently where special care must be taken to properly configure the stream to ensure that replication does not break. If you use multi-valued inserts with auto-increment columns, then this post is for you.
For purposes of our discussion, assume that we have a basic 3-node PXC cluster that we’ve set up using the PXC Reference Architecture document, and that we’re replicating from an asynchronous master (call it “server A”) into one of the PXC nodes. Without loss of generality, we’ll pick PXC03. Also, for purposes of our discussion, we’ll be working with the following table definition:serverA> show create table auto_inc_test; CREATE TABLE `auto_inc_test` ( `i` int(11) NOT NULL AUTO_INCREMENT, `stuff` varchar(20) DEFAULT NULL, PRIMARY KEY (`i`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec) serverA> SELECT * FROM auto_inc_test; Empty set (0.00 sec)
If we insert rows into this table one at a time, we have no issues.serverA> INSERT INTO auto_inc_test(stuff) VALUES ('first row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES ('second row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES ('third row'); serverA> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff | +---+------------+ | 1 | first row | | 2 | second row | | 3 | third row | +---+------------+ PXC03> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff | +---+------------+ | 1 | first row | | 2 | second row | | 3 | third row | +---+------------+
But if we start doing multi-valued inserts, we can run into a problem.serverA> INSERT INTO auto_inc_test(stuff) VALUES('first row'),('second row'),('third row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES('fourth row'),('fifth row'); serverA> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff | +---+------------+ | 1 | first row | | 2 | second row | | 3 | third row | | 4 | fourth row | | 5 | fifth row | +---+------------+ PXC03> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff | +---+------------+ | 1 | first row | | 2 | second row | | 5 | third row | +---+------------+ PXC03> SHOW SLAVE STATUS; ... output elided ... Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '5' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'INSERT INTO auto_inc_test (stuff) VALUES ('fourth row'),('fifth row')' ... output elided ...
Uh oh. Replication is broken and our data is now inconsistent. So why does this happen and how can we prevent it?binlog_format
The astute observer will note that I have not yet said anything about the binary log format on the master. If the binary log format on the master is already set to ROW, then the above error will not occur. RBR will properly replicate multi-valued INSERTs to the PXC cluster without issue, and the data will be consistent. Problem solved. However, there may be reasons that the master is not using or cannot use RBR, such as disk space or IOPS limitations, and thus it’s running in MIXED or STATEMENT mode. In that case, we need to look elsewhere….wsrep_auto_increment_control
When set to ON (the default), this variable has the effect of automatically specifying values for auto_increment_increment and auto_increment_offset based on the cluster size. The idea behind it is to help prevent auto-increment value conflicts when writing to multiple nodes. However, what it also means is that in a multi-node cluster, the auto-increment values generated by any given node will never be consecutive and the “next” auto-increment value on the slave cluster node will always be higher than what the master believes it should be. For example:serverA> INSERT INTO auto_inc_test (stuff) VALUES ('first row'),('second row'),('third row'); serverA> SHOW CREATE TABLE auto_inc_test; CREATE TABLE `auto_inc_test` ( `i` int(11) NOT NULL AUTO_INCREMENT, `stuff` varchar(20) DEFAULT NULL, PRIMARY KEY (`i`) ) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1 PXC03> SHOW CREATE TABLE auto_inc_test; CREATE TABLE `auto_inc_test` ( `i` int(11) NOT NULL AUTO_INCREMENT, `stuff` varchar(20) DEFAULT NULL, PRIMARY KEY (`i`) ) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
Hmm, that’s not good.
The above scenario can be avoided by setting wsrep_auto_increment_control to OFF on the node that’s acting as a slave while the cluster is still receiving asynchronous replication traffic. This can be configured in /etc/my.cnf or set dynamically from the MySQL command line with:SET GLOBAL wsrep_auto_increment_control='OFF';
In the testing that I’ve done, this appears to be sufficient to deal with the problem, even if the original master has non-standard values specified for the auto_increment_* variables (e.g., because it’s part of a master-master pair where the odd ID numbers are generated by one server and the even ID numbers from the other one).
In fact, if the cluster is always going to be used in single-writer mode, there’s a compelling argument to be made for setting this variable to OFF on all of the cluster nodes even when going into production: it will prevent the cluster from burning through N auto-increment IDs every time a single row is inserted.Fixing it when it’s broken
There’s no secret magic here. Percona Toolkit’s pt-table-checksum and pt-table-sync can be used to check and repair the data divergence between the master and the slave cluster node, just as if PXC were not involved, although you may find that it’s just faster to rebuild the cluster, depending upon how many diffs are found. Sometimes this issue happens right away when the master-to-cluster replication is started, and fixing it involves just one or two rows; other times I have seen it not occur for days with lots of rows out of sync once it finally breaks.The tl;dr version
In sum, if you’re doing traditional replication into a PXC cluster (or any other flavor of MySQL/MariaDB + Galera), you may have issues with multi-valued INSERT statements, and this can be prevented with a configuration change on either side of the replication stream:
- On the master, set binlog_format=ROW
- On the PXC slave node, set wsrep_auto_increment_control=OFF
The post Multi-Valued INSERTs, AUTO_INCREMENT & Percona XtraDB Cluster appeared first on MySQL Performance Blog.
When a write query is sent to Percona XtraDB Cluster all the nodes store the writeset on a file called gcache. By default the name of that file is galera.cache and it is stored in the MySQL datadir. This is a very important file, and as usual with the most important variables in MySQL, the default value is not good for high-loaded servers. Let’s see why it’s important and how can we calculate a correct value for the workload of our cluster.
What’s the gcache?
When a node goes out of the cluster (crash or maintenance) it obviously stops receiving changes. When you try to reconnect the node to the cluster the data will be outdated. The joiner node needs to ask a donor to send the changes happened during the downtime.
The donor will first try to transfer an incremental (IST), that is, the writesets the cluster received while the node was down. The donor checks the last writeset received by the joiner and then checks local gcache file. If all needed writesets are on that cache the donor sends them to the joiner. The joiner applies them and that’s all, it is up to date and ready to join the cluster. Therefore, IST can only be achieved if all changes missed by the node that went away are still in that gcache file of the donor.
On the other hand, if the writesets are not there a full transfer would be needed (SST) using one of the supported methods, XtraBackup, Rsync or mysqldump.
In a summary, the difference between a IST and SST is the time that a node needs to join the cluster. The difference could be from seconds to hours. In case of WAN connections and large datasets maybe days.
That’s why having a correct gcache is important. It work as a circular log, so when it is full it starts to rewrite the writesets at the beginning. With a larger gcache a node can be out of the cluster more time without requiring a SST. My colleague Jay Janssen explains in more detail about how IST works and how to find the right server to use as donor.
Calculating the correct size
When trick is pretty similar to the one used to calculate the correct InnoDB log file size. We need to check how many bytes are written every minute. The variables to check are:
wsrep_replicated_bytes: Total size (in bytes) of writesets sent to other nodes.
wsrep_received_bytes: Total size (in bytes) of writesets received from other nodes.mysql> show global status like 'wsrep_received_bytes'; show global status like 'wsrep_replicated_bytes'; select sleep(60); show global status like 'wsrep_received_bytes'; show global status like 'wsrep_replicated_bytes'; +----------------------+----------+ | Variable_name | Value | +----------------------+----------+ | wsrep_received_bytes | 83976571 | +----------------------+----------+ +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | wsrep_replicated_bytes | 0 | +------------------------+-------+ [...] +----------------------+----------+ | Variable_name | Value | +----------------------+----------+ | wsrep_received_bytes | 90576957 | +----------------------+----------+ +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | wsrep_replicated_bytes | 800 | +------------------------+-------+
Bytes per minute:
(second wsrep_received_bytes – first wsrep_received_bytes) + (second wsrep_replicated_bytes – first wsrep_replicated_bytes)
(90576957 – 83976571) + (800 – 0) = 6601186 bytes or 6 MB per minute.
Bytes per hour:
6MB * 60 minutes = 360 MB per hour of writesets received by the cluster.
If you want to allow one hour of maintenance (or downtime) of a node, you need to increase the gcache to that size. If you want more time, just make it bigger.
The post How to calculate the correct size of Percona XtraDB Cluster’s gcache appeared first on MySQL Performance Blog.
Based on Percona Server 5.5.39-36.0 including all the bug fixes in it, Galera Replicator 2.11, and on Codership wsrep API 25.11, Percona XtraDB Cluster 5.5.39-25.11 is now the current 5.5 General Availability release. All of Percona‘s software is open-source and free, and all the details of the release can be found in the 5.5.39-25.11 milestone at Launchpad.
- New session variable wsrep_sync_wait has been implemented to control causality check. The old session variable wsrep_causal_reads is deprecated but is kept for backward compatibility (#1277053).
- systemd integration with RHEL/CentOS 7 is now available for Percona XtraDB Cluster from our testing repository (#1342223).
- Percona XtraDB Cluster has implemented threadpool scheduling fixes. Bug fixed #1333348.
- When gmcast.listen_addr was configured to a certain address, local connection point for outgoing connections was not bound to listen address. This would happen if OS has multiple interfaces with IP addresses in the same subnet, it may happen that OS would pick wrong IP for local connection point and other nodes would see connections originating from IP address which was not listened to. Bug fixed #1240964.
- Client connections were closed unconditionally before generating SST request. Fixed by avoiding closing connections when wsrep is initialized before storage engines. Bug fixed #1258658.
- Issue with re-setting galera provider (in wsrep_provider_options) has been fixed. Bug fixed #1260283.
- Variable wsrep_provider_options couldn’t be set in runtime if no provider was loaded. Bug fixed #1260290.
- Node consistency issues with foreign keys have been fixed. This fix introduces two new variables: wsrep_slave_FK_checks and wsrep_slave_UK_checks. These variables are set to TRUE and FALSE respectively by default. They control whether Foreign Key and Unique Key checking is done for applier threads. Bug fixed #1260713.
- When FLUSH TABLES WITH READ LOCK was used on a node with wsrep_causal_reads set to 1 while there was a DML on other nodes then, subsequent SELECTs/SHOW STATUS didn’t hang earlier providing non-causal output, that has been fixed here. Bug fixed #1271177.
- Lowest group communication layer (evs) would fail to handle the situation properly when big number of nodes would suddenly start to see each other. Bugs fixed #1271918 and #1249805.
- Updating a unique key value could cause server hang if slave node has enabled parallel slaves. Bug fixed #1280896.
- Fixed the events replication inconsistencies. Bug fixed #1312618.
- Truncating the sorted version of multi-byte character conversion could lead to wsrep certification failures. Bug fixed #1314854.
- wsrep_slave_threads was counted towards max_connections which could cause ERROR 1040 (HY000): Too many connections error. Bug fixed #1315588.
- Leaving node was not set nonoperational if processed leave message originated from different view than the current one, which could cause other nodes to crash. Bug fixed #1323412 (#41).
- garbd couldn’t be started with init script on RHEL 6.5. Bug fixed #1323652.
- SST would fail when binlogs were in dedicated directory that’s located inside datadir. This bug was a regression introduced by bug fix for #1273368. Bug fixed #1326012.
- GTID of TOI operations is now also synced to InnoDB tablespace in order to get consistent backups. Bug fixed #1329055.
- mysql-debug (UNIV_DEBUG) is now distributed with binary tar.gz along with RPM and DEB packages. Bug fixed #1332073.
- The restart sequence in scripts/mysql.server would fail to capture and return if the start call failed to start the server, so a restart could occur that failed upon start-up, and the script would still return 0 as if it worked without any issues. Bug fixed #1339894.
- wsrep consistency check is now enabled for REPLACE ... SELECT as well. This was implemented because pt-table-checksum uses REPLACE .. SELECT during checksumming. Bug fixed #1343209.
- A memory leak in wsrep_mysql_parse function has been fixed. Bug fixed #1345023.
- SHOW STATUS was generating debug output in the error log. Bug fixed #1347818.
- percona-xtradb-cluster-garbd-3.x package was installed incorrectly on Debian/Ubuntu. Bugs fixed #1360633 and #1334530.
Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!
The post Percona XtraDB Cluster 5.5.39-25.11 is now available appeared first on MySQL Performance Blog.
Thank you to all of you who attended my webinar last week about Global Transaction IDs (GTIDs), which were introduced in MySQL 5.6 to make the reconfiguration of replication straightforward. If you missed my webinar, you can still listen to the recording and download the sides (free). We had a lot of questions during the webinar, so let me try to answer them here. Please let me know in the comments if additional clarification is needed.
Q: Does GTID provide any benefit to master-master replication? If yes, how?
Q: Is ACTIVE ACTIVE MASTER MASTER successful in MySQL with GTID?
A: GTIDs don’t change the nature of MySQL replication: for instance it is still asynchronous and if you write on both masters in a master-master setup, there is still no write conflict detection mechanism. The main benefit of GTIDs is that any change of the replication topology is easy because you no longer need to run complex calculations to find the right binlog positions when connecting a slave to a new master.
So master-master replication can be configured with GTIDs, it does not provide a lot of benefits compared to position-based replication as you will never change the replication topology.
And having a setup where both masters receive writes is still not recommended with GTIDs.
Q: Will GTIDs work well with master:standby master? How quick would the failover be?
A: Yes, GTIDs works well with this kind of setup (which is one master and one slave). GTIDs do not provide failover, you will have to use an external tool. Speed of failover then depends on the tool you will use.
Q: For already set up MASTER-MASTER/MASTER-SLAVE Replication, after getting GTID set up, we need to rebuild replication again using AUTO POS=1, correct?
A: Yes, using MASTER_AUTO_POSITION=1 is necessary to indicate that you will use GTID replication. So you will have to run: STOP SLAVE; CHANGE MASTER TO … MASTER_AUTO_POSITION = 1; START SLAVE;
Q: Application having tables from different Engines(InnoDB and MyISAM), how that will handled in GTID?
A: Transactions using both MyISAM and InnoDB tables are not allowed, please refer to the documentation
Q: In a master-slave replication topology (with GTID enabled), how does slave get data from the master if the master’s binary logs are purged given that AUTO_POSITION=1 is used as part of the change master command?
A: This will break replication with error 1236.
Q: Whats the value of show slave status who determines if there is a lag on the slave?
A: This is Seconds_Behind_Master. It’s not always reliable though. For instance if you have a replication setup like A -> B -> C, Seconds_Behind_Master on C will shop the lag relatively to B, not A.
Q: What is the value of saving the history of previous master’s GTIDs executed in the show slave status -> Executed_Gtid_Set?
A: The new replication protocol makes sure that when the slave connects to its master, it sends the range of GTIDs it has already executed. Then the master sends back all other transactions. That’s why Executed_Gtid_Set contains the history of all executed transactions.
Q: We use DB Master and Slave VIPs on our servers, can the mysqlfailover tool also switch the VIP to the new master? Is it scriptable on the event of a failover?
A: Yes you can use extension points to add you own custom scripts with mysqlfailover. See the documentation for –exec-before and –exec-after.
Q: How does mysqlfailover handle brief network instability between the Master and Slaves?
A: mysqlfailover only triggers failover when it suspects the master is no longer alive. So network instability between the master and its slaves won’t affect it for master crash detection. However it can prevent the tool from reconfiguring replication correctly during failover/switchover if one or several slaves are not reachable.
Q: Does Facebook use MySQL with GTID? if yes, which module or all together for everything?
A: I can’t speak for Facebook, but this talk at the MySQL Conference this year suggests that they’re using GTIDs in production. They have added custom code to make GTIDs easier to use.
Q: is GTID_SUBSET function part of MySQL utilities? or we should set a script to regularly detect it? is GTID_SUBSET beneficial in case of an ACTIVE ACTIVE MASTER MASTER setup?
Q: Can you please confirm how to get gtid set in order to use gtid functions?
A: GTID_SUBSET() is a built-in function in MySQL 5.6, you don’t need to install MySQL Utilities to use it. It can be used to easily know whether Executed_Gtid_Set on a given server is a subset of Executed_Gtid_Set on another server, so it can be beneficial to use it in any replication topology.
Q: What is difference between HOLE and BUGS?
A: Holes are not allowed in MySQL 5.6 implementation of GTIDs. So if you see a hole in a GTID sequence, you’re hitting a bug!
Q: Using MySQL utilities, we can set-up replication also using a python script with just one command. Does it automatically takes dump from master to slave and starts replication? If yes so a 300 GB data directory, will it run as background if executed using shell script? Or it just starts the replicationn from current position and won’t take the dump?
A: I think you are talking about mysqlreplicate. This tool only runs CHANGE MASTER TO for you so it doesn’t take a backup of any kind.
Q: Is it possible to use mysqlfailover script at any node(like slave) any time to know which is its MASTER and other SLAVE options also? If no, is this available by some other means?
A: You should probably use mysqlrplshow instead.
Q: As told during limitations of MySQL Utilities on automatic failover, so how can I achieve AUTOMATIC failover if I want this as primary option?
A: The node running mysqlfailover is not highly available so if it is down you lose the ability of doing automatic failover. There are several options if you want to achieve automatic failover: carefully monitor the monitoring node or use solutions like Percona Replication Manager which relies on Pacemaker or Percona XtraDB Cluster which relies on Galera replication.
Q: mysqlrpadmin failover/switchover: Can we execute this command on slave or any other monitoring node?
A: Yes, as long as mysqlrpladmin is installed on a server and if it can connect to the database servers, the command can be executed from anywhere.
Q: If we set slaves to read_only…is that recommended to eliminate the errant transaction?
A: Yes, it is recommended. However it doesn’t prevent users with the SUPER privilege from accidentally writing on a slave.
Q: Can this errant transaction issue be prevented with (active-passive) master-master replication?
A: Master-master replication will make sure that any transaction written on one server will automatically end up being written on the other slave. So it is like avoiding errant transactions. However writing on both masters is not recommended as you can have write conflicts.
Q: Is there any practical advantage of using GTID with MHA?
A: GTIDs provide no way to perform failover, they only simplify how you can reconfigure replication. So using MHA to leverage GTIDs makes sense.
Q: What is diifference GUID and GTID? When do we have to use GUID and GTID?
A: I’m not sure I correctly understand the question. A GTID is made of a source id and a transaction id. The source id is the master’s server_uuid, which is a GUID that is automatically generated when MySQL is started for the first time.
Q: How to check the slave database tables and record? we have to sync data from master to salve database or automatically will be happen?
A: You can use pt-table-checksum and pt-table-sync from Percona Toolkit.
Thanks again for attending the webinar! You can replay it, download the slide – and also access Percona’s vast library of other MySQL webinar recordings here.
The post Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A appeared first on MySQL Performance Blog.
Daniel was tracking down what appeared to be a networking problem….
- server reported 113 (No route to host)
- However, an strace did not reveal the networking stack ever returning that.
- On the other side, IP packets were actually received.
- When confronted with mysteries like this, I get suspicious – mainly of (fellow) programmers.
- I suggested a grep through the source code, which revealed return -EHOSTUNREACH;
- Mystery solved, which allowed us to find what was actually going on.
- Don’t just believe or presume the supposed origin of an error.
- Programmers often take shortcuts that cause grief later. I fully appreciate how the above code came about, but I still think it was wrong. Mapping a “similar” situation onto an existing error code is convenient. But when an error occurs, the most important thing is for people to be able to track down what the root cause is. Reporting this error outside of its original context (error code reported by network stack) is clearly unhelpful, it actually misdirects and requires people to essentially waste time to track it down (as above).
- Horay once again for Open Source, which makes it so much easier to figure these things out. While possibly briefly embarrassing for the programmer, more eyes allows code to improve better and faster – and, perhaps, also entices towards better coding practices from the outset (I can hope!).
What do you think?
Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block devices iops and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular tools and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. The pt-diskstats tool is similar to iostat but it’s more interactive and contains extended information. pt-diskstats reports current disk activity and shows the statistics for the last second (which by default is 1 second) and will continue until interrupted. The pt-diskstats tool collects samples of /proc/diskstats.
In this post, I will share some examples about how to monitor and check to see if the IO subsystem is performing properly or if any disks are a limiting factor – all this by using the pt-diskstats tool.
pt-diskstats output consists on number of columns and in order to interpret pt-diskstats output we need to know what each column represents.
- rd_s tells about number of reads per second while wr_s represents number of writes per second.
- rd_rt and wr_rt shows average response time in milliseconds for reads & writes respectively, which is similar to iostat tool output await column but pt-diskstats shows individual response time for reads and writes at disk level. Just a note, modern iostat splits read and write latency out, but most distros don’t have the latest iostat in their systat (or equivalent) package.
- rd_mrg and wr_mrg are other two important columns in pt-diskstats output. *_mrg is telling us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so *_mrg is telling us a quite important thing by letting us know that the IO scheduler was able to consolidate many or few operations. If rd_mrg/wr_mrg is high% then the IO workload is sequential on the other hand, If rd_mrg/wr_mrg is a low% then IO workload is all random. Binary logs, redo logs (aka ib_logfile*), undo log and doublewrite buffer all need sequential writes.
- qtime and stime are last two columns in pt-diskstats output where qtime reflects to time spent in disk scheduler queue i.e. average queue time before sending it to physical device and on the other hand stime is average service time which is time accumulated to process the physical device request. Note, that qtime is not discriminated between reads and writes and you can check if response time is higher for qtime than it signal towards disk scheduler. Also note that service time (stime field and svctm field in in pt-diskstats & iostat output respectively) is not reliable on Linux. If you read the iostat manual you will see it is deprecated.
Along with that, there are many other parameters for pt-diskstats – you can found full documentation here. Below is an example of pt-disktats in action. I used the –devices-regex option which prints only device information that matches this Perl regex.$ pt-diskstats --devices-regex=sd --interval 5 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.1 sda 21.6 22.8 0.5 45% 1.2 29.4 275.5 4.0 1.1 0% 40.0 145.1 65% 158 297.1 155.0 2.1 1.1 sdb 15.0 21.0 0.3 33% 0.1 5.2 0.0 0.0 0.0 0% 0.0 0.0 11% 1 15.0 0.5 4.7 1.1 sdc 5.6 10.0 0.1 0% 0.0 5.2 1.9 6.0 0.0 33% 0.0 2.0 3% 0 7.5 0.4 3.6 1.1 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 5.0 sda 17.0 14.8 0.2 64% 3.1 66.7 404.9 4.6 1.8 14% 140.9 298.5 100% 111 421.9 277.6 1.9 5.0 sdb 14.0 19.9 0.3 48% 0.1 5.5 0.4 174.0 0.1 98% 0.0 0.0 11% 0 14.4 0.9 2.4 5.0 sdc 3.6 27.1 0.1 61% 0.0 3.5 2.8 5.7 0.0 30% 0.0 2.0 3% 0 6.4 0.7 2.4 5.0 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
These are the stats from 7200 RPM SATA disks. As you can see, the write-response time is very high and most of that is made up of IO queue time. This shows the problem exactly. The problem is that the IO subsystem is not able to handle the write workload because the amount of writes that are being performed are way beyond what it can handle. It means the disks cannot service every request concurrently. The workload would actually depend a lot on where the hot data is stored and as we can see in this particular case the workload only hits a single disk out of the 4 disks. A single 7.2K RPM disk can only do about 100 random writes per second which is not a lot considering heavy workload.
It’s not particularly a hardware issue but a hardware capacity issue. The kind of workload that is present and the amount of writes that are performed per second are not something that the IO subsystem is able to handle in an efficient manner. Mostly writes are generated on this server as can be seen by the disk stats.
Let me show you a second example. Here you can see read latency. rd_rt is consistently between 10ms-30ms. It depends on how fast the disks are spinning and the number of disks. To deal with it possible solutions would be to optimize queries to avoid table scans, use memcached where possible, use SSD’s as it can provide good I/O performance with high concurrency. You will find this post useful on SSD’s from our CEO, Peter Zaitsev.#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 sdb 33.0 29.1 0.9 0% 1.1 34.7 7.0 10.3 0.1 61% 0.0 0.4 99% 1 40.0 2.2 19.5 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 7.0 10.3 0.1 61% 0.0 0.4 1% 0 7.0 0.0 0.4 1.0 sdb2 33.0 29.1 0.9 0% 1.1 34.7 0.0 0.0 0.0 0% 0.0 0.0 99% 1 33.0 3.5 30.2 1.0 sdb 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 1.0 sdb2 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb 50.0 25.7 1.3 0% 1.3 25.1 13.0 11.7 0.1 66% 0.0 0.7 99% 1 63.0 3.4 11.3 1.0 sdb1 25.0 21.3 0.5 0% 0.6 25.2 13.0 11.7 0.1 66% 0.0 0.7 46% 1 38.0 3.2 7.3 1.0 sdb2 25.0 30.1 0.7 0% 0.6 25.0 0.0 0.0 0.0 0% 0.0 0.0 56% 0 25.0 3.6 22.2
From the below diskstats output it seems that IO is saturated between both reads and writes. This can be noticed with high value for columns rd_s and wr_s. In this particular case, consider having disks in either RAID 5 (better for read only workload) or RAID 10 array is good option along with battery-backed write cache (BBWC) as single disk can really be bad for performance when you are IO bound.device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime sdb1 362.0 27.4 9.7 0% 2.7 7.5 525.2 20.2 10.3 35% 6.4 8.0 100% 0 887.2 7.0 0.9 sdb1 439.9 26.5 11.4 0% 3.4 7.7 545.7 20.8 11.1 34% 9.8 11.9 100% 0 985.6 9.6 0.8 sdb1 576.6 26.5 14.9 0% 4.5 7.8 400.2 19.9 7.8 34% 6.7 10.9 100% 0 976.8 8.6 0.8 sdb1 410.8 24.2 9.7 0% 2.9 7.1 403.1 18.3 7.2 34% 10.8 17.7 100% 0 813.9 12.5 1.0 sdb1 378.4 24.6 9.1 0% 2.7 7.3 506.1 16.5 8.2 33% 5.7 7.6 100% 0 884.4 6.6 0.9 sdb1 572.8 26.1 14.6 0% 4.8 8.4 422.6 17.2 7.1 30% 1.7 2.8 100% 0 995.4 4.7 0.8 sdb1 429.2 23.0 9.6 0% 3.2 7.4 511.9 14.5 7.2 31% 1.2 1.7 100% 0 941.2 3.6 0.9
The following example reflects write heavy activity but write-response time is very good, under 1ms, which shows disks are healthy and capable of handling high number of IOPS.#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 dm-0 530.8 16.0 8.3 0% 0.3 0.5 6124.0 5.1 30.7 0% 1.7 0.3 86% 2 6654.8 0.2 0.1 2.0 dm-0 633.1 16.1 10.0 0% 0.3 0.5 6173.0 6.1 36.6 0% 1.7 0.3 88% 1 6806.1 0.2 0.1 3.0 dm-0 731.8 16.0 11.5 0% 0.4 0.5 6064.2 5.8 34.1 0% 1.9 0.3 90% 2 6795.9 0.2 0.1 4.0 dm-0 711.1 16.0 11.1 0% 0.3 0.5 6448.5 5.4 34.3 0% 1.8 0.3 92% 2 7159.6 0.2 0.1 5.0 dm-0 700.1 16.0 10.9 0% 0.4 0.5 5689.4 5.8 32.2 0% 1.9 0.3 88% 0 6389.5 0.2 0.1 6.0 dm-0 774.1 16.0 12.1 0% 0.3 0.4 6409.5 5.5 34.2 0% 1.7 0.3 86% 0 7183.5 0.2 0.1 7.0 dm-0 849.6 16.0 13.3 0% 0.4 0.5 6151.2 5.4 32.3 0% 1.9 0.3 88% 3 7000.8 0.2 0.1 8.0 dm-0 664.2 16.0 10.4 0% 0.3 0.5 6349.2 5.7 35.1 0% 2.0 0.3 90% 2 7013.4 0.2 0.1 9.0 dm-0 951.0 16.0 14.9 0% 0.4 0.4 5807.0 5.3 29.9 0% 1.8 0.3 90% 3 6758.0 0.2 0.1 10.0 dm-0 742.0 16.0 11.6 0% 0.3 0.5 6461.1 5.1 32.2 0% 1.7 0.3 87% 1 7203.2 0.2 0.1
Let me show you a final example. I used –interval and –iterations parameters for pt-diskstats which tells us to wait for a number of seconds before printing the next disk stats and to limit the number of samples respectively. If you notice, you will see in 3rd iteration high latency (rd_rt, wr_rt) mostly for reads. Also, you can notice a high value for queue time (qtime) and service time (stime) where qtime is related to disk IO scheduler settings. For MySQL database servers we usually recommends noop/deadline instead of default cfq.$ pt-diskstats --interval=20 --iterations=3 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 10.4 hda 11.7 4.0 0.0 0% 0.0 1.1 40.7 11.7 0.5 26% 0.1 2.1 10% 0 52.5 0.4 1.5 10.4 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 7.0 0.0 43% 0.0 0.1 0% 0 0.4 0.0 0.1 10.4 hda3 0.0 0.0 0.0 0% 0.0 0.0 0.4 107.0 0.0 96% 0.0 0.2 0% 0 0.4 0.0 0.2 10.4 hda5 0.0 0.0 0.0 0% 0.0 0.0 0.7 20.0 0.0 80% 0.0 0.3 0% 0 0.7 0.1 0.2 10.4 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.1 4.0 0.0 0% 0.0 4.0 0% 0 0.1 0.0 4.0 10.4 hda9 11.7 4.0 0.0 0% 0.0 1.1 39.2 10.7 0.4 3% 0.1 2.7 9% 0 50.9 0.5 1.8 10.4 drbd1 11.7 4.0 0.0 0% 0.0 1.1 39.1 10.7 0.4 0% 0.1 2.8 9% 0 50.8 0.5 1.7 20.0 hda 14.6 4.0 0.1 0% 0.0 1.4 39.5 12.3 0.5 26% 0.3 6.4 18% 0 54.1 2.6 2.7 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 9.1 0.0 56% 0.0 42.0 3% 0 0.4 0.0 42.0 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 1.5 22.3 0.0 82% 0.0 1.5 0% 0 1.5 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 1.1 18.9 0.0 79% 0.1 21.4 11% 0 1.1 0.1 21.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.8 10.4 0.0 62% 0.0 1.5 0% 0 0.8 1.3 0.2 20.0 hda9 14.6 4.0 0.1 0% 0.0 1.4 35.8 11.7 0.4 3% 0.2 4.9 18% 0 50.4 0.5 3.5 20.0 drbd1 14.6 4.0 0.1 0% 0.0 1.4 36.4 11.6 0.4 0% 0.2 5.1 17% 0 51.0 0.5 3.4 20.0 hda 0.9 4.0 0.0 0% 0.2 251.9 28.8 61.8 1.7 92% 4.5 13.1 31% 2 29.6 12.8 0.9 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.6 8.3 0.0 52% 0.1 98.2 6% 0 0.6 48.9 49.3 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 2.0 23.2 0.0 83% 0.0 1.4 0% 0 2.0 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 4.9 249.4 1.2 98% 4.0 13.2 9% 0 4.9 12.9 0.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 20.0 hda9 0.9 4.0 0.0 0% 0.2 251.9 21.3 24.2 0.5 32% 0.4 12.9 31% 2 22.2 10.2 9.7 20.0 drbd1 0.9 4.0 0.0 0% 0.2 251.9 30.6 17.0 0.5 0% 0.7 24.1 30% 5 31.4 21.0 9.5
You can see the busy column in pt-diskstats output which is the same as the util column in iostat – which points to utilization. Actually, pt-diskstats is quite similar to the iostat tool but pt-diskstats is more interactive and has more information. The busy percentage is only telling us for how long the IO subsystem was busy, but is not indicating capacity. So the only time you care about %busy is when it’s 100% and at the same time latency (await in iostat and rd_rt/wr_rt in diskstats output) increases over -say- 5ms. You can estimate capacity of your IO subsystem and then look at the IOPS being consumed (r/s + w/s columns). Also, the system can process more than one request in parallel (in case of RAID) so %busy can go beyond 100% in pt-diskstats output.
If you need to check disk throughput, block device IOPS run the following to capture metrics from your IO subsystem and see if utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the –group-by option. You can use the sysbench benchmark tool for this purpose to measure database server performance. You will find this link useful for sysbench tool details.$ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out;
pt-diskstats is one of the finest tools from Percona Toolkit. By using this tool you can easily spot disk bottlenecks, measure the IO subsystem and identify how much IOPS your drive can handle (i.e. disk capacity).
Join me online next week (September 10 at 10 a.m. PDT) for my live webinar on Migrating to Percona XtraDB Cluster. This was a popular webinar that I gave a few years ago, so I’m doing it again with updates for Percona XtraDB Cluster 5.6 (PXC) and all the latest in the Galera world.
This webinar will be really good for people interested in getting an overview of what PXC/Galera is, what it would take to adopt it for your application, and some of the differences and challenges it brings compared with a conventional MySQL Master/slave setup. I’d highly suggest attending if you are considering Galera in your environment and want to get a better understanding of its uses and antipatterns.
Additionally, I’ll cover such questions as:
- What are the requirements for running Percona XtraDB Cluster?
- Will I have to reload all my tables?
- How does configuration for the cluster differ from configuring a stand-alone InnoDB server?
- How should my application interact with the Cluster?
- Can I use Percona XtraDB Cluster if I only have two MySQL servers currently?
- How can I move to the Cluster and keep downtime to a minimum?
- How can I migrate to Percona XtraDB Cluster gradually?
I hope to see you next Wednesday. And please feel free to ask questions in advance in the comments section below. Next week’s live event, like all of our MySQL webinars, is free. Register here!
The post Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar appeared first on MySQL Performance Blog.