MariaDB Galera cluster and GTID

In MariaDB 10.2.12, these two don't yet work together reliably. GTID stands for Global Transaction ID. In the master-slave asynchronous replication realm, it means you can reconnect a slave to another server (change its master) and it'll happily continue replicating from the correct point. No more fussing with binlog filenames and offsets (which of course differ between machines).
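
For example, repointing a slave to a different master then needs no binlog file name or offset at all – a minimal sketch, with a made-up host name:

-- On the slave: repoint it purely by GTID position (host name is made up).
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'newmaster.example.com',
  MASTER_USE_GTID = slave_pos;
START SLAVE;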

So in concept the GTID is “globally” unique – that means it’s consistent across an entire infra: a binlogged write transaction will have the same GTID no matter on which machine you look at it.

  • OK: if you are transitioning from async replication to Galera cluster, and have a cluster as slave of the old infra, then GTID will work fine.
  • PROBLEM: if you want to run an async slave in a Galera cluster, GTID will currently not work. At least not reliably.

The overview issue is MDEV-10715, the specific problem is documented in MDEV-14153 with some comments from me from late last week. MDEV-14153 documents cases where the GTID is not in fact consistent – and the way in which it isn’t is most disturbing.

The issue appears as “drift”. A GTID is made up of R-S-#, where R is the replication domain (0 unless set by an app), S is the server-id of the server where the write was originally done, and # is just a sequence number. The only required rule for the # is that each next event has to have a higher number than the previous one.  In principle there could be #s missing, that’s ok.
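
To make the R-S-# format concrete, here is roughly where each part comes from on a MariaDB server (the values shown are only illustrative):

SELECT @@gtid_domain_id;    -- the R part; 0 unless explicitly configured
SELECT @@server_id;         -- the S part, e.g. 1717
SELECT @@gtid_binlog_pos;   -- full GTID position, e.g. 0-1717-1672559880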

In certain scenarios, the # part of the GTID falls behind on the “other nodes” in the Galera cluster. There is the node where the statement was first issued, and then there are the other nodes which pick up the change through the Galera (wsrep) cluster mechanism. It is on those other nodes that the # lags.  So at that point, different nodes in the cluster have different GTIDs for the same transaction. Not so great.
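
You can see the drift for yourself by asking every node for its current GTID position and comparing the # part; a check could be roughly as basic as this, run on each node:

-- Compare the output across the nodes; with drift, the last number differs.
SELECT @@wsrep_node_name, @@gtid_binlog_pos, @@gtid_current_pos;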

To me, this looked like a giant red flag though: if a GTID is assigned on a commit, and then replicated through the cluster as part of that commit, it can’t change. Not drift, or any other change. So the only possible conclusion must be that it is in fact not passed through the cluster, but “reinvented” by a receiving cluster node, which simply assumes that the current event from a particular server-id is previous-event id + 1.  That assumption is false, because as I mentioned above it’s ok for gaps to exist.  As long as the number keeps going up, it’s fine.

Here is one of the simplest examples of breakage (extract from a binlog, with obfuscated table names):

# at 12533795
#180704 5:00:22 server id 1717 end_log_pos 12533837 CRC32 0x878fe96e GTID 0-1717-1672559880 ddl
/*!100001 SET @@session.gtid_seq_no=1672559880*//*!*/;
# at 12533837
#180704 5:00:22 server id 1717 end_log_pos 12534024 CRC32 0xc6f21314 Query thread_id=4468 exec_time=0 error_code=0
SET TIMESTAMP=1530644422/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
DROP TEMPORARY TABLE IF EXISTS `qqq`.`tmp_foobar` /* generated by server */
/*!*/;

Fact: temporary tables are not replicated (imagine restarting a slave, it wouldn’t have whatever temporary tables were supposed to exist). So, while this event is stored in the binary log (which it is to ensure that if you replay the binlog on a machine, it correctly drops the temporary table after creating and using it), it won’t go through a cluster.  Remember that Galera cluster is essentially a ROW-based replication scheme; if there are changes in non-temporary tables, of course they get replicated just fine.  So if an app creates a temporary table, does some calculations, and then inserts the result of that into a regular table, the data of that last bit will get replicated. As it should. In a nutshell, as far as data consistency goes, we’re all fine.
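
To illustrate with a made-up example (table and column names are invented): the temporary-table work stays local to the node, only the write to the regular table travels through Galera, yet the DROP TEMPORARY TABLE still ends up in the local binlog with a GTID of its own, as in the extract above.

-- Hypothetical app pattern; only the INSERT into the regular InnoDB table
-- is replicated through Galera (as ROW events).
CREATE TEMPORARY TABLE tmp_calc (customer_id INT, total DECIMAL(10,2));
INSERT INTO tmp_calc
  SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
INSERT INTO monthly_totals (customer_id, total)
  SELECT customer_id, total FROM tmp_calc;
-- Logged in the local binlog (with its own GTID) but not replicated:
DROP TEMPORARY TABLE tmp_calc;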

But the fact that we have an event that doesn’t really get replicated creates the “fun” in the “let’s assume the next event is just the previous + 1” logic. This is where the drift comes in. Sigh.

In any case, this issue needs to be fixed properly – let’s say by being re-implemented: the MariaDB GTID needs to be propagated through the Galera cluster, so it’s the same on every server, as it should be. Doing anything else is always going to go wrong somewhere, so trying to catch more cases like the above example is not really the correct way to go.
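
For completeness: MariaDB already exposes settings intended to tie GTIDs to the cluster – wsrep_gtid_mode and wsrep_gtid_domain_id – but, as the MDEV issues above show, at the time of writing they don’t achieve this reliably. Purely for reference, the configuration involved looks roughly like this:

-- In the server configuration (my.cnf) on every node; values illustrative:
--   wsrep_gtid_mode      = ON
--   wsrep_gtid_domain_id = 1
-- At runtime the variables can be inspected with:
SHOW GLOBAL VARIABLES LIKE 'wsrep_gtid%';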

If you are affected by this or related problems, please do vote on the relevant MDEV issues. That is important!  If you need help tracking down problems, feel free to ask.  If you have more information on the matter, please comment too!  I’m sure this and related bugs will be fixed, there are very capable developers at MariaDB Corp and Codership Oy (the Galera company). But the more information we can provide, the better. It often helps with tracking down problems and creating reproducible test cases.


innodb_flush_log_at_trx_commit and Galera Cluster

We deploy Galera Cluster (in MariaDB) for some clients, and innodb_flush_log_at_trx_commit is one of the settings we’ve been playing with. The options according to the manual are listed below, followed by a quick way to check and change the setting at runtime:

  • =0 don’t write or flush at commit, write and flush once per second
  • =1 write and flush at trx commit
  • =2 write the log at trx commit, but only flush once per second
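
The setting is dynamic, so you can check and adjust it on a running server (and put the chosen value in my.cnf for persistence):

SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 0;   -- or 1, or 2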

The flush (fsync) refers to the mechanism the filesystem uses to try and guarantee that written data is actually on the physical medium/device and not just in a buffer (of course cached RAID controllers, SANs and other devices use some different logic there, but it’s definitely written beyond the OS space).

In a non-cluster setup, you’d always want it to be =1 in order to be ACID compliant and that’s also InnoDB’s default. So far so good. For cluster setups, you could be more lenient with this as you require ACID on the cluster as a whole, not each individual machine – after all, if one machine drops out at any point, you don’t lose any data.

Codership’s documentation recommended =2, so that’s what Open Query engineer Peter Lock initially used for some tests he was conducting. However, performance wasn’t particularly shiny – actually not much higher than =1. That in itself is interesting, because typically we regard the number of fsyncs/second a storage system can deal with as a key indicator of its performance capacity. That is, as our HD Latency tool shows when you run it on a storage device (even your local laptop harddisk), the fsyncs appear to be the most prominent factor limiting the number of writes you can do per second.

I then happened to chat with Oli Sennhauser (former colleague from MySQL AB) who now runs the FromDual MySQL/MariaDB consulting firm in Switzerland, and he’s been working with Galera for quite a long time. He recognised the pattern and said that he too had that experience, and he thought =0 might be the better option.

I delved into the InnoDB source code to see what was actually happening, and the code indeed concurs with what’s described in the manual (that hasn’t always been the case ;-). I also verified this with Jeremy Cole whom we may happily regard as guru on “how InnoDB actually works”. The once-per-second flush (and optional preceding write) is performed by the InnoDB master thread. Take a peek in log/log0log.c and trx/trx0trx.c, specifically trx_commit_off_kernel() and srv_sync_log_buffer_in_background().

In conclusion:

  1. Even with =0, the log does get written and flushed once per second. This is done in the background so connection threads don’t have to wait for it.
  2. There is no setting where there is never a flush/fsync.
  3. With =2, the writing of the log takes place in the connection thread and this appears to incur a significant overhead, at least relative to =0. Aside from the writing of the log at transaction commit, there doesn’t appear to be a difference.
  4. Based on the preceding points, I would say that if you don’t want =1, you might as well set =0 in order to get the performance you’re after. There is of course a slight practical difference between =0 and =2. With =2 the log is immediately written. If the mysqld process were to crash within a second after that, the OS would close the file and have that log write stored. With =0 that log data wouldn’t have been written. If the OS or machine fails, that log write is lost either way.

In production environments, we tend to mainly want to mitigate trouble from system failures, so =0 appears to be a suitable/appropriate option – for a Galera cluster environment.

What remains is the question of why the log write operation appears to reduce transaction commit performance so much – seemingly even more so than the flush/fsync. Something to investigate further!
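
A starting point for that investigation is watching the InnoDB log counters under each setting; something along these lines:

-- Sample before and after a test run and compare the deltas
-- across the different innodb_flush_log_at_trx_commit settings.
SHOW GLOBAL STATUS LIKE 'Innodb_log_write_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_log_writes';
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
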
Your thoughts?


Galera pre-deployment check

One of the first things we do when preparing a client’s infrastructure for Galera deployment is see whether their schema is suitable.

  • Avoiding quirks and edge cases, we can say that Galera simply requires all tables to be InnoDB and also have a PRIMARY KEY (obviously having a PK in InnoDB is important anyway, for InnoDB-internal reasons).
  • We want to know about FULLTEXT indexes. With recent InnoDB versions also supporting FULLTEXT we need to check not just whether a table has such an index, but actually which engine it is.
  • Spatial indexes. While both InnoDB and MyISAM can deal with spatial datatypes (POINT, GEOMETRY, etc), only MyISAM has the spatial indexes.

Naturally, checking a schema in the server is more effective than going through other sources and possibly missing bits. On the downside, the only viable way to get this info out of MariaDB is INFORMATION_SCHEMA, but because of the way it’s implemented, queries tend to be slow and resource intensive. So essentially we do need to ask I_S, but do it as efficiently as possible (we’re dealing with production systems). We have multiple separate questions to ask, which normally we’d ask in separate queries, but in the case of I_S that’s really something to avoid. So that’s why it’s all integrated into the single query below, catching every permutation of “not InnoDB”, “lacks primary key”, “has fulltext or spatial index”. We skip the system databases and any VIEWs.

We use the lesser known mysql client command ‘tee’ to output the data into a file, and close it after the query.

We publish the query not as a work of art – I don’t think it’s that pretty! We’d like you to see it because we don’t care for secrets, and also because if there is any way you can reach the same objective using a less resource intensive approach, we’d love to hear about it! This is one of the very few cases where we care only about efficiency, not how pretty the query looks. That said, of course I’d prefer it to be easily readable.

If you regard it purely as a query to be used for Galera, then you can presume it’ll be run on MariaDB 5.5 or later – since 5.3 and above have optimised subqueries, perhaps you can do something with that (a rough sketch of that idea follows below the main query).

If you spot any other flaw or missing bit, please comment on that too. Thanks!

-- snip
tee galeracheck.txt

SELECT DISTINCT
       CONCAT(t.table_schema,'.',t.table_name) as tbl,
       t.engine,
       IF(ISNULL(c.constraint_name),'NOPK','') AS nopk,
       IF(s.index_type = 'FULLTEXT','FULLTEXT','') as ftidx,
       IF(s.index_type = 'SPATIAL','SPATIAL','') as gisidx
  FROM information_schema.tables AS t
  LEFT JOIN information_schema.key_column_usage AS c
    ON (t.table_schema = c.constraint_schema AND t.table_name = c.table_name
        AND c.constraint_name = 'PRIMARY')
  LEFT JOIN information_schema.statistics AS s
    ON (t.table_schema = s.table_schema AND t.table_name = s.table_name
        AND s.index_type IN ('FULLTEXT','SPATIAL'))
  WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
    AND t.table_type = 'BASE TABLE'
    AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL OR s.index_type IN ('FULLTEXT','SPATIAL'))
  ORDER BY t.table_schema,t.table_name;

notee
-- snap
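
As for the subquery idea mentioned above, here’s an untested sketch of how the “no primary key” part alone could be expressed with NOT EXISTS against table_constraints instead of the LEFT JOIN – whether that is actually lighter on INFORMATION_SCHEMA is exactly the kind of feedback we’d welcome.

SELECT CONCAT(t.table_schema,'.',t.table_name) AS tbl, t.engine
  FROM information_schema.tables AS t
 WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
   AND t.table_type = 'BASE TABLE'
   AND NOT EXISTS (SELECT 1
                     FROM information_schema.table_constraints AS tc
                    WHERE tc.table_schema = t.table_schema
                      AND tc.table_name = t.table_name
                      AND tc.constraint_type = 'PRIMARY KEY');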

Credit: the “find tables without a PK” part is based on SQL by Sheeri Cabral and Giuseppe Maxia.


MySQL Cluster on Raspberry Pi

Earlier this week, Andrew Morgan wrote a piece on running MySQL Cluster on Raspberry Pi. Since the term “Cluster” is hideously overloaded, I’ll note that we’re talking about the NDB cluster storage engine here, a very specific architecture originally acquired by MySQL AB from Ericsson (telco).

Raspberry Pi is a new single-board computer based on the ARM processor series (the same stuff that powers most mobile phones these days), and it can run Linux without any fuss. Interfaces include Ethernet, USB, and HDMI video, and the cost is $25-50. I’m looking to use one as the front-end of a MythTV setup (digital video recorder and TV system), since I can just strap the Raspberry Pi to the back of a TV or monitor to do its job.

As Andrew already notes, in practical terms you’re not likely to use Raspberry Pi for a cluster – perhaps for development and certain testing, and it’d be a neat solid state management server. Primarily, it’s “techie cool”.

Knowing the NDB architecture, one of the key issues is that all nodes need to communicate with each other (NxN) so the system is very network intensive, and network latency significantly affects performance. So commonly, a cluster would have at least separate interfaces for direct connections to its siblings (no switch), and possibly Dolphin Interconnect cards to provide a link with much less latency than regular Ethernet offers. And you can’t do either with Raspberry Pi.

However, there are important positive lessons in this setup:

  • Thanks to the open source nature of the software, it can be utilised in a new environment with only minimal tweaks. Not everybody needs to or wants to tweak, but the ability to do so is critical to innovation.
  • Overall, scaling out rather than up makes sense. There are cost, power-efficiency and other factors involved. More, cheap, relatively low-powered systems can deliver a system architecture that would otherwise be unaffordable (and the expensive construct might not scale anyway).
  • Affordable resilience (redundancy).

What if you needed lots of MySQL slaves with a fairly small dataset? Raspberry Pi could well be the solution. Not everybody is “big” or “high performance” in the same way.


LexisNexis open sources code for Hadoop alternative


New Open Query training days in Australia

We’re running the favourite Open Query course modules as well as reworked and brand new ones, with November/December 2009 dates for Brisbane, Sydney, Canberra and Melbourne listed below. You can register for days/modules individually, to suit your time, budget and current needs. Your trainers are Sean, Ray and Arjen (see OQ people).

For the Canberra and Melbourne days (the DBA/HA series), anyone who registers for all of the modules in a series before 15 October will receive a copy of the “High Performance MySQL” book (normal bookstore price is AUD 105).

Canberra

Sydney

Brisbane

  • Thu 19 Nov: MySQL Query Performance Optimisation and Tuning
  • Fri 20 Nov: MySQL Server Performance Optimisation and Tuning

Melbourne
