Posted on

Press Release 2018-09-11: Open Query acquired by Catalyst IT Australia Pty Limited

We are pleased to announce that Open Query, Queensland-based provider of MySQL, MariaDB and related services which just celebrated its 11-th anniversary, has been acquired by Catalyst IT Australia.

Founded in New Zealand in 1997, Catalyst is an experienced and respected Open Source integrator.  Catalyst is looking forward to the opportunity to work with the current Open Query clients as well as with new prospects. Catalyst offers a broad suite of Enterprise services, including support and custom development for Drupal CMS, SilverStripe CMS, Moodle LMS, Samba and other software, as well as fully managed hosting on AWS and other platforms.

“Catalyst’s core values are very much aligned with those of Open Query, which is why we are particularly pleased with this outcome”, notes Arjen Lentz, Founder and Exec.Director of Open Query.

Catalyst IT Australia has offices in Sydney, Melbourne and Brisbane.

Contacts

For Open Query Pty Ltd

Arjen Lentz, Exec.Director
https://openquery.com.au

For Catalyst IT Australia Pty Limited

Andrew Boag, Managing Director
https://www.catalyst-au.net/
Phone (02) 8203 9777

Posted on
Posted on

My oldest still-open MySQL bug

a bugThis morning I received an update on a MySQL bug, someone added a comment on an issue I filed in November 2003, originally for MySQL 4.0.12 and 4.1.0. It’s MySQL bug#1956 (note the very low number!), “Parser allows naming of PRIMARY KEY”:

[25 Nov 2003 16:23] Arjen Lentz
Description:
When specifying a PRIMARY KEY, the parser still accepts a name for the index even though the MySQL server will always name it PRIMARY.
So, the parser should NOT accept this, otherwise a user can get into a confusing situation as his input is ignored/changed.

How to repeat:
CREATE TABLE pk (i INT NOT NULL);
ALTER TABLE pk ADD PRIMARY KEY bla (i);

'bla' after PRIMARY KEY should not be accepted.

Suggested fix:
Fix grammar in parser.

Most likely we found it during a MySQL training session, training days have always been a great bug catching source as students would be encouraged to try lots of quirky things and explore.

It’s not a critical issue, but one from the “era of sloppiness” as I think we may now call it, with the benefit of hindsight.  At the time, it was just regarded as lenient: the parser would silently ignore things it couldn’t process in a particular context.  Like creating a foreign key on a table using an engine that didn’t support foreign keys would see the foreign keys silently disappear, rather than reporting an error.  Many  if not most of those quirks have been cleaned up over the years.  Some are very difficult to get rid of, as people use them in code and thus essentially we’re stuck with old bad behaviour as otherwise we’d break to many applications.  I think that one neat example of that is auto-casting:

SELECT "123 apples" + 1;
+------------------+
| "123 apples" + 1 |
+------------------+
|              124 |
+------------------+
1 row in set, 1 warning (0.000 sec)

SHOW WARNINGS;
+---------+------+-----------------------------------------------+
| Level   | Code | Message                                       |
+---------+------+-----------------------------------------------+
| Warning | 1292 | Truncated incorrect DOUBLE value: '123 hello' |
+---------+------+-----------------------------------------------+

At least that one chucks a warning now, which is good as it allows developers to catch mistakes and thus prevent trouble.  A lenient parser (or grammar) can be convenient, but it tends to also enable application developers to be sloppy, which is not really a good thing.  Something that creates a warning may be harmless, or an indication of a nasty mistake: the parser can’t tell.  So it’s best to ensure that code is clean, free even of warnings.

Going back to the named PRIMARY KEY issue… effectively, a PRIMARY KEY has the name ‘PRIMARY’ internally.  So it can’t really have another name, and the parser should not accept an attempt to name it.  The name silently disappears so when you check back in SHOW CREATE TABLE or SHOW INDEXES, you won’t ever see it.
CREATE TABLE pkx (i INT PRIMARY KEY x) reports a syntax error. So far so good.
But, CREATE TABLE pkx (i INT, PRIMARY KEY x (i)) is accepted, as is ALTER TABLE pkx ADD PRIMARY KEY x (i).

MySQL 8.0 still allows these constructs, as does MariaDB 10.3.9 !

The bug is somewhere in the parser, so that’s sql/sql_yacc.yy.  I’ve tweaked and added things in there before, but it’s been a while and particularly in parser grammar you have to be very careful that changes don’t have other side-effects.  I do think it’s worthwhile fixing even these minor issues.

Posted on
Posted on

MariaDB Galera cluster and GTID

In MariaDB 10.2.12, these two don’t yet work together. GTID = Global Transaction ID.  In the master-slave asynchronous replication realm, this means that you can reconnect a slave to another server (change its master) and it’ll happily continue replicating from the correct point.  No more fussing with filenames and offsets (which of course will both differ on different machines).

So in concept the GTIID is “globally” unique – that means it’s consistent across an entire infra: a binlogged write transaction will have the same GTID no matter on which machine you look at it.

  • OK: if you are transitioning from async replication to Galera cluster, and have a cluster as slave of the old infra, then GTID will work fine.
  • PROBLEM: if you want to run an async slave in a Galera cluster, GTID will currently not work. At least not reliably.

The overview issue is MDEV-10715, the specific problem is documented in MDEV-14153 with some comments from me from late last week. MDEV-14153 documents cases where the GTID is not in fact consistent – and the way in which it isn’t is most disturbing.

The issue appears as “drift”. A GTID is made up of R-S-# where R is replication domain (0 unless set by an app), S for server-id where the write was originally done, and # which is just a number. The only required rule for the # is that that each next event has to have a higher number than the previous.  In principle there could be #s missing, that’s ok.

In certain scenarios, the # part of the GTID falls behind on the “other nodes” in the Galera cluster. There was the node where the statement was first issued, and then there are the other nodes which pick up the change through the Galera (wsrep) cluster mechanism. Those other nodes.  So at that point, different nodes in the cluster have different GTIDs for the same query. Not so great.

To me, this looked like a giant red flag though: if a GTID is assigned on a commit, and then replicated through the cluster as part of that commit, it can’t change. Not drift, or any other change. So the only possible conclusion must be that it is in fact not passed through the cluster, but “reinvented” by a receiving cluster node, which simply assumes that the current event from a particular server-id is previous-event id + 1.  That assumption is false, because as I mentioned above it’s ok for gaps to exist.  As long as the number keeps going up, it’s fine.

Here is one of the simplest examples of breakage (extract from a binlog, with obfuscated table names):

# at 12533795
#180704 5:00:22 server id 1717 end_log_pos 12533837 CRC32 0x878fe96e GTID 0-1717-1672559880 ddl
/*!100001 SET @@session.gtid_seq_no=1672559880*//*!*/;
# at 12533837
#180704 5:00:22 server id 1717 end_log_pos 12534024 CRC32 0xc6f21314 Query thread_id=4468 exec_time=0 error_code=0
SET TIMESTAMP=1530644422/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
DROP TEMPORARY TABLE IF EXISTS `qqq`.`tmp_foobar` /* generated by server */
/*!*/;

Fact: temporary tables are not replicated (imagine restarting a slave, it wouldn’t have whatever temporary tables were supposed to exist). So, while this event is stored in the binary log (which it is to ensure that if you replay the binlog on a machine, it correctly drops the temporary table after creating and using it), it won’t go through a cluster.  Remember that Galera cluster is essentially a ROW-based replication scheme; if there are changes in non-temporary tables, of course they get replicated just fine.  So if an app creates a temporary table, does some calculations, and then inserts the result of that into a regular table, the data of that last bit will get replicated. As it should. In a nutshell, as far as data consistency goes, we’re all fine.

But the fact that we have an event that doesn’t really get replicated creates the “fun” in the “let’s assume the next event is just the previous + 1” logic. This is where the drift comes in. Sigh.

In any case, this issue needs to be fixed by let’s say “being re-implemented”: the MariaDB GTID needs to be propagated through the Galera cluster, so it’s the same on every server, as it should be. Doing anything else is always going to go wrong somewhere, so trying to catch more cases like the above example is not really the correct way to go.

If you are affected by this or related problems, please do vote on the relevant MDEV issues. That is important!  If you need help tracking down problems, feel free to ask.  If you have more information on the matter, please comment too!  I’m sure this and related bugs will be fixed, there are very capable developers at MariaDB Corp and Codership Oy (the Galera company). But the more information we can provide, the better. It often helps with tracking down problems and creating reproducible test cases.

Posted on
Posted on

PURGE BINARY LOGS with a relative time

Sometimes you want to reduce disk usage on certain servers by adjusting the time that binary logs are kept.  Also, some installations of MySQL and MariaDB have suffered from a very-hard-to-catch bug where the binary logs end up not getting automatically expired (basically, the expire_logs_days option doesn’t always work effectively).

A workaround can be scripted, but typically the script would specify the exact datetime to which the logs need to be kept.  The reference manual and examples all do this too, quite explicitly, noting:

The datetime expression is in the format ‘YYYY-MM-DD hh:mm:ss’.

However, the actual command syntax is phrased as follows:

PURGE { BINARY | MASTER } LOGS { TO ‘log_name’ | BEFORE datetime_expr }

and that indicates much more flexibility in the parser: “datetime_expr” means that you can put in an arbitrary temporal expression!

So let’s test that, with a functional equivalent of expire_logs_days=14:

FLUSH BINARY LOGS;
PURGE BINARY LOGS BEFORE (NOW() – INTERVAL 14 DAY);

And yep, that works (and the extra parenthesis around the expression are not required, I just did that to show clearly what the expression is).

Now, I’m not the first person to use this construct, there are several posts online from recent years that use an expression with PURGE BINARY LOGS.  I’m not sure whether allowing datetime_expr is a modification that was made in the parser at some point, or whether it was always possible.  Fact is, the reference manual text (MariaDB as well as MySQL) only provide examples with an absolute ISO datetime: ‘YYYY-MM-DD HH:MM:SS’.

Posted on
Posted on

RDS Aurora MySQL Cost

big bag of money getting carriedI promised to do a pricing post on the Amazon RDS Aurora MySQL pricing, so here we go.  All pricing is noted in USD (we’ll explain why)

We compared pricing of equivalent EC2+EBS server instances, and verified our calculation model with Amazon’s own calculator and examples.  We use the pricing for Australia (Sydney data centre). Following are the relevant Amazon pricing pages from which we took the pricing numbers, formulae, and calculation examples:

Base Pricing Details

Specs         EC2     RDS Aurora MySQL  
instance type vCPU ECU GB RAM   Storage Linux/hr   instance type Price/hr
r4.large 2 7 15.25 EBS Only $0.160 db.r4.large $0.350
r4.xlarge 4 13.5 30.5 EBS Only $0.319 db.r4.xlarge $0.700
r4.2xlarge 8 27 61 EBS Only $0.638 db.r4.2xlarge $1.400
r4.4xlarge 16 53 122 EBS Only $1.277 db.r4.4xlarge $2.800
r4.8xlarge 32 99 244 EBS Only $2.554 db.r4.8xlarge $5.600
r4.16xlarge 64 195 488 EBS Only $5.107 db.r4.16xlarge $11.200

That’s not all we need, because both EBS and Aurora have some additional costs we need to factor in.

EBS pricing components (EBS Provisioned IOPS SSD (io1) volume)

“Volume storage for EBS Provisioned IOPS SSD (io1) volumes is charged by the amount you provision in GB per month until you release the storage. With Provisioned IOPS SSD (io1) volumes, you are also charged by the amount you provision in IOPS (input/output operations per second) per month. Provisioned storage and provisioned IOPS for io1 volumes will be billed in per-second increments, with a 60 second minimum.”

  • Storage Rate $0.138 /GB/month of provisioned storage
    “For example, let’s say that you provision a 2000 GB volume for 12 hours (43,200 seconds) in a 30 day month. In a region that charges $0.125 per GB-month, you would be charged $4.167 for the volume ($0.125 per GB-month * 2000 GB * 43,200 seconds / (86,400 seconds/day * 30 day-month)).”
  • I/O Rate $0.072 /provisioned IOPS-month
    “Additionally, you provision 1000 IOPS for your volume. In a region that charges $0.065 per provisioned IOPS-month, you would be charged $1.083 for the IOPS that you provisioned ($0.065 per provisioned IOPS-month * 1000 IOPS provisioned * 43,200 seconds /(86,400 seconds /day * 30 day-month)).”

Other Aurora pricing components

  • Storage Rate $0.110 /GB/month
    (No price calculation examples given for Aurora storage and I/O)
  • I/O Rate $0.220 /1 million requests
    (Presuming IOPS equivalence / Aurora ratio noted from arch talk)

So this provides us with a common base, instance types that are equivalent between Aurora and EC2.  All other Aurora instances types are different, so it’s not possible to do a direct comparison in those cases.  Presumably we can make the assumption that the pricing ratio will similar for equivalent specs.

On Demand vs Reserved Instances

We realise we’re calculating on the basis of On Demand pricing.  But we’re comparing pricing within AWS space, so presumably the savings for Reserved Instances are in a similar ballpark.

Other factors

  • We have 720 hours in a 30 day month, which is 2592000 seconds.
  • 70% read/write ratio – 70% reads (used to calculate the effective Aurora IOPS)
  • 10% read cache miss -10% cache miss rate on reads
  • Aurora I/O ratio: 3 (Aurora requiring 2 IOPS for a commit vs 6 in MySQL – even though this is a pile of extreme hogwash in terms of that pessimistic MySQL baseline)

We also spotted this note regarding cross-AZ Aurora traffic:

“Amazon RDS DB Instances inside VPC: For data transferred between an Amazon EC2 instance and Amazon RDS DB Instance in different Availability Zones of the same Region, Amazon EC2 Regional Data Transfer charges apply on both sides of transfer.”

So this would apply to application DB queries issued across an AZ boundary, which would commonly happen during failover scenarios.  In fact, we know that this happens during regular operations with some EC2 setups, because the loadbalancing already goes cross-AZ.  So that costs extra also.  Now you know!  (note: we did not factor this in to our calculations.)

Calculation Divergence

Our model comes up with identical outcomes for the examples Amazon provided, however it comes up 10-15% lower than Amazon’s calculator for specific Aurora configurations.  We presume that the difference may lie in the calculated Aurora I/O rate, as that’s the only real “unknown” in the model.  Amazon’s calculator does not show what formulae it uses for the sub-components, nor sub-totals, and we didn’t bother to tweak until we got at the same result.

It’s curious though, as the the architecture talk makes specific claims about Aurora’s I/O efficiency (which presume optimal Aurora situation and a dismal MySQL reference setup, something which I already raised in our initial Aurora post).  So apparently the Amazon calculator assumes worse I/O performance than the technical architecture talk!

Anyhow, let’s just say our costing is conservative, as the actual cost is higher on the Aurora end.

Scenarios

Here we compare with say a MySQL/MariaDB Galera setup across 3 AZs running on EC2+EBS.  While this should be similar in overall availability and read-capacity, note that

  1. you can write to all nodes in a Galera cluster, whereas Aurora currently has a single writer/master;
  2. Galera doesn’t require failover changes as all its nodes are technically writers anyhow, whereas Aurora failover causes a cluster outage of at least 30 seconds.
Servers R/Zones Instance GB DB I/O rate     EC2 EBS     Aurora      
          Read IOPS   Instances Storage I/O EC2 Total   Instances Storage I/O Aurora Total
3 3 r4.xlarge 250 2,000 740 $689 $104 $160 $952   $1,512 $83 $141 $1,735
6 3 r4.xlarge 250 2,000 740 $1,378 $207 $320 $1,905   $3,024 $83 $141 $3,247
     

When using the Amazon calculator, Aurora comes out at about double the EC2.  But don’t take our word for it, do try this for yourself.

Currency Consequences

While pricing figures are distinct per country that Amazon operates in, the charges are always in USD.  So this means that the indicated pricing is, in the end, in USD, and thus subject to currency fluctuations (if your default currency is not USD).  What does this mean?

USD AUD rate chart 2008-2018
USD-AUD rate chart 2008-2018, from xe.com

So USD 1,000 can cost as little as AUD 906 or as much as AUD 1,653, at different times over the last 10 years.  That’s quite a range!

Conclusion

As shown above, our calculation with Aurora MySQL shows it costing about twice as much.  This is based on a reference MySQL/MariaDB+Galera with roughly the same scaling and resilience profile (e.g. the ability to survive DC outages).  In functional terms, particularly with Aurora’s 30+second outage profile during failover, Galera comes out on top at half the cost.

So when is Aurora cheaper, as claimed by Amazon?

Amazon makes claims in the realm of “1/10th the cost”. Well, that may well be the case when comparing with the TCO of Oracle or MS SQL Server, and it’s fairly typical when comparing a proprietary system with an Open Source based one (mind again that Aurora is not actually Open Source as Amazon does not make their source code available, but it’s based on MySQL).

The only other way we see is to seriously compromise on the availability (resilience).  In our second sample calculation, we use 2 instances per AZ.  This is not primarily for performance, but so that application servers in an AZ don’t have to do cross-DC queries when one instance fails.  In the case of Aurora, spinning up a new instance on the same dataset requires 15 minutes.  So, do you want to take that hit?  If so, you can save money there.  If not, it’s still costly.

But hang on, if you’re willing to make the compromise on availability, you could reduce the Galera setup also, to only one instance per AZ.  Yep!

So, no matter how you tweak it, Aurora is about twice the cost, with (in our opinion) a less interesting failover profile.

The Price of RDS Convenience

What you get with RDS/Aurora is the promise of convenience, and that’s what you pay for.  But, mind that our comparison worked all within AWS space anyway, the EC2 instances we used for MySQL/MariaDB+Galera already use the same basic infrastructure, dashboard and management API as well.  So you pay double just to go to RDS/Aurora, relative to building on EC2.

To us, that cost seems high.  If you spend some, or even all that money on engineering that convenience around your particular setup, and even outsource that task and its maintenance, you get a nicer setup at the same or a lower cost.  And last but not least, that cost will be more predictable – most likely the extra work will be charged in your own currency, too.

Cost Predictability and Budget

You can do a reasonable ball-park calculation of AWS EC2 instances that are always active, but EBS already has some I/O charges which make the actual cost rather more variable, and Aurora adds a few more variables on top of that.  I’m still amazed that companies go for this, even though they traditionally prefer a known fixed cost (even if higher) over a variable cost.  Choosing the variable cost breaks with some fundamental business rules, for the sake of some convenience.

The advantage of known fixed costs is that you can budget properly, as well as project future costs based on growth and other business factors.  Purposefully ditching that realm, while exposing yourself to currency fluctuations at the same time, seems most curious.  How do companies work this into their budgets?  Because others do so?  Well, following the neighbours is not always a good idea.  In this case, it might be costly as well as financially risky.

Posted on
Posted on

RDS Aurora MySQL and Service Interruptions

In Amazon space, any EC2 or Service instance can “disappear” at any time.  Depending on which service is affected, the service will be automatically restarted.  In EC2 you can choose whether an interrupted instance will be restarted, or left shutdown.

For an Aurora instance, an interrupted instance is always restarted. Makes sense.

The restart timing, and other consequences during the process, are noted in our post on Aurora Failovers.

Aurora Testing Limitations

As mentioned earlier, we love testing “uncontrolled” failovers.  That is, we want to be able to pull any plug on any service, and see that the environment as a whole continues to do its job.  We can’t do that with Aurora, because we can’t control the essentials:

  • power button;
  • reset switch;
  • ability to kill processes on a server;
  • and the ability to change firewall settings.

In Aurora, an instance is either running, or will (again) be running shortly.  So that we know.  Aurora MySQL also offers some commands that simulate various failure scenarios, but since they are built-in we can presume that those scenarios are both very well tested, as well as covered by the automation around the environment.  Those clearly defined cases are exactly the situations we’re not interested in.

What if, for instance, a server accepts new connections but is otherwise unresponsive?  We’ve seen MySQL do this on occasion.  Does Aurora catch this?  We don’t know and  we have no way of testing that, or many other possible problem scenarios.  That irks.

The Need to Know

If an automated system is able to catch a situation, that’s great.  But if your environment can end up in a state such as described above and the automated systems don’t catch and handle it, you could be dead in the water for an undefined amount of time.  If you have scripts to catch cases such as these, but the automated systems catch them as well, you want to be sure that you don’t trigger “double failovers” or otherwise interfere with a failover-in-progress.  So either way, you need to know and and be aware whether a situation is caught and handled, and be able to test specific scenarios.

In summary: when you know the facts, then you can assess the risk in relation to your particular needs, and mitigate where and as desired.

A corporate guarantee of “everything is handled and it’ll be fine” (or as we say in Australia “She’ll be right, mate!“) is wholly unsatisfactory for this type of risk analysis and mitigation exercise.  Guarantees and promises, and even legal documents, don’t keep environments online.  Consequently, promises and legalities don’t keep a company alive.

So what does?  In this case, engineers.  But to be able to do their job, engineers need to know what parameters they’re working with, and have the ability to test any unknowns.  Unfortunately Aurora is, also in this respect, a black box.  You have to trust, and can’t comprehensively verify.  Sigh.

Posted on
Posted on

RDS Aurora MySQL Failover

Right now Aurora only allows a single master, with up to 15 read-only replicas.

Master/Replica Failover

We love testing failure scenarios, however our options for such tests with Aurora are limited (we might get back to that later).  Anyhow, we told the system, through the RDS Aurora dashboard, to do a failover. These were our observations:

Role Change Method

Both master and replica instances are actually restarted (the MySQL uptime resets to 0).

This is quite unusual these days, we can do a fully controlled role change in classic asynchronous replication without a restart (CHANGE MASTER TO …), and Galera doesn’t have read/write roles as such (all instances are technically writers) so it doesn’t need role changes at all.

Failover Timing

Failover between running instances takes about 30 seconds.  This is in line with information provided in the Aurora FAQ.

Failover where a new instance needs to be spun up takes 15 minutes according to the FAQ (similar to creating a new instance from the dash).

Instance Availability

During a failover operation, we observed that all connections to the (old) master, and the replica that is going to be promoted, are first dropped, then refused (the connection refusals will be during the period that the mysqld process is restarting).

According to the FAQ, reads to all replicas are interrupted during failover.  Don’t know why.

Aurora can deliver a DNS CNAME for your writer instance. In a controlled environment like Amazon, with guaranteed short TTL, this should work ok and be updated within the 30 seconds that the shortest possible failover scenario takes.  We didn’t test with the CNAME directly as we explicitly wanted to observe the “raw” failover time of the instances themselves, and the behaviour surrounding that process.

Caching State

On the promoted replica, the buffer pool is saved and loaded (warmed up) on the restart; good!  Note that this is not special, it’s desired and expected to happen: MySQL and MariaDB have had InnoDB buffer pool save/restore for years.  Credit: Jeremy Cole initially came up with the buffer pool save/restore idea.

On the old master (new replica/slave), the buffer pool is left cold (empty).  Don’t know why.  This was a controlled failover from a functional master.

Because of the server restart, other caches are of course cleared also.  I’m not too fussed about the query cache (although, deprecated as it is, it’s currently still commonly used), but losing connections is a nuisance. More detail on that later in this article.

Statistics

Because of the instance restarts, the running statistics (SHOW GLOBAL STATUS) are all reset to 0. This is annoying, but should not affect proper external stats gathering, other than for uptime.

On any replica, SHOW ENGINE INNODB STATUS comes up empty. Always.  This seems like obscurity to me, I don’t see a technical reason to not show it.  I suppose that with a replica being purely read-only, most running info is already available through SHOW GLOBAL STATUS LIKE ‘innodb%’, and you won’t get deadlocks on a read-only slave.

Multi-Master

Aurora MySQL multi-master was announced at Amazon re:Invent 2017, and appears to currently be in restricted beta test.  No date has been announced for general availability.

We’ll have to review it when it’s available, and see how it works in practice.

Conclusion

Requiring 30 seconds or more for a failover is unfortunate, this is much slower than other MySQL replication (writes can failover within a few seconds, and reads are not interrupted) and Galera cluster environments (which essentially delivers continuity across instance failures – clients talking to the failed instance will need to reconnect to the loadbalancer/cluster to continue).

I don’t understand why the old master gets a cold InnoDB buffer pool.

I wouldn’t think a complete server restart should be necessary, but since we don’t have insight in the internals, who knows.

On Killing Connections (through the restart)

Losing connections across an Aurora cluster is a real nuisance that really impacts applications.  Here’s why:

When MySQL C client library (which most MySQL APIs either use or are modelled on) is disconnected, it passes back a specific error to the application.  When the application makes its next query call, the C client will automatically reconnect first (so the client does not have to explicitly reconnect).  So a client only needs to catch the error and re-issue its last command, and all will generally be fine.  Of course, if it relies on different SESSION settings, or was in the middle of a multi-statement transaction, it will need to do a bit more.

So, this means that the application has to handle disconnects gracefully without chucking hissy-fits at users, and I know for a fact that that’s not how many (most?) applications are written.  Consequently, an Aurora failover will make the frontend of most applications look like a disaster zone for about 30 seconds (provided functional instances are available for the failover, which is the preferred and best case scenario).

I appreciate that this is not directly Aurora’s fault, it’s sloppy application development that causes this, but it’s a real-world fact we have to deal with.  And, perhaps importantly: other cluster and replication options do not trigger this scenario.

Posted on
Posted on

Exploring Amazon RDS Aurora: replica writes and cache chilling

Our clients operate on a variety of platforms, and RDS (Amazon Relational Database Service) Aurora has received quite a bit of attention in recent times. On behalf of our clients, we look beyond the marketing, and see what the technical architecture actually delivers.  We will address specific topics in individual posts, this time checking out what the Aurora architecture means for write and caching behaviour (and thus performance).

What is RDS Aurora?

First of all, let’s declare the baseline.  MySQL Aurora is not a completely new RDBMS. It comprises a set of Amazon modifications on top of stock Oracle MySQL 5.6 and 5.7, implementing a different replication mechanism and some other changes/additions.  While we have some information (for instance from the “deep dive” by AWS VP Anurag Gupta), the source code of the Aurora modifications are not published, so unfortunately it is not immediately clear how things are implemented.  Any architecture requires choices to be made, trade-offs, and naturally these have consequences.  Because we don’t get to look inside the “black box” directly, we need to explore indirectly.  We know how stock MySQL is architected, so by observing Aurora’s behaviour we can try to derive how it is different and what it might be doing.  Mind that this is equivalent to looking at a distant star, seeing a wobble, and deducing from the pattern that there must be one or more planets orbiting.  It’s an educated guess.

For the sake of brevity, I have to skip past some aspects that can be regarded as “obvious” to someone with insight into MySQL’s architecture.  I might also defer explaining a particular issue in depth to a dedicated post on that topic.  Nevertheless, please do feel free to ask “so why does this work in this way”, or other similar questions – that’ll help me check my logic trail and tune to the reader audience, as well as help create a clearer picture of the Aurora architecture.

Instead of using the binary log, Aurora replication ties into the storage layer.  It only supports InnoDB, and instead of doing disk reads/writes, the InnoDB I/O system talks to an Amazon storage API which delivers a shared/distributed storage, which can work across multiple availability zones (AZs).  Thus, a write on the master will appear on the storage system (which may or may not really be a filesystem).  Communication between AZs is fairly fast (only 2-3 ms extra overhead, relative to another server in the same AZ) so clustering databases or filesystems across AZs is entirely feasible, depending on the commit mechanism (a two-phase commit architecture would still be relatively slow).  We do multi-AZ clustering with Galera Cluster (Percona XtraDB Cluster or MariaDB Galera Cluster).  Going multi-AZ is a good idea that provides resilience beyond a single data centre.

So, imagine an individual instance in an Aurora setup as an EC2 (Amazon Elastic Computing) instance with MySQL using an SSD EBS (Amazon Elastic Block Storage) volume, where the InnoDB I/O threads interface more directly the the EBS API.  The actual architecture might be slightly different still (more on that in a later post), but this rough description helps set up a basic idea of what a node might look like.

Writes in MySQL

In a regular MySQL, on commit a few things happen:

  • the InnoDB log is written to and flushed,
  • the binary log is written to (and possibly flushed), and
  • the changed pages (data and indexes)  in the InnoDB buffer pool are marked dirty, so a background thread knows they need to be written back to disk (this does not need to happen immediately).  When a page is written to disk, normally it uses a “double-write” mechanism where first the original page is read and written to a scratch space, and then the new page is put in the original position.  Depending on the filesystem and underlying storage (spinning disk, or other storage with different block size from InnoDB page size) this may be required to be able to recover from write fails.

This does not translate in to as many IOPS because in practice, transaction commits are put together (for instance with MariaDB’s group commit) and thus many commits that happen in a short space effectively only use a few IOs for their log writes.  With Galera cluster, the local logs are written but not flushed, because the guaranteed durability is provided with other nodes in the cluster rather than local persistence of the logfile.

In Aurora, a commit has to send either the InnoDB log entries or the changed data pages to the storage layer; which one it is doesn’t particularly matter.  The storage layer has a “quorum set” mechanism to ensure that multiple nodes accept the new data.  This is similar to Galera’s “certification” mechanism that provides the “virtual synchrony”.  The Aurora “deep dive” talk claims that it requires many fewer IOPS for a commit; however, it appears they are comparing a worst-case plain MySQL scenario with an optimal Aurora environment.  Very marketing.

Aurora does not use the binary log, which does make one wonder about point-in-time recovery options. Of course, it is possible to recover to any point-in-time from an InnoDB snapshot + InnoDB transaction logs – this would require adding timestamps to the InnoDB transaction log format.

While it is noted that the InnoDB transaction log is also backed up to S3, it doesn’t appear to be used directly (so, only for recovery purposes then).  After all, any changed page needs to be communicated to the other instances, so essentially all pages are always flushed (no dirty pages).  When we look at the InnoDB stats GLOBAL STATUS, we sometimes do see up to a couple of dozen dirty pages with Aurora, but their existence or non-existence doesn’t appear to have any correlation with user-created tables and data.

Where InnoDB gets its Speed

InnoDB rows and indexing
InnoDB rows and indexing

We all know that disk-access is slow.  In order for InnoDB to be fast, it is dependent on most active data being in the buffer pool.  InnoDB does not care for local filesystem buffers – something is either in persistent storage, or in the buffer pool.  In configurations, we prefer direct I/O so the system calls that do the filesystem I/O bypass the filesystem buffers and any related overhead.  When a query is executed, any required page that’s not yet in the buffer pool is requested to be loaded in the background. Naturally, this does slow down queries, which is why we preferably want all necessary pages to already be in memory.  This applies for any type of query.  In InnoDB, all data/indexes are structured in B+trees, so an INSERT has to be merged into a page and possibly causes pages to be split and other items shuffled so as to “re-balance” the tree.  Similarly, a delete may cause page merges and a re-balancing operation.  This way the depth of the tree is controlled, so that even for a billion rows you would generally see a depth of no more than 6-8 pages.  That is, retrieving any row would only require a maximum of 6-8 page reads (potentially from disk).

I’m telling you all this, because while most replication and clustering mechanisms essentially work with the buffer pool, Aurora replication appears to works against it.  As I mentioned: choices have consequences (trade-offs).  So, what happens?

Aurora Replication

When you do a write in MySQL which gets replicated through classic asynchronous replication, the slaves or replica nodes affect the row changes in memory.  This means that all the data (which is stored with the PRIMARY KEY, in InnoDB) as well as any other indexes are updated, the InnoDB log is written, and the pages marked as dirty.  It’s very similar to what happens on the writer/master system, and thus the end result in memory is virtually identical.  While Galera’s cluster replication operates differently from the asynchronous mechanism shown in the diagram, the resulting caching (which pages are in memory) ends up similar.

MySQL Replication architecture
MySQL Replication architecture

Not so with Aurora.  Aurora replicates in the storage layer, so all pages are updated in the storage system but not in the in-memory InnoDB buffer pool.  A secondary notification system between the instances ensures that cached InnoDB pages are invalidated.  When you next do a query that needs any of those no-longer-valid cached pages, they will have to be be re-read from the storage system.  You can see a representation of this in the diagram below, indicating invalidated cache pages in different indexes; as shown, for INSERT operations, you’re likely to have pages higher up in the tree and one sideways page change as well because of the B+tree-rebalancing.

Aurora replicated insert
Aurora replicated insert

The Chilling Effect

We can tell the replica is reading from storage, because the same query is much slower than before we did the insert from the master instance.  Note: this wasn’t a matter of timing. Even if we waited slightly longer (to enable a possible background thread to refresh the pages) the post-insert query was just as slow.

Interestingly, the invalidation process does not actually remove them from the buffer pool (that is, the # of pages in the buffer pool does not go down); however, the # of page reads does not go up either when the page is clearly re-read.    Remember though that a status variable is just that, it has to be updated to be visible and it simply means that the new functions Amazon implemented don’t bother updating these status variables.  Accidental omission or purposeful obscurity?  Can’t say.  I will say that it’s very annoying when server statistics don’t reflect what’s actually going on, as it makes the stats (and their analysis) meaningless.  In this case, the picture looks better than it is.

With each Aurora write (insert/update/delete), the in-memory buffer pool on replicas is “chilled”.

Unfortunately, it’s not even just the one query on the replica that gets affected after a write. The primary key as well as the secondary indexes get chilled. If the initial query uses one particular secondary index, that index and the primary key will get warmed up again (at the cost of multiple storage system read operations), however the other secondary indexes are still chattering their teeth.

Being Fast on the Web

In web applications (whether websites or web-services for mobile apps), typically the most recently added data is the most likely to be read again soon.  This is why InnoDB’s buffer pool is normally very effective: frequently accessed pages remain in memory, while lesser used ones “age” and eventually get tossed out to make way for new pages.

Having caches clear due to a write, slows things down.  In the MySQL space, the fairly simply query cache is a good example.  Whenever you write to table A, any cached SELECTs that accesses table A are cleared out of the cache.  Regardless of whether the application is read-intensive, having regular writes makes the query cache useless and we turn it off in those cases.  Oracle has already deprecated the “good old” query cache (which was introduced in MySQL 4.0 in the early 2000s) and soon its code will be completely removed.

Conclusion

With InnoDB, you’d generally have an AUTO_INCREMENT PRIMARY KEY, and thus newly inserted rows are sequenced to that outer end of the B+Tree.  This also means that the next inserted row often ends up in the same page, again invalidating that recently written page on the replicas and slowing down reads of any of the rows it contained.

For secondary indexes, the effect is obviously scattered although if the indexed column is temporal (time-based), it will be similarly affected to the PRIMARY KEY.

How much all of this slows things down will very much depend on your application DB access profile.  The read/write ratio will matter little, but rather whether individual tables are written to fairly frequently.  If they do, SELECT queries on those tables made on replicas will suffer from the chill.

Aurora uses SSD EBS so of course the storage access is pretty fast.  However, memory is always faster, and we know that that’s important for web application performance.  And we can use similarly fast SSD storage on EC2 or another hosting provider, with mature scaling technologies such as Galera (or even regular asynchronous multi-threaded replication) that don’t give your caches the chills.

Posted on
Posted on 2 Comments

TEXT and VARCHAR inefficiencies in your db schema

The TEXT and VARCHAR definitions in many db schemas are based on old information – that is, they appear to be presuming restrictions and behaviour from MySQL versions long ago. This has consequences for performance. To us, use of for instance VARCHAR(255) is a key indicator for this. Yep, an anti-pattern.

VARCHAR

In MySQL 4.0, VARCHAR used to be restricted to 255 max. In MySQL 4.1 character sets such as UTF8 were introduced and MySQL 5.1 supports VARCHARs up to 64K-1 in byte length. Thus, any occurrence of VARCHAR(255) indicates some old style logic that needs to be reviewed.

Why not just set the maximum length possible? Well…

A VARCHAR is subject to the character set it’s in, for UTF8 this means either 3 or 4 (utf8mb4) bytes per character can be used. So if one specifies VARCHAR(50) CHARSET utf8mb4, the actual byte length of the stored string can be up to 200 bytes. In stored row format, MySQL uses 1 byte for VARCHAR length when possible (depending on the column definition), and up to 2 bytes if necessary. So, specifying VARCHAR(255) unnecessarily means that the server has to use a 2 byte length in the stored row.

This may be viewed as nitpicking, however storage efficiency affects the number of rows that can fit on a data page and thus the amount of I/O required to manage a certain amount of rows. It all adds up, so having little unnecessary inefficiencies will cost – particularly for larger sites.

VARCHAR best practice

Best practice is to set VARCHAR to the maximum necessary, not the maximum possible – otherwise, as per the above, the maximum possible is about 16000 for utf8mb4, not 255 – and nobody would propose setting it to 16000, would they? But it’s not much different, in stored row space a VARCHAR(255) requires a 2 byte length indicator just like VARCHAR(16000) would.

So please review VARCHAR columns and set their definition to the maximum actually necessary, this is very unlikely to come out as 255. If 255, why not 300? Or rather 200? Or 60? Setting a proper number indicates that thought and data analysis has gone into the design. 255 looks sloppy.

TEXT

TEXT (and LONGTEXT) columns are handled different in MySQL/MariaDB. First, a recap of some facts related to TEXT columns.

The db server often needs to create a temporary table while processing a query. MEMORY tables cannot contain TEXT type columns, thus the temporary table created will be a disk-based one. Admittedly this will likely remain in the disk cache and never actually touch a disk, however it goes through file I/O functions and thus causes overhead – unnecessarily. Queries will be slower.

InnoDB can store a TEXT column on a separate page, and only retrieve it when necessary (this also means that using SELECT * is needlessly inefficient – it’s almost always better to specify only the columns that are required – this also makes code maintenance easier: you can scan the source code for referenced column names and actually find all relevant code and queries).

TEXT best practice

A TEXT column can contain up to 64k-1 in byte length (4G for LONGTEXT). So essentially a TEXT column can store the same amount of data as a VARCHAR column (since MySQL 5.0), and we know that VARCHAR offers us benefits in terms of server behaviour. Thus, any instance of TEXT should be carefully reviewed and generally the outcome is to change to an appropriate VARCHAR.

Using LONGTEXT is ok, if necessary. If the amount of data is not going to exceed say 16KB character length, using LONGTEXT is not warranted and again VARCHAR (not TEXT) is the most suitable column type.

Summary

Particularly when combined with the best practice of not using SELECT *, using appropriately defined VARCHAR columns (rather than VARCHAR(255) or TEXT) can have a measurable and even significant performance impact on application environments.

Applications don’t need to care, so the db definition can be altered without any application impact.

It is a worthwhile effort.

Posted on 2 Comments
Posted on

On Open Source and Business Choices

Open Source is a whole-of-process approach to development that can produce high-quality products better tailored to users’ real world needs.  A key reason for this is the early feedback cycle built into that complete process.

Simply publishing something under an Open Source license (while not applying Open Source development processes) does not yield the same quality and other benefits.  So, not all Open Source is the same.

Publishing source of a product “later” (for instance when the monetary benefit has diminished for the company) is meaningless.  In this scenario, there is no “Open Source benefit” to users whatsoever, it’s simply a proprietary product. There is no opportunity for the client to make custom modifications or improvements, or ask a third party to work on such matters – neither is there any third party opportunity to verify and validate either code quality or security.

Open Source is not a marketing gimmick.  Labels such as “Open Source”, or “Enterprise”, on their own, do not have any more positive outcome than a greasy hamburger labeled with “healthy”.  If a company “believes” in Open Source software, they’ll use the open source development model for their software development.

And now we see things like this: Uproar: MariaDB Corp. veers away from open source (by Simon Phipps, InfoWorld, August 2016)

So what does it mean when a company publishes some of their software under an open source license, and does some related products under a proprietary license?  To me, it’s generally a strong indication that the company either doesn’t believe in that model, or doesn’t understand it.  And we’ve seen it before.

It also reminds me of an interaction I had many years ago.  A Marketing VP asked me “How can we leverage our [Open Source] community?”  I answered the only possible way: “One does not ‘leverage’ the community, that’s not how it works.”  Of course that wasn’t the answer the VP wanted to hear, but that doesn’t make it less true.  They saw the community as an asset to use, rather than work with.  People don’t like getting used, and in the Open Source space that’s even more true.

Companies that have turned their back on their earlier Open Source work and who have devised some other model to (arguably) make more money, have all discovered that this fundamentally changes their market.  They’ll lose some of their users, customers and supporters, and gain some new different clients.  It’s a different market.  Whether and how that pans out in terms of commercial success is never certain.  Given that we know that the Open Source development process yields benefits in terms of quality and features users want, we can say that non-OSS products lack (some of) those benefits, so to put it bluntly, it’ll be a different product of possibly less quality and the feature set is likely to differ as well.

Naturally we cannot ascertain code quality directly as we can’t review closed code directly, bug systems of proprietary software tends to be closed, changelogs are condensed for marketing purposes, but as far back as a decade and a half there have been independent studies that worked out “lines of code per software flaw” and it came out significantly in favour of Open Source software, having proportionally much fewer bugs.  Bugs also tend to get fixed quicker in Open Source software.  None of this is new(s). see for instance Open-source vs. proprietary software bugs: Which get squashed fastest? (CNET, 2007)

For complete products (libraries are a slightly different beast) with a relatively large market scope, source code being available does not in any way diminish a company’s ability to make money.  Having the core developers, tech writers and support people gives them a significant edge in the open market, and that’s a business asset you can leverage.  You do that by focusing on those aspects in your communications – that’s basic marketing, you draw attention to the positive aspects that make your company/product stand out from the rest.  Clearly, this objective cannot not achieved by force, as you don’t make a (potential) client like or trust you by denying them choice or transparency.

There is one other known option aside from not believing or not understanding, and that’s fear. But fear is an awkward business driver, it makes for very bad decisions.

MariaDB Corp in part uses the Open Source development model, in part they’re an Open Source publisher (in-house work that’s only made available at a later stage in the development process), and now some proprietary product has been added to the mix (actually new versions of an existing product).  Looking at this I am rather unclear about what they believe in.  Of course companies can make business choices as they see fit – but they never operate in a vacuum.  In the end it doesn’t matter much what I believe personally, the market will do what it will – historically, it responds in the various ways as described above.  We’ll see how it pans out.

Open Query does not recommend (or re-sell at all) proprietary tools, as it just doesn’t make sense for us or our clients.  We often do bugfixes and improvements which we contribute upstream – for proprietary tools we can’t do that and thus it becomes a hindrance for us and our clients.  On the specific practical level, we’ve actually never used MaxScale (the product that MariaDB Corp will now sell under different conditions for future versions), and this stems from our experience with its effective predecessor MySQL Proxy.  Having a complex set of scripted logic in a proxy slows down applications and introduces a rather large extra (single) point of potential failure in to infrastructure.   So, while Simon refers to MaxScale as an essential tool for scale-able environments, we know from experience that there are other ways of achieving that desired objective, and without the downsides.

Rather than promoting a single tool for many wildly different jobs, we utilise a few different tools depending on the needs of particular client infrastructure.  We still have a couple of (now legacy) MySQL-MMM deployments, but also quite a few Galera clusters, and other setups as suit our clients’ needs.  Key is to not only make the infrastructure convenient to use for applications, but also to not introduce any more single points of failure.  We build resilience into the client’s server infrastructure, without adding significant overhead in either performance or maintenance requirements.

We believe that that’s what clients want, and since potential clients come to us asking exactly for that (and note our approach with relief) we think that we’re doing the right thing by our clients.  We’ve used this approach for over 9 years, and we’ll just keep on doing that – our basic approach doesn’t change even when our tools do.  If you’d like to talk with us about helping you with your infra, using our approach and way of working, contact us today!

Posted on