Posted on

RDS Aurora MySQL and Service Interruptions

In Amazon space, any EC2 or Service instance can “disappear” at any time.  Depending on which service is affected, the service will be automatically restarted.  In EC2 you can choose whether an interrupted instance will be restarted, or left shutdown.

For an Aurora instance, an interrupted instance is always restarted. Makes sense.

The restart timing, and other consequences during the process, are noted in our post on Aurora Failovers.

Aurora Testing Limitations

As mentioned earlier, we love testing “uncontrolled” failovers.  That is, we want to be able to pull any plug on any service, and see that the environment as a whole continues to do its job.  We can’t do that with Aurora, because we can’t control the essentials:

  • power button;
  • reset switch;
  • ability to kill processes on a server;
  • and the ability to change firewall settings.

In Aurora, an instance is either running, or will (again) be running shortly.  So that we know.  Aurora MySQL also offers some commands that simulate various failure scenarios, but since they are built-in we can presume that those scenarios are both very well tested, as well as covered by the automation around the environment.  Those clearly defined cases are exactly the situations we’re not interested in.

What if, for instance, a server accepts new connections but is otherwise unresponsive?  We’ve seen MySQL do this on occasion.  Does Aurora catch this?  We don’t know and  we have no way of testing that, or many other possible problem scenarios.  That irks.

The Need to Know

If an automated system is able to catch a situation, that’s great.  But if your environment can end up in a state such as described above and the automated systems don’t catch and handle it, you could be dead in the water for an undefined amount of time.  If you have scripts to catch cases such as these, but the automated systems catch them as well, you want to be sure that you don’t trigger “double failovers” or otherwise interfere with a failover-in-progress.  So either way, you need to know and and be aware whether a situation is caught and handled, and be able to test specific scenarios.

In summary: when you know the facts, then you can assess the risk in relation to your particular needs, and mitigate where and as desired.

A corporate guarantee of “everything is handled and it’ll be fine” (or as we say in Australia “She’ll be right, mate!“) is wholly unsatisfactory for this type of risk analysis and mitigation exercise.  Guarantees and promises, and even legal documents, don’t keep environments online.  Consequently, promises and legalities don’t keep a company alive.

So what does?  In this case, engineers.  But to be able to do their job, engineers need to know what parameters they’re working with, and have the ability to test any unknowns.  Unfortunately Aurora is, also in this respect, a black box.  You have to trust, and can’t comprehensively verify.  Sigh.

Posted on