When we ask about the up-time requirements set down in SLAs (Service Level Agreements) with our clients’ clients, we hear anything from hours of permitted downtime to the familiar five nines, and these days also simply 100%, with penalties applying otherwise. From my perspective, there’s not much difference between five nines and 100%: 99.999% uptime over a year allows at most a little over 5 minutes of outage. In many cases, this includes scheduled outages!
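To put numbers on that, here is a quick sketch of the yearly downtime budget that a given uptime percentage leaves you. The function name and the 365-day year are my own assumptions for illustration, not something taken from any particular SLA.

```python
# Minutes in a year; assumes a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Maximum outage minutes per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Print the yearly budget for a few common targets.
for target in (99.9, 99.99, 99.999, 100.0):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.2f} minutes/year")
```

Five nines comes out at roughly 5.26 minutes per year, and 100% at zero, so a single half-hour support response already exceeds either budget several times over.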
So we simply cannot have any outages, scheduled or otherwise. Emergency support is not going to help here, because however fast and good they are, by the time they get involved you’re already in serious penalty time or well on your way to not having a business any more. Most will respond within, say, 30 minutes, but then need up to a few hours to resolve the issue. That won’t really help you, will it? And in any case, how are you going to do your maintenance? The answer is that you need to architect things differently.
I do appreciate the difficulty of transitioning away from the corporate tradition of outsourcing the liability along with emergency support, i.e. having someone to call and, if need be, sue. It takes time, both in business processes and in actual architecture, to make things resilient. But really, if those are the SLAs you agree on with your clients, that’s what has to be done.
Anyway, aiming for resilience (expecting things to break, but building infrastructure so that it can cope) rather than purchasing many-9s is, I think, a better focus. This is because making an individual component even more reliable becomes prohibitively expensive, whereas having more servers is relatively cheap. That’s simple economics.
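A rough way to see why more servers beat a better server is the textbook formula for redundant components. The sketch below assumes that server failures are independent and that one surviving server is enough to keep the service up; both are simplifications (real failures are often correlated, and failover takes time), and the 99% figure is purely illustrative.

```python
def combined_availability(single: float, servers: int) -> float:
    """Availability of a group of redundant servers.

    Assumes independent failures and that the service stays up as long as
    at least one server is up, so the group is down only when all are down.
    """
    return 1 - (1 - single) ** servers


# Illustrative only: each server on its own is a modest "two nines".
for n in (1, 2, 3):
    print(f"{n} server(s): {combined_availability(0.99, n):.6f}")
```

Under those assumptions, two 99% servers give about 99.99% and three give about 99.9999%, which is the kind of gain that is very expensive to buy in a single component.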