
Today’s up-time requirements

When asking about the up-time requirements set down in SLAs (Service Level Agreements) with our clients’ clients, we’d hear anything ranging from hours to the familiar five nines, but these days also simply 100%, with penalties applying otherwise. From my perspective, there’s not much difference between five nines and 100%: 99.999% uptime over a year allows a maximum of a little over 5 minutes of outage. In many cases, this includes scheduled outages!
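To make that arithmetic concrete, here is a minimal sketch that converts an availability target into a yearly outage budget. The function name `downtime_budget_minutes` is purely illustrative, and the calculation assumes a 365-day year.

```python
# Yearly outage budget implied by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum minutes of outage per year the target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 100.0):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {budget:.2f} minutes of outage per year")
```

Five nines works out to roughly 5.26 minutes per year, which is why a single emergency call-out with a 30-minute response time already blows the whole budget.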

So we simply cannot have any outages, scheduled or otherwise. Emergency support is not going to help here because, however fast and good they are, you’re already in serious penalty time or well on your way to not having a business any more. Most will respond within, say, 30 minutes, but then need up to a few hours to resolve the issue. That won’t help you, really, will it? And in any case, how are you going to do your maintenance? The answer is, you need to architect things differently.

I do appreciate the difficulty of transitioning away from the corporate tradition of outsourcing the liability along with emergency support, i.e. having someone to call and, if need be, sue. It takes time, both in business processes and in actual architecture, to make things resilient. But really, if those are the SLAs you agree on with your clients, that’s what has to be done.

Anyway, aiming for resilience (expecting things to break, but building infrastructure so that it can cope) rather than purchasing many-9s is, I think, a better focus. This is because making an individual component even more reliable becomes prohibitively expensive, whereas having more servers is relatively cheap. That’s simple economics.
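The economics can be illustrated with the textbook formula for redundant components in parallel. This is a rough sketch under strong assumptions that the post itself does not make: servers fail independently, any single surviving server can carry the full load, and failover is instant and flawless. Real systems fall short of all three, so treat the numbers as an upper bound. The function names are illustrative only.

```python
# Availability of N redundant servers, where the service is up as long
# as at least one server is up.
# Assumptions (idealised): independent failures, any one server can
# carry the full load, and failover is instant and never fails.


def combined_availability(single: float, servers: int) -> float:
    """Probability that at least one of `servers` identical servers is up."""
    return 1 - (1 - single) ** servers


def servers_needed(single: float, target: float, limit: int = 100) -> int:
    """Smallest server count whose combined availability meets `target`."""
    for count in range(1, limit + 1):
        if combined_availability(single, count) >= target:
            return count
    raise ValueError("target not reachable within the server limit")


if __name__ == "__main__":
    for count in range(1, 5):
        print(f"{count} server(s) at 99%: {combined_availability(0.99, count):.6%}")
    print("servers needed for five nines:", servers_needed(0.99, 0.99999))
```

Under these idealised assumptions, three modest 99% servers already exceed five nines on paper, whereas buying a single component that is itself five-nines reliable is a very different price bracket.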

One thought on “Today’s up-time requirements”

  1. The super-high expectations of many corporations are based on unobtainable results. In reality, anything programmed by a human being is subject to unanticipated events, and to the same frailties as any human being.
    While 100% accuracy and uptime may be a desirable goal to strive for, the probability is that something will be overlooked or unaccounted for. The reality is that failure will occur, even if the cause is not a piece of software.
    I agree that organizing infrastructure to deal with potential outages, with built-in resilience, is necessary. Too many corporations raise the bar far beyond their own capabilities as human beings; whether this reflects a lack of knowledge or of understanding is open to question. And many are willing to sue regardless of their own abilities.
    Backend server redundancy is the only way to minimize outages and hopefully achieve a perceived five nines or 100% uptime; the term that comes to mind is server farms connected in numbers.
