A SAN is a single point-of-failure, too

This is a controversial angle, and when I put it like this I know many people vehemently disagree with it. Late last week, a “very high end dual site / fully redundant SAN system” at Internode failed, causing serious disruption to this ISP’s various services. Internode are one of Australia’s big “good guys” in ISP land and, apart from being managed by an insightful individual (Simon Hackett), they really do know their stuff technically.

I’ve called SANs “very expensive single points of failure”. Sure, they have lots of redundancy built in, and in the case of Internode it was even physically distributed across multiple data centres. Still, something went wrong. This is because there’s an abundance of “interesting” ways to fail that are just about impossible to deal with automatically. So, SANs do have a very high uptime rating, but since big chunks of a business depend on them, any failures are quite spectacular.

Is there an alternative? Yes. I’m not sure who invented the new paradigm, but Brad Fitzpatrick at Danga/LiveJournal definitely pioneered some of it. The revolutionary thought is to accept partial failure and build your infrastructure accordingly. Eek! If you have more choices than merely “100% up” or “all down”, your architecture might look very different. At LiveJournal, some accounts may not be available for a bit, but the rest are. Think about this: a world of opportunity opens up! Apart from disaster management, this even makes maintenance much easier.
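To make the “accept partial failure” idea concrete, here is a minimal sketch in Python. It is not LiveJournal’s actual implementation: the class names (Shard, Cluster), the hash-based placement and the in-memory dictionaries are all assumptions made for illustration. Each user lives on exactly one independent storage unit, so losing a unit only affects the users placed on it.

```python
import hashlib


class ShardUnavailable(Exception):
    """Raised when the storage unit holding a user's data is down."""


class Shard:
    """Hypothetical stand-in for one independent storage unit."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def get(self, key):
        if not self.up:
            raise ShardUnavailable(self.name)
        return self.data.get(key)

    def put(self, key, value):
        if not self.up:
            raise ShardUnavailable(self.name)
        self.data[key] = value


class Cluster:
    """Places each user on one shard and reports partial outages per user."""

    def __init__(self, shard_count):
        self.shards = [Shard(f"shard-{i}") for i in range(shard_count)]

    def shard_for(self, user):
        # Stable hash, so a user always maps to the same shard (an assumption;
        # a lookup table would work just as well).
        digest = hashlib.sha256(user.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def store(self, user, value):
        self.shard_for(user).put(user, value)

    def fetch(self, user):
        # Return a status instead of failing the whole request path:
        # the caller can show "temporarily unavailable" for this user only.
        try:
            return ("ok", self.shard_for(user).get(user))
        except ShardUnavailable:
            return ("unavailable", None)


if __name__ == "__main__":
    cluster = Cluster(4)
    users = [f"user{i}" for i in range(100)]
    for u in users:
        cluster.store(u, {"name": u})

    cluster.shards[0].up = False  # simulate one unit failing
    statuses = [cluster.fetch(u)[0] for u in users]
    print(statuses.count("ok"), "users still served,",
          statuses.count("unavailable"), "temporarily unavailable")
```

The point of the design is the return value of fetch: a failed unit becomes a per-user condition the application can handle, rather than an outage for everyone. The same property is what makes maintenance easier, since one unit can be taken down deliberately while the others keep serving.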

In the online world, 100% uptime for everything is just downright impossible – so: accept it, change focus, build accordingly. And in that new world, SANs simply don’t fit. They can certainly be useful in some other situations, but not nearly as often as people reckon. It’s just one of those tools that people pick by default, just like people often pick Oracle for a database, even when a flat file would do better for the purpose (see, not plugging MySQL for that, either ;-)

10 Responses to “A SAN is a single point-of-failure, too”

  1. We use SANs in our environment as well, and as long as you design the storage processors and network components of your SAN with the same thought in mind (that partial failures are acceptable), you can lay out your applications/components in such a way that they are all fully contained across separate storage processors and network components. Thus, they would not be any more a single point of failure than your network (which, with enough foresight and cash, can be set up in the same way).

    Usually, however, SANs are looked at as a ‘shared resource’, and managed in that way, instead of carefully deciding which LUNs from which storage processors and RAID groups get presented to which servers through which switches in terms of application failure scenarios. It takes a lot more work on the SAN admin side to work things this way, but it can be done.

  2. Finally someone with a brain.

    As soon as you’ve integrated it down to one system, whoops you need to replicate it for when it fails.

  3. I would credit Richard Hamming, an American mathematician at Bell Labs. He pioneered many concepts used in computer science and telecommunications, especially applied to error detection and correction. The concept of partial data failure instead of complete failure can be traced back to his work.

    But I get it that you’re referring more specifically to application of these concepts to online application availability. :-)

  4. The question becomes: why go SAN at all? It tends to be quite costly.
    Other solutions may be cheaper.

  5. Businesses tend to go with a SAN if they have a large number of servers that they wish to provision storage to in an efficient manner. Or at least, to have the potential to do so (not all of them actually manage it!).

    Yes, it’s costly. That’s because SANs are a solution to a certain class of availability problems, and – frequently – SOHO environments don’t have those same issues.

    As with anything of this ilk, getting somebody with a clue to architect it before you purchase and configure it is an essential part of having it work. You get what you pay for, whether that’s kit or implementation.

    Yes, other solutions may be cheaper, but then again, the other solutions might not meet your actual requirements.

  6. Well, depending on how much you are prepared to spend, you can have another SAN and set up replication to it. In the Oracle world we create a physical standby, and for MySQL that would be master-slave replication. Right?

    The balance triangle in this case would be:
    - availability
    - cost
    - update performance

  7. Replication can help with failover or read-scaling, but it does not necessarily produce a standby server that’s current enough. Depends on the requirements, of course.
    DRBD can be a good option (for Oracle too).

    But, the issue is also budget. Having two SANs just doubles the cost there, plus additional infrastructure on top of that. Jeez! Plus, did you see the original post, where a multi-datacenter SAN failed….

  8. User referenced to your post from Can I have your horror-stories, please? (SANs and VMs) saying: [...] and when it was all working again (if ever). Thanks! This somewhat relates to the earlier post A SAN is a single point-of-failure, too. Somehow people get into scenarios where highly virtualised environments with SANs get things like … [...]

  9. (followed from your more recent article looking for horror stories)

    SANs don’t have to be expensive. I wouldn’t go so far as to call that notion a myth, but certainly $25K USD per SAN including switch, HBAs, spindles, etc is not expensive in the grand scheme of things, and we’ve acquired several low-end EMC SANs for well under that price. I’m pretty sure you could easily pick up a shelf of 750G disks + controller heads for about $15K USD these days.

    Also, it’s reasonable to situationally leverage existing SANs to bolster database infrastructures. For instance, we’ve dual-purposed several of our SANs which were originally purchased as large-capacity storage devices (many large, moderately fast spindles wherein the concern was space rather than speed) by carving out a few dozen gigs from each bulk storage spindle to create a very fast set of multi-spindle volumes to accommodate some of our databases. 20G * 14, 28 or more spindles across multiple shelves makes for a very fast, reasonably large storage solution for a growing, high-volume database.

  10. [...] somewhat relates to the earlier post A SAN is a single point-of-failure, too. Somehow people get into scenarios where highly virtualised environments with SANs get things like [...]
