Posted on September 22, 2008 by Arjen Lentz — 11 Comments

A SAN is a single point-of-failure, too

This is a controversial angle, and when I put it like this I know many people vehemently disagree with it. Late last week, a “very high end dual site / fully redundant SAN system” at Internode failed, causing serious disruption to this ISP’s various services. Internode are one of Australia’s big “good guys” in ISP land and, apart from being managed by an insightful individual (Simon Hackett), they really do know their stuff technically.

I’ve called SANs “very expensive single points of failure”. Sure, they have lots of redundancy built in, and in the case of Internode it was even physically distributed across multiple data centres. Still, something went wrong. This is because there’s an abundance of “interesting” ways to fail that are just about impossible to deal with automatically. So, SANs do have a very high uptime rating, but since big chunks of a business depend on them, any failures are quite spectacular.

Is there an alternative? Yes. I’m not sure who invented the new paradigm, but Brad Fitzpatrick at Danga/LiveJournal definitely pioneered some of it. The revolutionary thought is to accept partial failure and build your infrastructure accordingly. Eek! If you have more choices than merely “100% up” or “all down”, your architecture might look very different. At LiveJournal, some accounts may not be available for a bit, but the rest are. Think about this: a world of opportunity opens up! Apart from disaster management, this even makes maintenance much easier.
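To make the idea concrete, here is a minimal sketch of “some accounts down, the rest up”. It is not LiveJournal’s actual code: the `Shard` and `Cluster` classes, the modulo placement rule and the sample data are all assumptions made up for illustration. Each account lives on exactly one storage node, so losing a node only affects the accounts placed on it.

```python
from dataclasses import dataclass, field


class ShardUnavailable(Exception):
    """Raised when the shard holding an account is down (hypothetical)."""


@dataclass
class Shard:
    # One independent storage node; illustrative only.
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def get(self, user_id):
        if not self.up:
            raise ShardUnavailable(self.name)
        return self.data.get(user_id)


class Cluster:
    def __init__(self, shards):
        self.shards = shards

    def shard_for(self, user_id):
        # Assumed placement rule: simple modulo over the shard list.
        return self.shards[user_id % len(self.shards)]

    def fetch(self, user_id):
        # A failed shard degrades only the accounts stored on it.
        try:
            return ("ok", self.shard_for(user_id).get(user_id))
        except ShardUnavailable:
            return ("unavailable", None)


cluster = Cluster([Shard("a"), Shard("b"), Shard("c")])
for uid in range(6):
    cluster.shard_for(uid).data[uid] = f"journal-{uid}"

cluster.shards[1].up = False  # simulate one node failing

results = {uid: cluster.fetch(uid)[0] for uid in range(6)}
print(results)
```

With shard “b” marked down, only the accounts that map to it report as unavailable; everyone else carries on. The same switch is what you would flip to take a node out for maintenance.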

In the online world, 100% uptime for everything is just downright impossible – so: accept it, change focus, build accordingly. And in that new world, SANs simply don’t fit. They can certainly be useful in some other situations, but not nearly as often as people reckon. It’s just one of those tools that people pick by default, just like people often pick Oracle for a database, even when a flat file would do better for the purpose (see, not plugging MySQL for that, either 😉).

11 thoughts on “A SAN is a single point-of-failure, too”

anonymous
September 23, 2008

We use SANs in our environment as well, and as long as you design the storage processors and network components of your SAN with the same thought in mind (that partial failures are acceptable), then you can lay out your applications/components in such a way that they are all fully contained across separate storage processors and network components. Thus, they would not be any more a single point of failure than your network (which, with enough foresight and cash, can be set up in the same way too).

Usually, however, SANs are looked at as a ‘shared resource’, and managed in that way, instead of carefully deciding which LUNs from which storage processors and RAID groups get presented to which servers through which switches in terms of application failure scenarios. It takes a lot more work on the SAN admin side to work things this way, but it can be done.
laptop006
September 23, 2008

Finally someone with a brain.

As soon as you’ve integrated it down to one system, whoops, you need to replicate it for when it fails.
bkarwin
September 23, 2008

I would credit Richard Hamming, an American mathematician at Bell Labs. He pioneered many concepts used in computer science and telecommunications, especially applied to error detection and correction. The concept of partial data failure instead of complete failure can be traced back to his work.

But I get it that you’re referring more specifically to application of these concepts to online application availability. 🙂
arjen
September 23, 2008

The question becomes, why go SAN at all? It tends to be quite costly.
Other solutions may be cheaper.
anonymous
September 25, 2008

Businesses tend to go with a SAN if they have a large number of servers that they wish to provision storage to in an efficient manner. Or at least, to have the potential to do so (not all of them actually manage it!).

Yes, it’s costly. That’s because SANs are a solution to a certain class of availability problems, and – frequently – SOHO environments don’t have those same issues.

As with anything of this ilk, getting somebody with a clue to architect it before you purchase and configure it is an essential part of having it work. You get what you pay for, whether that’s kit or implementation.

Yes, other solutions may be cheaper, but then again, the other solutions might not meet your actual requirements.
anonymous
September 28, 2008

Well, depending on how much you are prepared to spend, you can have another SAN and set up some form of replication. In the Oracle world we create a physical standby, and for MySQL that would be master-slave replication. Right?

The balance triangle in this case would be:
– availability
– cost
– update performance
arjen
September 29, 2008

Replication can help with failover or read-scaling, but it does not necessarily produce a standby server that’s current enough. Depends on the requirements, of course.
DRBD can be a good option (for Oracle too).

But, the issue is also budget. Having two SANs just doubles the cost there, plus additional infrastructure on top of that. Jeez! Plus, did you see the original post, where a multi-datacenter SAN failed….
pingback_bot
March 13, 2009

User referenced to your post from Can I have your horror-stories, please? (SANs and VMs) saying: […] and when it was all working again (if ever). Thanks! This somewhat relates to the earlier post A SAN is a single point-of-failure, too. Somehow people get into scenarios where highly virtualised environments with SANs get things like … […]
gentlemoose
March 13, 2009

(followed from your more recent article looking for horror stories)

SANs don’t have to be expensive. I wouldn’t go so far as to call that notion a myth, but certainly $25K USD per SAN including switch, HBAs, spindles, etc is not expensive in the grand scheme of things, and we’ve acquired several low-end EMC SANs for well under that price. I’m pretty sure you could easily pick up a shelf of 750G disks + controller heads for about $15K USD these days.

Also, it’s reasonable to situationally leverage existing SANs to bolster database infrastructures. For instance, we’ve dual-purposed several of our SANs which were originally purchased as large-capacity storage devices (many large, moderately fast spindles wherein the concern was space rather than speed) by carving out a few dozen gigs from each bulk storage spindle to create a very fast set of multi-spindle volumes to accommodate some of our databases. 20G * 14, 28 or more spindles across multiple shelves makes for a very fast, reasonably large storage solution for a growing, high-volume database.
Can I have your horror-stories, please? (SANs and VMs) « Open Query blog
April 2, 2009

[…] somewhat relates to the earlier post A SAN is a single point-of-failure, too. Somehow people get into scenarios where highly virtualised environments with SANs get things like […]
Luxbet, MariaDB and Melbourne Cup | Open Query blog
April 22, 2014

[…] deliver the resilience and performance required. This may seem odd, but remember that a) a SAN is also a single point of failure (so when the SAN fails, multiple db servers will be “out” – not desirable even […]

Comments are closed.
