This is a controversial angle, and I know that when I put it like this, many people will vehemently disagree with it. Late last week, a “very high end dual site / fully redundant SAN system” at Internode failed, causing serious disruption to the ISP’s various services. Internode are one of Australia’s big “good guys” in ISP land and, apart from being managed by an insightful individual (Simon Hackett), they really do know their stuff technically.
I’ve called SANs “very expensive single points of failure”. Sure, they have lots of redundancy built in, and in Internode’s case the system was even physically distributed across multiple data centres. Still, something went wrong. That’s because there’s an abundance of “interesting” ways to fail that are just about impossible to handle automatically. So yes, SANs do have a very high uptime rating, but since big chunks of a business depend on one, any failure is quite spectacular.
Is there an alternative? Yes. I’m not sure who invented the new paradigm, but Brad Fitzpatrick at Danga/LiveJournal definitely pioneered some of it. The revolutionary thought is to accept partial failure and build your infrastructure accordingly. Eek! If you have more choices than merely “100% up” or “all down”, your architecture might look very different. At LiveJournal, some accounts may not be available for a bit, but the rest are. Think about it: a world of opportunity opens up! Apart from disaster management, this even makes maintenance much easier.
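To make that concrete, here’s a minimal sketch of the idea (my own illustration in Python; the class, the hashing scheme and all the names are assumptions, not LiveJournal’s actual code): user data is partitioned across independent shards, so when one shard goes down, only the users mapped to it are affected while everyone else is served normally.

```python
# Sketch of "accept partial failure": users are partitioned across
# independent shards, so losing one shard degrades service for a
# fraction of users instead of taking the whole site down.
# All names here are illustrative, not from any real system.
import hashlib


class ShardUnavailable(Exception):
    """The shard holding this user's data is currently down."""


class ShardedUserStore:
    def __init__(self, shards):
        self.shards = shards   # plain dicts standing in for DB servers
        self.down = set()      # indices of shards currently offline

    def _shard_for(self, user_id):
        # Stable hash, so a user always lands on the same shard.
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def get(self, user_id):
        idx = self._shard_for(user_id)
        if idx in self.down:
            # Partial failure: this user is temporarily unavailable,
            # but users on the other shards are unaffected.
            raise ShardUnavailable("shard %d is offline" % idx)
        return self.shards[idx].get(user_id)

    def put(self, user_id, data):
        idx = self._shard_for(user_id)
        if idx in self.down:
            raise ShardUnavailable("shard %d is offline" % idx)
        self.shards[idx][user_id] = data


store = ShardedUserStore([{}, {}, {}])
store.put("alice", {"posts": 42})
# Take a shard *other than* alice's offline to simulate a failure.
store.down.add((store._shard_for("alice") + 1) % 3)
print(store.get("alice"))  # alice's shard is fine: still served
# A user whose id maps to the offline shard gets ShardUnavailable
# instead of the entire service going dark.
```

The point isn’t the hashing; it’s that the failure domain becomes a slice of your users rather than all of them, which is exactly the trade-off a monolithic SAN can’t offer.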
In the online world, 100% uptime for everything is just downright impossible – so: accept it, change focus, build accordingly. And in that new world, SANs simply don’t fit. They can certainly be useful in some other situations, but not nearly as often as people reckon. It’s just one of those tools that people pick by default, much like people often pick Oracle for a database even when a flat file would do better for the purpose (see, I’m not plugging MySQL for that, either 😉)