Saturday, September 19, 2009

Lousy HA is Not Necessarily Better Than No HA

In fact, it can be worse.  And no, I’m not just being a retentive purist.  Hear me out.

High-Availability is the ultimate goal of most corporate IT departments.  It’s annoyingly measured and spouted by counting the “number of nines” that your HA solution delivers.  And there is no shortage of technologies designed to help you obtain your uptime goals: Multipathed SANs, Mirrored Multipathed SANs, Clusters, Geographically-Dispersed Clusters, ESX Farms….the list goes on.  For the most part, the vendors of these technologies aren’t foolish enough to claim that their solutions alone are enough to deliver on the promise of uninterrupted uptime, but somewhere along the line all of the HA-hype has done a number on the collective consciousness of the IT world. 

“You’re running a cluster?  You’re lucky – you must sleep well at night!”


At one time, there was an entry barrier into the world of high-availability that stopped most companies from venturing into this space: cost.  Quite simply, it cost a bloody fortune to throw together, for instance, an MSCS cluster.  For starters, the entire cluster solution had to be certified for running an MSCS cluster.  Not the individual components, but the combination of components, right down to your SAN firmware version.  That reduced your choices to a handful of (usually very expensive) solutions.  But this wasn’t so bad – it just weeded out “real” availability requirements from “fake” ones.  Application owners could request five nines of availability, but when they saw the price tag, suddenly they could somehow afford more than five minutes of downtime per year.  That’s not to say that all HA requirements were BS – just that costs and budgets kept requirements in check.
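To put numbers on the "nines", here is a quick back-of-the-envelope sketch in Python. It assumes an average year of 365.25 days; the function name is made up for illustration.

```python
# Average year, leap years included: 525,960 minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        budget = downtime_minutes_per_year(n)
        print(f"{n} nines: {budget:,.1f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why five nines leaves only a little over five minutes per year.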

Thanks to the incessant advancement of tecknowledgee, HA solutions are becoming more commonplace.  For instance, with Windows Server 2008 Failover Clustering, you no longer need a certified cluster solution.  Sure, each component needs to be suitable for a cluster, and the entire cluster needs to pass certain validation tests, but the requirements are significantly relaxed from the days of MSCS clusters.

This is great news, right?  It’s the modern day equivalent of a chicken in every pot!  A cluster in every server room!  What’s that?  You don’t have a server room?  Well, a cluster in every supply closet!

OK, I might be getting a bit carried away, but you get the picture.

So where’s the problem with all of this?  I mean…isn’t cheaper better?  And of course it is, I say.  But we can’t get ahead of ourselves. 

There is no shortcut to high-availability. 

Before we start promising to deliver uptime out the wazoo, we have to learn to think for high-availability.  From a 40,000 foot view, it’s hard to imagine that implementing a high-availability mechanism can actually cause damage, but it can.  Let’s take a look at some of the side-effects of half-assed HA implementations, shall we?

1. You can actually reduce the uptime of your applications
Say what?  How can implementing an HA technology reduce the uptime of your applications?  Quite easily, actually.  Let’s take a typical “dual everything” server.  Dual NICs, dual power supplies, dual RAID controllers, a direct-attached RAID array.  You can survive the failure of most of the “fragile” bits of your server, replace the offending hardware, and be back in business.  But there are a few single points of failure…the OS for one, the motherboard for another.  This isn’t ideal – we’ll never make five nines like this!  We need more HA!  We need (drumroll please) a cluster!

So you implement a “simple” active/passive 2 node cluster.  A couple of servers, or more commonly blades.   An iSCSI or fibre channel SAN.  You roll it into production, and everything is beautiful.  Set it and forget it, right?

Wrong.  Do you have anyone qualified to manage that SAN?  Do you have anyone who knows a cluster from their elbow?  Are you monitoring your new technological valhalla?  No?

OK…so let’s say you neglected to enable multipathing on the SAN.  Now you have far more single points of failure than you did with your standalone server.  Or let’s say that, in trying your darndest to create a new LUN, you end up clearing the SAN configuration.  Or a routine firmware flash fails, and your SAN won’t come online.  Or let’s say that you’re not monitoring cluster failover events, so you don’t even notice that node 1 went offline in the middle of the night.  Trust me…you’ll notice when node 2 does the same.

You see, a wise person once said “complexity is the enemy of security”.  An equally wise person stole (er, borrowed) the phrase and applied it to availability – for complexity is indeed the enemy of availability.  Without the skills and infrastructure to back up an HA solution, the complexity that you’re introducing can actually decrease your availability!
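A rough way to see this is with the textbook series/parallel availability formulas. The sketch below assumes that failures are independent, and the availability figures are invented purely for illustration; they are not measurements of any real server or SAN.

```python
def series(*availabilities: float) -> float:
    """Every component must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Any one component is enough: 1 minus the chance that all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical figures, chosen only to illustrate the point.
server = 0.999               # a well-understood "dual everything" server
badly_managed_san = 0.995    # one shared SAN, no multipathing, nobody trained on it

standalone = server
cluster = series(parallel(server, server), badly_managed_san)

if __name__ == "__main__":
    print(f"standalone server: {standalone:.6f}")
    print(f"two-node cluster on one shaky SAN: {cluster:.6f}")
```

The two nodes together look fantastic on paper, but the shared SAN sits in series with them, so with these made-up numbers the cluster ends up less available than the single server it replaced.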

2. You can cripple your DR strategy
Half-assed HA can hurt your disaster recovery strategy as well, in two important ways:
  i. “We have (clustering/SAN mirroring/multipathing)!  We don’t need a DR strategy!”. 
Wrongo.  Nothing replaces backups, and even the best HA solution will call on backups from time to time.  Don’t believe me?  Well, how does your cluster help you when your database is suspect?  Toldja so.
  ii. “We’re backing up to a network share every five minutes!”.  Great!  Did you check to make sure that said “network share” isn’t using the same storage subsystem that your production servers are hosted on?

3. You can create false confidence
A business that has faith in its “highly available” infrastructure will learn to lean on it more and more.  And this is a good thing – it means that the technology we all work so hard to implement and maintain is paying dividends.  But to your users, promised HA is the same thing as real HA.  They trust that, when you promise them 99.999% uptime and <5 minutes of data loss, you know what you’re talking about.  So much so that they may choose not to develop backup plans should the unthinkable happen.  And when it does happen, the business can be seriously injured (or even destroyed), because they can’t meet their contractual obligations/can’t meet reporting deadlines/can’t ship their product.  Think about it before you promise the sun, the moon, and the stars, because…

4. You can be out of work
You can be the CIO’s poster child, but if you promise something you can’t deliver, and the fallout seriously impacts the business, you had better have your CV up to date.  Have you really thought about what it takes to deliver any measure of high-availability?  Let’s take a typical SQL Server database application.  What does it depend upon?  For starters, the obvious: the availability of the database.  Which depends upon the OS and physical hardware being up and functioning.  That’s it, right?  Not quite.  How about:
  - The physical network connecting your application to your clients
  - The application servers and/or terminal servers, and all of their dependencies
  - DNS, DHCP, and Active Directory (what good is an application if your Windows Authenticated users can’t log in?)
  - The security of your application infrastructure (an application that is down because of a hacker or a disgruntled employee is no more available than an application that is down because of an infrastructure failure)
  - The power that runs the whole shebang

And there may be more, depending upon your environment.  “But wait!” you say, “Those things aren’t my problem!  I’m just a DBA!”.  True, you may be “just a DBA”, but have you documented your dependencies on items that are outside of your control?  Have you obtained SLAs from your network admins, your security admins, your SAN admins… SLAs that support the SLA that you delivered to your application owners?  Trust me – the corporate chopping block will be much more sympathetic to your plight if you have already documented your dependencies on external factors, before the proverbial excrement hits the fan.  Any amount of finger pointing after the fact comes across as just that – finger pointing.  And good managers don’t brook lousy excuses.
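One lightweight way to document those dependencies is a simple register that records who owns each one and what availability, if any, they have committed to. The sketch below is only an illustration: the dependency names, owners and SLA figures are hypothetical, and the "ceiling" assumes independent failures.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional


@dataclass
class Dependency:
    name: str
    owner: str
    sla: Optional[float]  # committed availability, or None if nobody has signed up to one


# Hypothetical register for a SQL Server application.
DEPENDENCIES = [
    Dependency("SQL Server cluster", "DBA team", 0.9995),
    Dependency("Network", "Network admins", 0.999),
    Dependency("Active Directory / DNS", "Infrastructure team", 0.9995),
    Dependency("SAN", "Storage admins", None),
    Dependency("Power", "Facilities", None),
]


def undocumented(deps: List[Dependency]) -> List[str]:
    """Dependencies with no SLA on record: chase these before promising anything."""
    return [d.name for d in deps if d.sla is None]


def availability_ceiling(deps: List[Dependency]) -> float:
    """Best case you could promise, counting only the documented SLAs."""
    return prod(d.sla for d in deps if d.sla is not None)


if __name__ == "__main__":
    print("No SLA on record:", ", ".join(undocumented(DEPENDENCIES)))
    print(f"Ceiling from documented SLAs: {availability_ceiling(DEPENDENCIES):.5f}")
```

Even with only three documented dependencies, the ceiling in this made-up example is already below three nines, and the undocumented ones can only drag it lower.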


Now, let’s not misconstrue the message.  I’m not saying for a second that you shouldn’t try to implement highly available infrastructures in your environment.  Nor am I saying that you’re an idiot if you have rolled out a half-assed infrastructure.  The message I’m trying to impart is that you need to examine your “highly available” solution from every possible angle before a disaster makes you wish that you had.

See you next time.


SQLRockstar said...

Brilliant post Aaron. I especially like the parts about how HA is not the same as DR; too often those concepts are blurred or confused.

dledwards said...

"Aaron Alton, the HOBT, says lousy HA is not necessarily better than no HA. Ha! [...]"

Log Buffer #163

Hugo Shebbeare said...

Nice points Aaron, thank you. Congrats on the invite to Adam is great guy, met him several times and he even sent me his Expert SQL programming - great read.
I have spent a fair amount of time on Disaster Recovery over the past year, and here's a method I used at CDP Capital (now generic, for SQLbackup):

Aaron Alton said...

Hi Hugo,

Thanks for the comment. Looks like you've published a great (and lengthy) article - I now have my work cut out for me tonight ;)