Five nines reliability standard


Uptime. It’s all about uptime; ask any sysadmin.

Components fail and networks go down. Power goes out. Users download viruses onto systems. Apophis could go to eleven on the Torino impact hazard scale, smack right into Euro Disney, and the call from marketing would be to ask when the servers are going to be back up.

The Black Swan has a nasty habit of smacking us around, so we come up with contingencies and try to engineer redundancy. Redundancy has costs, but the question remains: how critical is the underlying system? Can you afford, in the grand scheme of things, for that component to fail?

RAID, quad-port LAN, twin blades, dual UPS, remote backups. Every layer adds complexity and cost, but if reliability is the goal for mission-critical applications, then those complexities have value.
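As a rough illustration of what redundancy buys, here is a minimal sketch of the standard availability arithmetic. It assumes failures are independent and that any one copy of a redundant component can carry the load, which real hardware sharing a rack or a power feed may not satisfy. The function names and the 99% figure are hypothetical, chosen only for the example.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent
    failures and that a single surviving copy keeps the service up."""
    return 1 - (1 - component_availability) ** copies


def serial_availability(layer_availabilities: list[float]) -> float:
    """Availability of a stack where every layer must be up at once."""
    total = 1.0
    for a in layer_availabilities:
        total *= a
    return total


# Hypothetical numbers: one 99% power supply versus a redundant pair.
single = parallel_availability(0.99, 1)
paired = parallel_availability(0.99, 2)
print(f"single: {single:.4f}, paired: {paired:.4f}")

# Stacking layers in series pulls the total below the weakest layer.
print(f"three 99.9% layers in series: {serial_availability([0.999] * 3):.4f}")
```

Under those assumptions a second copy turns two nines into roughly four, while every extra layer the service depends on drags the total back down. That is the trade the budget meeting is really about.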

Achieving five nines reliability, for most, is impractical and cost-prohibitive. Vendor claims of five nines often do not distinguish between availability and reliability. Availability is the share of total time the product was up. Reliability is about how many times the product went down. A single long power outage can leave you with a system that is reliable, having failed only once, but unavailable for hours.
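The distinction is easier to see with numbers. The sketch below uses two made-up outage logs (the durations are assumptions for illustration, not measurements) and computes both figures over a 365-day year: availability as the fraction of time up, and reliability as a simple count of failures.

```python
YEAR_SECONDS = 365 * 24 * 60 * 60  # assumes a 365-day year


def availability(outage_seconds: list[float], period_seconds: float = YEAR_SECONDS) -> float:
    """Fraction of the period the system was up."""
    return 1 - sum(outage_seconds) / period_seconds


def failure_count(outage_seconds: list[float]) -> int:
    """Number of times the system went down, regardless of duration."""
    return len(outage_seconds)


# Hypothetical logs: one 8-hour power outage versus 100 three-second blips.
one_long_outage = [8 * 60 * 60]
many_short_blips = [3] * 100

for name, log in [("one long outage", one_long_outage),
                  ("many short blips", many_short_blips)]:
    print(f"{name}: {failure_count(log)} failure(s), "
          f"availability {availability(log):.6f}")
```

The first system failed once and still misses four nines of availability. The second failed a hundred times and still clears five nines.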

To qualify as five nines, total downtime cannot exceed about 5 minutes 15 seconds a year, which works out to roughly six seconds a week. In practice that means any problem has to be resolved in seconds, not minutes. This is where you try to justify the expense of backups in your budget meetings.
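The yearly downtime budget can be checked with simple arithmetic: multiply the length of the period by the allowed unavailability. This is a minimal sketch assuming a 365-day year. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def downtime_budget_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in seconds for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return period_seconds * unavailability


for n in range(2, 6):
    budget = downtime_budget_seconds(n)
    minutes, seconds = divmod(budget, 60)
    print(f"{n} nines: {int(minutes)} min {seconds:.1f} s per year")

weekly = downtime_budget_seconds(5, SECONDS_PER_WEEK)
print(f"five nines per week: {weekly:.2f} s")
```

For five nines this comes to about 315 seconds a year, or roughly six seconds in any given week.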

What does downtime cost? It depends on what alternatives are available to your customers in the event of a single component failure.

How much is too much? That’s what the sysadmin is for.
