In a recent report in Information Week by analyst Charles Babcock, some of the glaring faults of cloud computing over the past few years take center stage.
As consumers, readers, clients, etc., we are constantly bombarded with the “miracle” that is the cloud. It’s fast, redundant, flexible… it seems like the perfect solution that will solve all of our problems and meet all of our challenges.
That is… until you read about an outage at an IaaS provider.
Some other headlines you might have seen in the past:
- Amazon Web Services suffered an outage in June 2012
- Amazon’s remirroring storm in April 2011
- Microsoft Azure cloud failed to recognize Feb. 29, causing an outage
- Rackspace lost a transformer in 2007
So why did all of this happen?
Babcock explains:
Cloud reliability doesn’t have a long, well-researched history because conventional business data centers have used different architectures. That architecture is often mockingly described as “silos,” but their siloed nature had the effect of isolating one system from another. Single systems “could blow up, but there was little risk the failure would cascade across the whole data center,” says Bias.
The cloud by its design must tolerate hardware failures, and that kicks the recovery issue “out of gold-plated hardware,” where it once resided, and up into the software management layer, Bias adds. The cloud’s software identifies failed hardware components, recovers from them, and routes around them. But as we saw with Azure and Amazon’s management software, it either didn’t recognize software blind spots (a failed VM means defective hardware) or failures (no Feb. 29 security certificates) and thus couldn’t cope with them. Instead, it multiplied them.
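To make the Feb. 29 blind spot concrete, here is a minimal Python sketch of one common way a leap-day date bug can arise: computing a one-year expiry date by bumping the year and leaving the month and day alone. The function names are hypothetical, and this is an illustration of the general class of bug, not a reconstruction of Azure’s actual certificate code.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Buggy approach: bump the year and keep month/day unchanged."""
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Fall back to Feb. 28 when the target year has no Feb. 29."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here, so clamping the day to 28 is enough.
        return start.replace(year=start.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
except ValueError as err:
    # Feb. 29, 2013 does not exist, so the naive calculation blows up.
    print("naive expiry calculation failed:", err)

print("safe expiry:", safe_one_year_later(leap_day))
```

The naive version works on every other day of the year, which is why a bug like this can sit unnoticed until a leap day arrives.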
Back in 2007, the Rackspace outage was initially caused by a vehicle crash which took down the voltage stepdown transformer that served Rackspace’s data center. But that wasn’t quite the end of the story.
The data center’s business continuity systems kicked in as they were supposed to. The transformer went down and the building’s UPS (uninterruptible power supply) went up, keeping servers running as emergency diesel generators came online. The building’s chillers, which cool outside air and pump it to the server racks, shut down temporarily, as they’re supposed to, then came back online as the generators revved up to provide electricity.
In other words… the data center was still running… Rackspace was still running.
The Real Failure
Firefighters were still attempting to extract the driver from his vehicle, which was buried in the transformer. They called the power company to shut down all power in that area. The power company did so, and the systems running on emergency diesel power were forced to shut down one by one, unable to keep running without the main source of power.
Rackspace went down.
Today, several years later, experts understand how supposedly “single points” of failure are in fact sequences of multiple events, sometimes with human interventions adding to the problem rather than solving it. They know that automation can speed problems as well as solve them.
Though we often look at the cloud as redundant and resilient, in reality it can be quite fragile. It takes only a few simultaneous points of failure for the network to come crashing down. Cloudscaling CTO Randy Bias, the designer of the CloudStack cloud implemented at Korea Telecom, rightly pointed out that this level of redundancy is nowhere near the level of redundancy built into something like an aircraft.
So we understand that the cloud isn’t invincible. We all know that it isn’t a perfect technology and that indeed it is still young and needs time to grow.
The problem is that the demands on cloud computing are constantly increasing. The challenge is that improvements in the cloud may not be keeping pace with those demands.
“Cloud data centers need to constantly get better, if for no other reason than the demands on them keep growing.”
It’s not only cloud computing that needs to improve, but end user planning for cloud. As reported earlier this year by Forbes in the wake of the Amazon outage, technology expert Krishnan Subramanian stressed the importance of “design(ing) for failure.” Such an approach places high value on a hybrid cloud approach, combining many cloud-based resources including data centers managed and operated by different providers.
It’s the old saying: “don’t put all your eggs in one basket.”
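As a rough illustration of “designing for failure,” the Python sketch below tries a list of providers in order and falls back to the next one when a call fails. The provider functions, their names, and the simulated outage are all hypothetical stand-ins; a real deployment would also need health checks, timeouts, and data replication, which are out of scope here.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def fetch_with_failover(providers: Sequence[Tuple[str, Callable[[], str]]]) -> str:
    """Return the first successful response, trying providers in order."""
    errors = []
    for name, fetch in providers:
        try:
            return fetch()
        except Exception as err:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {err}")
    raise AllProvidersFailed("; ".join(errors))


def provider_a() -> str:
    # Hypothetical provider that is currently suffering an outage.
    raise ConnectionError("region outage")


def provider_b() -> str:
    # Hypothetical second provider, run by a different company.
    return "served by provider B"


print(fetch_with_failover([("provider-a", provider_a), ("provider-b", provider_b)]))
```

The point of the design is that an outage at one provider degrades the service rather than taking it down, as long as the providers do not share the same failure.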
Life in the Clouds
The solution to moving forward? While it may not be what we want to hear, the solution is simply to accept that this is the reality and deal with it. This is the price of doing business in the cloud.
This is a constant cycle of living and learning from our mistakes. No matter how much engineers prepare for and design around the known fail points, the complexity involved makes scenarios that haven’t been planned for or even conceived of likely… perhaps even inevitable. Testing for each and every potential problem on an infinite scale, across countless servers, data centers, and networks, isn’t practical or feasible.
But just as we should expect continued problems in the cloud, we should also expect continued progress. Cloud systems are still young, so to expect to be free of problems isn’t realistic. Of course, as we develop new ways to integrate hardware systems together, new challenges arise. Some in the industry have become cynical about providing continuous availability, as they’ve seen their list of threats grow as systems become more complex.
However, we should keep a bit of perspective. Cloud vendors are already pretty good at this new discipline, Babcock notes. For example, Google rather consistently delivers on its 99.99% uptime claim across its properties, something far in excess of the capabilities of the average enterprise. He says that of course cloud computing providers need to continuously get better, but we should expect to see a new role, a new human skill, emerging. This role is analogous to the information security industry’s white hat hacker: while this hacker is the expert constantly spotting weaknesses and identifying failure points, the new cloud white hat will be constantly preoccupied with ensuring continuous availability.
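To put a figure like 99.99% in perspective, the short Python calculation below converts an availability percentage into the downtime it permits over a year. The function name is made up for illustration, and the year is assumed to be 365 days; at 99.99% the budget works out to a little under 53 minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra “nine” cuts the permitted downtime by a factor of ten, which is why a single multi-hour outage can consume a provider’s budget for years.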
Charles Babcock’s piece appears in the October 15, 2012 edition of Information Week Reports. It is titled “The Cloud’s Points of Failure are Showing.”