One of the scariest things about trusting your business to the cloud is the potential for failures you can’t easily control. One vendor’s technical difficulties could take out core systems, or your entire IT infrastructure. The mighty Amazon knows all about it.
I imagine Amazon customers were doing sailors proud amid the recent AWS outage. The popular cloud computing service suffered two failures over four days in late September 2015, and caused disruptions for some pretty high-profile clients in the process. IMDb, Netflix, and Reddit were among those who suffered at least some downtime when issues with Amazon’s DynamoDB database resulted in widespread failure.
This wasn’t the first time AWS experienced an outage. A previous glitch bled the company of a reported $1,100 per second back in 2013. While this certainly won’t be the last debacle, there is plenty for MSPs to take away from the latest AWS outage.
1. Downtime Happens
If the Amazon cloud can fall, then any system is capable of crumbling in a heartbeat. The occasional hiccups AWS has experienced over the years should serve as a reminder that whether it’s due to buggy code, intense bottlenecks or security breaches, no system is 100 percent safe from failure. Reliability is even more questionable in cloud environments due to the sheer nature of the beast. Understand the risks, strive for resilience, and diversify your IT infrastructure by not putting all your eggs (or systems) in one basket.
2. Stay Ready For the Worst
Organizations from the financial sector to the entertainment industry depend on AWS for IT resources. After all, it is run by the largest commercial cloud service provider. I’d be willing to bet that the least affected clients were the best-prepared clients. Disaster preparedness aims to shore up areas such as:
- Personnel training: Ensure staff members are clear on their roles and responsibilities during recovery efforts.
- Core systems: Make sure all IT systems meet the requirements of your disaster preparedness strategy.
- System components: Check to ensure all hardware and software components are working according to disaster recovery plans.
- Vulnerabilities: Check your infrastructure to uncover any weaknesses that might hinder recovery efforts.
- Continuity: Evaluate existing recovery strategies to better ensure that your systems are operational and that business continues following a disaster.
3. Down Doesn’t Always Mean Out
Netflix managed to weather the storm during the AWS failure by strengthening the resilience of its cloud space with tools like Chaos Monkey. A member of the Netflix Simian Army, Chaos Monkey purposely kills running instances at random, while its sibling Chaos Gorilla simulates the failure of an entire AWS availability zone. In the case of the last outage, experience with this suite of chaotic primates gave Netflix the savvy to redirect traffic away from the troubled part of the Amazon cloud to a region not affected by the problem. This strategic move proves that it’s possible to keep things rolling with little to no disruption in spite of an outage.
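To make the idea concrete, here is a minimal sketch of chaos-style failure injection paired with failover routing. It is a toy model, not Netflix's actual tooling: the `Region` class, `chaos_kill`, and `route` are hypothetical names invented for illustration, and the region names are just labels.

```python
import random


class Region:
    """Hypothetical stand-in for one cloud region hosting a service."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        # A downed region refuses every request.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{request} served by {self.name}"


def chaos_kill(regions, rng):
    """Failure injection: mark one randomly chosen region as down."""
    victim = rng.choice(regions)
    victim.healthy = False
    return victim


def route(request, regions):
    """Try regions in priority order, failing over when one is down."""
    for region in regions:
        try:
            return region.handle(request)
        except ConnectionError:
            continue  # move on to the next region
    raise RuntimeError("all regions are down")


# Illustrative labels only; any names would do.
regions = [Region("us-east-1"), Region("us-west-2"), Region("eu-west-1")]
victim = chaos_kill(regions, random.Random(0))
response = route("GET /home", regions)
print(f"{victim.name} was taken down; {response}")
```

Running a drill like this on a regular basis is the point: if `route` can't find a healthy region during a test, you would rather learn that before a real outage does the teaching.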
4. There’s No Passing the Buck
Amazon is a major IT service provider with extensive service level agreements that detail the company’s commitment to the customers using its platform. While a cloud provider shoulders part of the burden and has promises of its own to uphold, an MSP is responsible for servicing its own clients through rain, sleet, snow, or cloudy weather. If an outage cripples your core managed services, clients will not be calling Amazon for customer service (like that’s even possible) looking for answers. They’ll be looking to you!
5. It Is What It Is
Some might say the latest AWS episode is yet another knock on the cloud. That it’s another reason to steer clear of this smoking hot phenomenon.
But the reality is that outages are a fact of life in IT – cloud or no cloud. Contrary to what it’s often hyped as, the cloud is not some bulletproof solution to business problems. MSPs have to realize the importance of establishing the architectural foundation that was laid in physical data center environments many years ago before they even consider a move to any cloud infrastructure.
I wouldn’t necessarily say Amazon’s run of outages is an indictment of the cloud, but it is a wake-up call of sorts. The benefits of on-demand infrastructure scaling require users to make a serious compromise. And even if a cloud provider is on the hook for keeping the lights on, you are just as responsible as any party for keeping your own customers online. Class dismissed!