The Azure Failure Snowball

March 4

It must be tough to be a big business. They’ve got an awful lot to take care of, everything from patent infringement to product development, support, and company softball leagues. Yet it’s hard to feel bad for them when they do something really stupid.

Azure was declared the number one cloud provider in the world because of its speed, availability, and uptime, and because it had never registered an error. However, in an ironic twist, three days after receiving the award, the service crashed, cutting off thousands of users worldwide.

Leave it to clouds to ruin a sunny day.

Azure Failure Due to Human Error

But what would cause such a large outage? Surely it must have been a massive hardware failure? A cyber-attack? Aliens? Actually, just like the recent outage of cloud competitor Amazon Web Services, Windows Azure crashed because of simple human error.

A secure sockets layer (SSL) certificate expired, and this caused the disruption, according to the Windows Azure blog. The shutdown affected a number of Azure services that are dependent on storage. This meant that users of 52 different Microsoft services, everything from email clients to Xbox Live, were without service for several hours. Microsoft had to provide credits to customers in accordance with their SLA, to make up for the inconvenience.
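An expired certificate is the kind of little thing a scheduled check can catch ahead of time. Below is a minimal sketch of such a check using only Python's standard library. It is not how Microsoft monitors its certificates; the function names and the 30-day warning threshold are assumptions made up for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical threshold; pick whatever lead time suits you


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if expired)."""
    now = now or datetime.now(timezone.utc)
    # notAfter looks like "Jan  5 09:34:43 2018 GMT"
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field from the certificate a server presents.

    The default context verifies the certificate, so a certificate that
    has already expired makes the handshake raise an ssl.SSLError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after, now=None, warn_days=WARN_DAYS):
    """True when the certificate expires within warn_days."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run from a daily scheduled job, something like needs_renewal(fetch_not_after("example.com")) could trigger a reminder weeks before the expiry date instead of an outage on it.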

It’s not clear exactly how many users were affected, but the cost of crediting them will be an awful expense. It’s often more important to think about the little things that can go wrong, because one little thing can light a firecracker fuse of issues.

How a Large Company Failure Snowballs

Because of Microsoft’s one little oversight, thousands of users were without service for as long as twenty-four hours. The effect on Azure customers even trickled down to their customers. For example, StorageCraft was unable to send an important customer communication through our email service, which is hosted on Azure. Microsoft’s failure seeped all the way down to us, even though we don’t use Azure directly.

(Luckily for us, as a backup and disaster recovery company, we were ready to send the communication by alternate, dare I say “backup,” methods, and it went out as planned.)

Little things that can go wrong far outnumber the big ones, and losing track of which of them need attention can cost money. Not thinking ahead can really take a toll.

The takeaway here is: remember the little things, because they can cause big problems down the road.