Resilient Monkey Business in the Cloud: Netflix and the Mighty Simian Army  

July 22

IT service providers can learn a lot from Netflix, especially when it comes to maximizing their investment in cloud computing. Netflix is one of the most high-profile users of Amazon Web Services, which it manages with some of its very own in-house resources. Technically, those resources happen to be monkeys. Yeah. Eat your heart out, Planet of the Apes!

Netflix recently slapped the open source label on Security Monkey, the piece of technology that provides a safe haven for the video streaming platform and its users. Security Monkey is a potent tool that enables Netflix to monitor its environments across the Amazon cloud for violations or vulnerabilities that pose a threat to the service. In simple terms, it helps administrators fix security holes before attackers slip in. Aside from sealing up leaky instances, Security Monkey keeps tabs on SSL and DRM certificates to make sure they don’t expire unnoticed.
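To make the certificate-watching idea concrete, here is a minimal sketch in Python. It is not Security Monkey's actual code; the function name, the 30-day warning window, and the example hostnames and dates are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certs, now, warn_days=30):
    """Return the names of certificates that expire within warn_days of now.

    certs maps a (hypothetical) certificate name to its expiry datetime.
    """
    cutoff = now + timedelta(days=warn_days)
    return sorted(name for name, not_after in certs.items() if not_after <= cutoff)


# Hypothetical inventory: one certificate close to expiry, one with months to go.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = {
    "api.example.com": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "www.example.com": datetime(2024, 6, 1, tzinfo=timezone.utc),
}
print(expiring_soon(certs, now))
```

A real monitor would pull expiry dates from the certificates themselves and alert someone, but the core check is just this comparison against a cutoff date.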

Hey Hey It’s More Monkeys!

On a grander scale, Security Monkey is a member of the Simian Army, a suite of tools Netflix uses to manage its cloud-driven offering. From automating general maintenance processes to testing disaster recovery preparedness, this collection of tools helps the company keep its mission-critical video streaming service online, all the time. Let’s meet the individual members of the monkey family.

Chaos Monkey. In 2011, Netflix rolled out Chaos Monkey as the first member of the Simian Army. This unique tool tests how well a cloud deployment holds up by randomly taking down select instances. Administrators can configure the tool to run only during hours when a live person is on hand to deal with the fallout.
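The two ideas in that paragraph, a random pick and a staffed time window, can be sketched in a few lines of Python. This is an illustration only, not Chaos Monkey's real implementation; the function names, the weekday 9-to-15 window, and the instance IDs are assumptions, and nothing here actually touches AWS.

```python
import random
from datetime import datetime


def within_staffed_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour and end_hour (a hypothetical window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, now, rng=random):
    """Pick one instance at random, or None if nobody is around to respond."""
    if not instances or not within_staffed_hours(now):
        return None
    return rng.choice(instances)


# Hypothetical instance IDs for a single service group.
group = ["i-aaa", "i-bbb", "i-ccc"]
print(pick_victim(group, datetime(2024, 1, 1, 10, 0)))  # a Monday morning
print(pick_victim(group, datetime(2024, 1, 6, 10, 0)))  # a Saturday, so None
```

Passing the clock and the random source in as arguments keeps the sketch easy to test; a real tool would follow the pick with an actual termination call.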

Janitor Monkey. The cloud offers oodles of resources, but that doesn’t necessarily mean everyone is making optimal use of them. Janitor Monkey combs the cloud in search of resources that aren’t being used, then promptly deletes them to help service providers save money. This particular tool can be customized to sweep up messes by applying specific rules and exceptions.
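For a feel of how rules and exceptions might fit together, here is a rough Python sketch. It does not reflect Janitor Monkey's actual rule engine; the `Resource` class, the two rules, their thresholds, and the `keep` tag are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A hypothetical cloud resource record."""
    name: str
    kind: str
    days_unused: int
    tags: set = field(default_factory=set)


def stale_volume(resource):
    """Hypothetical rule: volumes untouched for 30 days or more."""
    return resource.kind == "volume" and resource.days_unused >= 30


def old_snapshot(resource):
    """Hypothetical rule: snapshots untouched for a year or more."""
    return resource.kind == "snapshot" and resource.days_unused >= 365


def cleanup_candidates(resources, rules, exempt_tag="keep"):
    """Names of resources matching any rule, skipping those tagged as exempt."""
    return [
        r.name
        for r in resources
        if exempt_tag not in r.tags and any(rule(r) for rule in rules)
    ]


inventory = [
    Resource("vol-1", "volume", 45),
    Resource("vol-2", "volume", 45, {"keep"}),
    Resource("snap-1", "snapshot", 400),
    Resource("vol-3", "volume", 2),
]
print(cleanup_candidates(inventory, [stale_volume, old_snapshot]))
```

Adding a new rule means writing one more small function and dropping it into the list, which is roughly the appeal of a rule-based design.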

Conformity Monkey. With Conformity Monkey, Netflix created a handy utility that scans for rogue instances and other cloud critters wreaking havoc in the AWS environment. By running suspected instances against a set of best practices, the tool is able to identify those troublemakers and notify administrators via email. Like Janitor Monkey, the conformity primate is a rule-based tool that lets users easily customize and add new rules.

Latency Monkey. If you can anticipate latency, you can quickly address it before network performance becomes an issue. Netflix uses Latency Monkey to simulate delays and even full instances of downtime to test the service’s ability to survive when things get rocky. Since the service is actually still running during the test, this tool is ideal for gauging the fault tolerance of individual environments without impacting the entire cloud deployment.
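Delay injection is easy to picture with a small Python decorator. This is only a sketch of the general technique, not how Latency Monkey is built; the decorator name, its parameters, and the `fetch_title` function are hypothetical, and it covers delays only, not simulated downtime.

```python
import random
import time
from functools import wraps


def with_injected_latency(delay_seconds, probability, rng=random, sleep=time.sleep):
    """Wrap a function so that some calls are delayed before they run."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so probability=1.0 delays every call.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_injected_latency(delay_seconds=0.05, probability=0.5)
def fetch_title(video_id):
    """Hypothetical service call that we want to test under slow conditions."""
    return f"title-for-{video_id}"


print(fetch_title(42))
```

The wrapped function still returns its normal result, just later, which mirrors the point above that the service keeps running while it is being tested.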

Doctor Monkey. Who says primates can’t perform surgery? Doctor Monkey checks on the health of each individual instance and removes those determined to be hazardous to the well-being of the system as a whole. Additionally, it monitors the health of CPU, memory, and other allocated resources.

10-18 Monkey. Netflix is a global sensation, so having a way to monitor the reliability of its streaming service in other parts of the world is vital. 10-18 Monkey aids the cause by checking for configuration issues and performance problems in instances that service customers across multiple countries. This tool helps ensure that Netflix runs like a champ in the U.S. and abroad.

Chaos Gorilla. As you’ve probably imagined, this one is kind of like Chaos Monkey on steroids. Maybe that’s a bad analogy, but rather than trigger artificial failures of single instances, this chaotic primate simulates the failure of an entire Availability Zone, a concept Amazon designed to prevent outages in one physical location from affecting customers in other areas. Chaos Gorilla helps you gauge your ability to bounce back in the event that disaster hits hard enough to disrupt the cloud you happen to be floating on.
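One question a zone-failure drill tries to answer is whether enough capacity is left once a whole zone disappears. Here is a back-of-the-envelope Python sketch of that check. It is not Chaos Gorilla itself; the function name, the zone labels, and the required instance count are assumptions for illustration.

```python
from collections import Counter


def survives_zone_loss(instance_zones, required):
    """For each zone, report whether losing it still leaves `required` instances.

    instance_zones is a list with one zone name per running instance.
    """
    per_zone = Counter(instance_zones)
    total = len(instance_zones)
    return {zone: total - count >= required for zone, count in per_zone.items()}


# Hypothetical fleet: three instances in one zone, one each in two others.
fleet = ["zone-a", "zone-a", "zone-a", "zone-b", "zone-c"]
print(survives_zone_loss(fleet, required=3))
```

In this made-up fleet, losing zone-a leaves only two instances, so a deployment that needs three would not ride out that particular failure.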

As I suspected in my search for source code, the entire Simian Army hasn’t been opened up just yet. This Gigaom article explains that additions such as Conformity Monkey and Chaos Gorilla are on the upcoming open source menu, while all-new monkeys are being engineered behind the scenes. Due to the popularity of the Amazon cloud, many companies have made it a priority to develop AWS-compatible offerings that plug right in. As a result, IT service providers can harness the mighty Simian Army (at least pieces of it, for now) to improve the performance, availability, and overall reliability of their deployments in AWS and other cloud atmospheres.

Photo Credit: Manfred Moitzi via Flickr