Google is a juggernaut. It’s tough to imagine that in a mere fourteen years Google rocketed from a humble search engine into a leader in information technology. We talk a lot about big data, data storage, data centers, data, data, data, and Google has some seriously big data. Millions rely on Gmail for home and business email or YouTube for pleasure every day, and it takes massive data centers with hundreds of thousands of servers to run Google search and its various services.
With so many things to plan for, data spread across the world, and tens of thousands of employees, what’s Google’s approach to disaster recovery? They attack themselves.
According to a recent wired.com article, Google employs a team of people called Site Reliability Engineers (SREs) whose main focus is to keep Google search and other services running. The team, which wears super-cool leather jackets with military-inspired patches, runs a simulated war on Google’s infrastructure that they call DiRT (disaster recovery testing). This “war” involves everything from causing leaks in water pipes to staging protests to attempting to steal disks from the servers—whatever it takes to bring down the infrastructure. The data center attacks aren’t real, but they are hard to distinguish from an actual event, even though the SRE team has a little fun by attributing each attack to a fictional event like a zombie, alien, or supernatural attack.
Each annual attack is headed by an engineer named Kripa Krishnan. Before the attack begins, Krishnan tells the SRE team not to fix anything and reminds them that the people on the job in the data center don’t realize the team is there. Once the attack begins, the team monitors Google incident managers and measures their response times and ability to handle the issues, a neat and fun way to test a DR plan.
Incident managers don’t realize these issues aren’t real and must handle them as though they are actually happening, sometimes even dealing with actual service failures. If the incident managers in charge of a particular site can’t stop the SRE team’s attack, however, the team can abort the attack before real users are affected. As Krishnan explains, they have “become braver in how much we’re willing to disrupt in order to make sure everything works.”
Krishnan explains that her role is “to come up with big tests that really expose weaknesses.” Through the information they gain from a fake attack, they know what is working and what needs improvement. Google realizes the importance of a disaster recovery plan, and they test theirs regularly in a realistic but fun way to expose any weaknesses and analyze the plan’s effectiveness in real-life scenarios.