Disaster Recovery Lessons from BYU’s Hardware Failure


September 28

It has been a rough start to the 2015 school season for BYU.

The football team lost starting quarterback Taysom Hill to a season-ending injury in the very first game of the season, then suffered a blowout loss to the Michigan Wolverines three games later. Then there was what some might call a fumbled disaster recovery effort by the campus IT office.

On September 3, 2015, a major hardware failure in one of its core IT systems resulted in widespread chaos across the BYU campus. A combination of outages and sluggish academic websites was the source of frustration for students, school faculty, and probably IT personnel tasked with restoring order.

“We’ve had outages and slowness on different websites affecting a significant portion of campus,” said Todd Hollingshead, a spokesman for the school.

Learning Suite, BYU’s web-based learning management system, was the core system in question. The untimely disruption had several students fearing that they wouldn’t be able to meet deadlines for homework and other school assignments. In addition to putting assignments in jeopardy, the hardware failure prevented a handful of students from punching in and out of campus job sites. Even teachers had difficulty updating assignments and posting grades in the system.

Disasters don’t come with a warning, but there are some signs that suggest BYU was ill-prepared for this one. The school’s system crash presents a good opportunity to revisit some crucial lessons in disaster recovery.

Get Your Priorities Straight

Access to some BYU systems was reportedly spotty during the disruption, creating a conundrum for students who had been monitoring the situation in hopes of a timely resolution. One student was even able to get into the school system, yet didn’t want to risk proceeding with a class quiz if the system would just crash again a few minutes later.

“I thought Learning Suite might freeze and the quiz might close for good without giving me a score,” said Preston Alder, a junior majoring (ironically) in Business Strategy.

Surely BYU had a backup plan. But the aforementioned stability issues may indicate that the institution didn’t necessarily have its priorities in order. A good disaster recovery plan spells out exactly what needs to be done in a time of crisis. From hardware to data, you must identify the resources that are necessary to keep your operations up and running, and which services should be restored first. This process of prioritizing will help IT determine how often backups need to be performed and in what order data should be restored, and make other critical decisions that streamline recovery efforts.
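To make the prioritizing step concrete, here is a minimal sketch in Python. It assumes each service is given a recovery time objective (RTO, the longest outage you can tolerate) and a recovery point objective (RPO, the most data loss you can tolerate, measured in hours). Those two terms are standard in disaster recovery planning but are not taken from BYU's situation, and the service names and numbers below are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    rto_hours: float  # assumed recovery time objective: longest tolerable outage
    rpo_hours: float  # assumed recovery point objective: most tolerable data loss


def restore_order(services):
    """Sort services so the one with the tightest RTO is restored first."""
    return sorted(services, key=lambda s: s.rto_hours)


def max_backup_interval_hours(service):
    """Backups must run at least this often to stay within the RPO."""
    return service.rpo_hours


# Hypothetical inventory; names and numbers are illustrative only.
inventory = [
    Service("campus-news-site", rto_hours=48, rpo_hours=24),
    Service("learning-management", rto_hours=2, rpo_hours=1),
    Service("timekeeping", rto_hours=8, rpo_hours=4),
]

for svc in restore_order(inventory):
    print(
        f"{svc.name}: restore within {svc.rto_hours}h, "
        f"back up at least every {max_backup_interval_hours(svc)}h"
    )
```

The point of the sketch is that once every service has explicit targets, both the restore order and the backup schedule fall out of the data rather than being decided under pressure.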

Think Fast


BYU’s systems were down for roughly a full day. This was a huge deal for the students who didn’t know if they would be able to meet assignment deadlines or whether professors would show them mercy if the downtime ran past their due dates. Imagine your business being unavailable for anything close to 24 hours, and you’ll understand why disaster recovery should strive to limit downtime to the absolute bare minimum.

Sure, IT should work diligently to get systems back online as soon as possible. However, it helps when they’re equipped with the right tools. Virtualization can optimize recovery in a couple of ways. VMware offers software that automates recovery and the migration of data from point A to point B, so little to no manual intervention is required. Virtual servers running on legacy hardware can also serve as recovery sites that temporarily act as your primary sites in the event of complete failure.

Learn From Disasters of the Past

The September hardware failure may have been a cold batch of deja vu for BYU professors and maybe even a few students. It was on Memorial Day in 2012 when a failed software upgrade wreaked havoc on the school’s main IT system. This failure was arguably an even bigger disruption because terabytes of data were lost in the process, most of which the IT team failed to recover from the backups generated for that very purpose. Several months and millions of dollars were lost before BYU made a full recovery.

A poorly designed backup plan turned out to be BYU’s biggest weakness when Memorial Day mayhem crippled its IT infrastructure. While it’s fairly easy to criticize the handling of the latest recovery process, the fact that all systems were fully operational the following day shows that the IT team took something from the previous disaster. Whether it’s having backup equipment ready to stand in for failed hardware or restructuring the crisis crew, you have to take away something that helps you improve disaster preparedness in the future.

Bulletproof Your Backup Plan

The recent bout of uncertainty on the BYU campus just goes to show what can happen when disaster strikes. All parties had their patience tested, but it was merely a minor hiccup compared to some of the massive system crashes in IT history. Outages can become extensive to the point where even your backup sites located two cities over are out of commission. Your backup plan can never be too bulletproof, and when it comes to disaster recovery, the cloud can be a saviour for your business.

The most compelling advantage of cloud computing for disaster recovery can be seen in how it gives small and medium-sized businesses the flexibility larger companies have traditionally enjoyed. Many bigger firms have the benefit of secondary facilities they can use as backup data centers. Thanks to the cloud, smaller companies have the luxury to fail over entire systems to a remote site when disaster strikes. Just make sure the vendor offers a geographically dispersed cloud so your backup sites are less likely to be crippled by the same disaster.
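The geographic-dispersion rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Site type, the region labels, and the site names are all hypothetical, and a real cloud vendor would expose health and location through its own tooling rather than a hand-built list like this one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    name: str
    region: str    # hypothetical region label, e.g. "us-west"
    healthy: bool  # assumed to come from your own monitoring


def pick_failover_site(primary: Site, candidates: List[Site]) -> Optional[Site]:
    """Return the first healthy candidate outside the primary's region, if any."""
    for site in candidates:
        if site.healthy and site.region != primary.region:
            return site
    return None


# Hypothetical sites; names and regions are illustrative only.
primary = Site("main-dc", region="us-west", healthy=False)
candidates = [
    Site("backup-a", region="us-west", healthy=True),   # same region: skipped
    Site("backup-b", region="us-east", healthy=False),  # unhealthy: skipped
    Site("backup-c", region="eu-west", healthy=True),
]

target = pick_failover_site(primary, candidates)
print(target.name if target else "no eligible failover site")
```

Skipping healthy sites in the primary's own region is deliberate here: a site that looks fine now may be hit by the same regional disaster moments later.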

Account for the Human Element

When disaster strikes: Assessing workforce readiness around the world
Infographic by Mercer Insights

The human element doesn’t automatically mean human error, the context we often use it in. Nor am I suggesting that someone on the BYU IT staff goofed up. Maybe a couple of key employees take it upon themselves to rescue an elderly couple from the flood out in the parking lot. Or maybe the IT manager has to take off to make sure their kids are safe and sound while the local area is in a state of emergency. The point is that personal emergencies have a way of undermining even the most seemingly fail-proof strategy. You can’t exactly anticipate any one instance, but you can plan for the unexpected by accounting for the human element in your disaster recovery plan.

According to Mercer, roughly 50 percent of companies sort of followed their disaster preparedness plan to the letter. Nearly 20 percent didn’t follow the plan at all. Your ability to stay ready may determine whether your disaster recovery efforts resemble the BYU that gave students a minor scare, or the BYU that directly cost itself a small fortune.
