The Birth of HeadStart Restore and VirtualBoot

November 6

StorageCraft’s core philosophy is to do everything possible to minimize the information-access downtime window imposed by a traditional disaster-recovery operation. Ideally, we don’t want downtime at all. In the beginning, although we had some great technologies, not every piece of the puzzle was in place. We were missing a few of the tools needed to eliminate the kind of downtime that can cause companies to lose tons of money or even close their doors for good.

Let’s take a trip into the past and have a look at exactly what led us to create the features that made it possible for our partners to eliminate as much downtime as possible, and to walk away from the disasters that left other businesses utterly crippled.

First, it’s worth noting that at StorageCraft we’ve always used our own software on our production servers (eating our own dog food, as the expression goes). This has made us more aware of, and more responsive to, issues and feature deficiencies. Many years ago, we discovered that even though our software worked great for backup and restore, the restore operation took a very long time when terabyte-sized volumes were involved (many hours, or even days with massive volumes). It’s a simple matter of device bandwidth. It takes a long time to move trillions of bytes from device A to device B; that’s just the way it is.
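To put rough numbers on that bandwidth limit, here is a minimal back-of-the-envelope sketch. The volume size and the sustained throughput are made-up figures chosen for illustration, not measurements of any StorageCraft product, and `restore_hours` is a hypothetical helper name.

```python
def restore_hours(volume_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Hours needed to copy volume_bytes at a constant sustained throughput."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return volume_bytes / throughput_bytes_per_sec / 3600


TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# Assumed figures: a 10 TB volume restored at a sustained 100 MB/s.
hours = restore_hours(10 * TB, 100 * MB)
print(f"{hours:.1f} hours")  # prints "27.8 hours"
```

Even with a steady 100 MB/s, the copy alone takes more than a day, before counting any setup or verification time.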

Clearly, we knew we could do better. This downtime window was begging for a solution, especially since around that time Gartner had come out with a report about the costs of downtime that really cast a spotlight on the issue. What bothered me most about Gartner’s report was the high percentage of businesses that would be completely out of business within a year or so after experiencing a significant enough downtime window. It was striking to learn that even if a business’s restore worked perfectly, the downtime window itself could be almost as devastating.

At that time we already had the ability to mount volumes, which gave our users near-instant access to the file contents of their backups. Of course, the files alone were only useful if the applications using those files were available, and those applications often required their original settings and the operating system. We needed new technologies to make the entire data ecosystem (data, apps, settings, and OS) available in the absolute minimum amount of time. Realizations like this help us understand problems, and once we understand a problem, we’ve got a great catalyst for innovation. The result was our patented staged-restore technology called HeadStart Restore (HSR), as well as our VirtualBoot technology.

VirtualBoot was introduced in the initial 4.0 release of ShadowProtect as part of the installed ShadowProtect product (not in the Recovery Environment). HeadStart Restore (HSR) was released around the same time as a feature of ImageManager, though the HSR functionality is also available within the booted Recovery Environment.

VirtualBoot is part of the installed ShadowProtect product because WinPE boot sessions are limited in duration to 24 hours by Microsoft (or 3 days, IIRC, for the latest flavor). Our users often use VirtualBoot for failover scenarios so they needed to be able to use the functionality hosted from a platform that wouldn’t vanish from under them every day or two.

Note that mounting and VirtualBoot features of ShadowProtect are fully operational even if the product itself is an expired trial. This is intentional, and here’s a common use case that illustrates why.

Let’s say the customer has an Exchange server with 10TB of data, and ShadowProtect is installed and configured to back up all volumes every 15 minutes. The customer then experiences a disaster where the Exchange server hardware completely fails. Rather than wait for the 10TB restore to a replacement machine, the customer simply uses any available computer as a host for a VM. They install ShadowProtect in trial mode onto that machine, point the ShadowProtect VirtualBoot tool at the latest backup of their Exchange server, and tell VirtualBoot to boot that most recent backup.

After three minutes or so, the VM has booted and within the VM ShadowProtect is still installed (because the VM has everything in it that was in the original Exchange server – all data, apps, settings, and the OS), and within the VM ShadowProtect continues to generate fast, low-impact incrementals, protecting any new emails managed by the Exchange server in the VM and extending the existing backup image file chain with new .spi incremental images.

Customers may not want their Exchange server to be permanently hosted in this manner (although performance is very good thanks to the image data being compressed – it can even exceed the performance of the original machine if the image chain is short enough, which ImageManager helps to ensure), and we provide them with all the tools they need to migrate out of the VM to any other machine, again with minimal downtime. This is where HSR comes in.

HSR can be used to begin restoring a chain of backup image files while new .spi incrementals are still being added to the same chain. So let’s say that the original Exchange server, which died, was a Dell. The customer grabs an IBM machine, installs ShadowProtect on it, runs VirtualBoot, and gets their Exchange server back up quickly (no need to move terabytes of data from A to B for the VirtualBoot). Finally, the customer buys an HP server as the permanent home for their Exchange server. They keep the VM online and functional as their Exchange server, and simultaneously they start the staged HSR restore of the backup image chain used by the VM, staging a restore to the HP.

Eventually, the HSR staged restore will catch up to the point where it is applying the very latest incremental .spi files being generated by the VM. At that point, the VM can be taken offline and the HP can be brought online. This means a few minutes of downtime while the HP boots, but the key is that this event can happen at a planned time, such as Sunday at 2:00am. So with the combination of VirtualBoot and HSR, you have the ability to recover from a full disaster with a few minutes (boot time for the VM) of unplanned downtime (vs. hours or days), and a few more minutes (boot time of the final machine, the HP in the above example) of planned downtime to move to the final recovery host.
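The catch-up behavior can be reasoned about with a simple rate model. This is only a sketch under stated assumptions and says nothing about how HSR is implemented internally: it assumes the staged restore writes at a constant rate, the VM produces new incremental data at a constant (lower) rate, and all figures are invented. `catch_up_hours` is a hypothetical helper name.

```python
def catch_up_hours(backlog_bytes: float, restore_rate: float, change_rate: float) -> float:
    """Hours until a staged restore has applied everything in the chain.

    backlog_bytes: data already in the chain when the staged restore starts
    restore_rate:  bytes/sec the restore can write to the new machine
    change_rate:   bytes/sec of new incremental data produced by the running VM
    """
    if restore_rate <= change_rate:
        # The backlog never shrinks if new data arrives as fast as it is applied.
        raise ValueError("restore rate must exceed change rate to catch up")
    return backlog_bytes / (restore_rate - change_rate) / 3600


TB = 10**12
MB = 10**6

# Assumed figures: 10 TB backlog, 100 MB/s restore, 1 MB/s of new changes.
hours = catch_up_hours(10 * TB, 100 * MB, 1 * MB)
print(f"{hours:.1f} hours")  # prints "28.1 hours"
```

In this model the long copy still takes about as long as a conventional restore, but it runs in the background while the VM serves users, so only the final cutover counts as downtime.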

In practice, this combination of VirtualBoot and HSR has proven to be very effective. To our great benefit, we actually needed it ourselves after several of our production servers experienced a group failure. None of our personnel (aside from our IT department) even knew about the event because our tools were used to mitigate the downtime. We’re lucky we had the tools in place when that failure occurred; otherwise StorageCraft might not exist today.

Now, there was still a large missing part of the minimal downtime window story. We had solved the downtime window for individual machines, but we hadn’t solved the downtime window for an entire site failure. If an entire server room burned down we didn’t have a simple and easy story to bring it all back in a couple of minutes. This is what we have solved with the release of StorageCraft Cloud Services.

I feel it’s the most comprehensive disaster recovery solution, particularly when it comes to businesses and “actually working” functionality. Customers can configure ImageManager to replicate their backup image files offsite, either to SFTP servers, to WAN-optimized ShadowStream servers, or to StorageCraft Cloud Services. When customers replicate their backups to our cloud, they can easily log on to the cloud website and mount any of their backups for instant file-level recovery.

More exciting, though, is the ability for them to spin up any of their backups as a VM within the cloud. They can even tie together multiple booted image-based VMs with VPN and their own subnets, perfectly reproducing their original network room. The story is quite compelling.

The customer’s site can fail entirely, and within a couple of minutes all of their machines can be available again using a simple web interface. We even provide the tools necessary for them to fail back to their physical site when they’ve rebuilt it. Taking the protection of their data to the next level, we also offer the option for our customers to mirror their cloud data, so that even if a regional disaster destroyed the facilities of customers near the first data center, their data would still be safe and instantly usable from the other side of the continent. It would take a truly continent-wide disaster to cause a significant information-access downtime window. I suspect that if that occurs, they’ll have other things to worry about than checking their email.

That was kind of a long rant. This just happens to be one of my favorite topics. People often neglect to consider how important that downtime window is, particularly for businesses, and I feel we have stellar solutions for this.

Image Credit: Umbris via Wikimedia