A few weeks ago, Amazon Web Services, IBM and Rackspace rebooted their cloud infrastructure to patch a known vulnerability in the Xen hypervisor. I began seeing angry customers taking to Twitter to express their frustrations with the reboot, although each cloud provider claimed the reboot caused minimal downtime for most customers.
IT managers and cloud providers find themselves in a catch-22. They want to give customers ample notice that maintenance will result in downtime. But when the vulnerability is serious, IT wants to apply the patch ASAP and avoid telegraphing its intentions to those who might exploit the vulnerability.
Applying security patches and rebooting afterward is something with which we are all familiar. This week I decided to look at what we can learn from these large cloud reboots, as well as determine what best practices exist for those of us responsible for smaller, but no less critical, server farms.
I spoke with a friend of mine who manages about 10 Microsoft Exchange Servers for Microsoft. He said that each server takes about ten minutes to be fully operational after a reboot. And when you’re handling corporate email, even that small amount of downtime can cause problems. Let’s take a look at a number of best practices when it comes to rebooting business-critical servers.
- Communicate Your Plan – Even if you’ve planned the reboot for the weekend, tell people about it. You don’t want employees calling you to complain they can’t reach email on a Saturday evening when you’re in the middle of a reboot. Communicate early and often.
- Automate What You Can – Know which servers you can reboot by script and which you can’t. I found that many IT managers will script the reboot of File/NAS servers and web servers, but not the reboot of AD, Exchange or SQL Server, due mainly to their complexity and network dependencies.
- Stagger Reboots – My friend doesn’t bring all ten Exchange servers down at the same time. Instead he staggers the reboots so that not everyone’s mail is down at the same time. This can also help to alleviate pressure on the internal power grid.
- Manage Expectations – Applying patches can be tricky, especially on servers running a multitude of processes. Anticipate that problems will arise and plan accordingly. It’s better to under-promise and over-deliver when communicating an ETA to users.
- Implement a Maintenance Schedule – The actual schedule will depend on the individual business, but Saturday or Sunday evenings tend to be popular times for regular server maintenance. Having a schedule keeps patches from piling up and potentially causing unforeseen issues. It also trains your users to expect some downtime, to the point where it becomes routine. I know every Sunday evening that my website might be inaccessible for a few minutes while Bluehost applies patches to their webservers. I appreciate them taking steps to ensure my site is as secure as it can be, and that will occasionally require downtime while patches are installed.
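To make the staggering idea concrete, here is a minimal Python sketch that only plans the reboot order; it does not reboot anything. The host names, the batch size of two, the 15-minute slot per batch (the ten-minute boot time mentioned above plus a buffer) and the Saturday 22:00 window are all assumptions chosen for illustration, not anyone's actual procedure.

```python
from datetime import datetime, timedelta


def plan_staggered_reboots(servers, window_start, batch_size=2, minutes_per_batch=15):
    """Split servers into batches and give each batch its own start time.

    Returns a list of (start_time, [server, ...]) tuples, one per batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        # Each batch starts one slot after the previous one.
        offset = timedelta(minutes=minutes_per_batch * (i // batch_size))
        plan.append((window_start + offset, batch))
    return plan


# Hypothetical host names for the ten mail servers.
servers = [f"exch{n:02d}" for n in range(1, 11)]
# Assumed maintenance window: a Saturday evening at 22:00.
window_start = datetime(2024, 1, 6, 22, 0)

plan = plan_staggered_reboots(servers, window_start)
for start, batch in plan:
    print(start.strftime("%a %H:%M"), ", ".join(batch))
```

A plan like this can also double as the communication piece: the same list of times and servers is what you would send to users ahead of the maintenance window, ideally with a little extra padding on the ETA.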
We’ve all heard the stories of the Novell or NT server that was walled off from IT during a remodel, yet continued to run for years without any oversight. Such stories make for memorable water-cooler chatter, but they don’t accurately reflect the care business-critical servers require today.
Server maintenance is part of managing a modern network infrastructure. We certainly haven’t seen the last of the large cloud provider reboots. Urgent security patches will cause service disruptions. That’s a given. As we move more of our data to the cloud, downtime is inevitable. But the accompanying anxiety can be minimized with proper planning and communication.
When in doubt, communicate. I know it sounds simple, but communication among IT, management, and staff really does seem to be the key to managing a successful, less stressful server reboot.
Photo credit: Tailgate365