Blood, Sweat, and Tears: The Tale of One Server Disaster

January 31

True stories from the front lines of network administration can be hair-raising. Here’s one we can live through via the real-time tweets of a harried system admin:


7:30pm: One of the servers is down. Heading to office to investigate. Taking bus since scooter has dead battery.

7:41pm: Just back from a week-long tech conference, only to find out 1 of the servers went out this afternoon.

8:15pm: Crud. The Friday backup tapes weren’t put in correctly as shown on the post-it notes next to tape bays.

8:17pm: Crud. Big problem for me.

8:44pm: Crud.  Getting big headache.  Haven’t had anything to eat. And 1 of our servers is down.

8:53pm: Crud. It’s gonna be a long night if I don’t figure out what’s w/the server. Time to work on a resume, too.

8:56pm: Yeah.  Might as well get something to eat. Not thinking well on empty stomach.

9:32pm: Now that I’m able to think a bit, I’ll try system recovery cd on the node.

10:24pm: Dang. I failed setting up redundancy. Doing a disaster recovery. I should start working on a resume.

10:48pm: Data on server not fine. Have to restore from backup. Actually testing out disaster recovery. Sorta.


2:17am: Spent last few hours w/vendor support troubleshooting server hardware. One of the memory modules was bad. Getting new memory in 4 hours.

4:30am: Still waiting for memory to be delivered. In 3 hours?

5:41am: Oh. It’s Monday morning.

6:57am: *Cringe* Still no part yet. Users coming in to office soon.

7:29am: So it’s DHL that’s delayed the supposed 4-hour delivery of the part we’ve been waiting for since 2AM.

8:17am: Ugh. Did the memory replacement fry the motherboard? The replacement memory looks refurbished. Yellow sticker on bag dated 6/29/11.

8:30am: Oh great. CEO and CFO asking for status.

8:37am: Now a tech to come over w/replacement system board & memory w/in the hour.

9:30am: Ah yes. Awake for over 24 hours now.

2:35pm: A full replacement of the motherboard on the node.  Now waiting to receive new licensing for that node.

4:45pm: Crashed server up & running & licensed.

5:40pm: Left work tired and sleepy.

Ouch! In this case, the admin survived a major disaster relatively unscathed, but at the price of a marathon overnight struggle. Now imagine instead that, using a ShadowProtect backup image of the server, the admin had started a StorageCraft HeadStart Restore job using ImageManager.


7:30pm: One of the servers is down. Heading to office to investigate.

8:00pm: Server is definitely down w/HW issue. Happy I’d run a HeadStart Restore VM for this server. Launched the HSR failover server.


8:37am: At the office refreshed. Tech coming over with new memory & motherboard for dead server.

4:45pm: Crashed server repaired and synced up with my HeadStart Restore VM. Server up & running. Reported to CEO successful failover and restoration with no downtime. Promotion likely.

Going through an all-nighter like this admin did, fighting the tech and hunting for both the problem and a solution, is a painful reality. Knowing the axe may fall when the boss comes in and finds the system down, while you have no clear idea of when, or even if, the system will be back up, is an experience that should be “so last year.” A better contemporary strategy is to:

  1. Leverage a backup image to pre-stage a VM using a StorageCraft HeadStart Restore job in case the server goes down.
  2. Use an automated process to keep the pre-staged VM current with the latest backup image.
  3. Restore to repaired hardware, new hardware (similar or dissimilar), or virtual environments using the current backup image and StorageCraft Hardware Independent Restore technology.
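The automation in step 2 boils down to a simple loop: each time ImageManager produces a new incremental image, apply it to the standby VM in sequence order. As a minimal illustrative sketch, here is the sequencing logic only, in Python, using a hypothetical file-naming scheme (`SERVER-i0002.spi` for incrementals); real ShadowProtect chain names and the actual apply step are handled by StorageCraft's own tools and will differ.

```python
import re

# Hypothetical pattern for incremental image files, e.g. "SERVER-i0002.spi".
INCREMENTAL_PATTERN = re.compile(r"-i(\d+)\.spi$")

def pending_incrementals(files, last_applied):
    """Return the incremental images not yet applied to the standby VM,
    in sequence order, given the sequence number of the last one applied."""
    incrementals = []
    for name in files:
        match = INCREMENTAL_PATTERN.search(name)
        if match:
            incrementals.append((int(match.group(1)), name))
    incrementals.sort()  # apply strictly in chain order
    return [name for seq, name in incrementals if seq > last_applied]

# Example: full image plus three incrementals, one already applied.
backup_dir = [
    "SERVER-full.spf",
    "SERVER-i0001.spi",
    "SERVER-i0003.spi",
    "SERVER-i0002.spi",
]
print(pending_incrementals(backup_dir, last_applied=1))
# → ['SERVER-i0002.spi', 'SERVER-i0003.spi']
```

In a real deployment this selection logic would run on a schedule, with each returned image handed to the HeadStart Restore job so the standby VM stays only one short increment behind production.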