True stories from the front lines of network administration can be hair-raising. Here’s one we can live through via the real-time tweets of a harried system admin:
7:30pm: One of the servers is down. Heading to office to investigate. Taking bus since scooter has dead battery.
7:41pm: Coming back from a week-long tech conference only to find out 1 of the servers went down this afternoon.
8:15pm: Crud. The Friday backup tapes weren’t put in correctly as shown on the post-it notes next to tape bays.
8:17pm: Crud. Big problem for me.
8:44pm: Crud. Getting big headache. Haven’t had anything to eat. And 1 of our servers is down.
8:53pm: Crud. It’s gonna be a long night if I don’t figure out what’s w/the server. Time to work on a resume, too.
8:56pm: Yeah. Might as well get something to eat. Not thinking well on empty stomach.
9:32pm: Now that I’m able to think a bit, I’ll try the system recovery CD on the node.
10:24pm: Dang. I failed at setting up redundancy. Doing a disaster recovery. I should start working on that resume.
10:48pm: Data on server not fine. Have to restore from backup. Actually testing out disaster recovery. Sorta.
2:17am: Spent last few hours w/vendor support troubleshooting server hardware. One of the memory modules was bad. Getting new memory in 4 hours.
4:30am: Still waiting for memory to be delivered. In 3 hours?
5:41am: Oh. It’s Monday morning.
6:57am: *Cringe* Still no part yet. Users coming in to office soon.
7:29am: So it’s DHL that’s delayed the supposed 4-hour delivery of the part we’ve been waiting for since 2AM.
8:17am: Ugh. Did the memory replacement fry the motherboard? The replacement memory looks refurbished. Yellow sticker on bag dated 6/29/11.
8:30am: Oh great. CEO and CFO asking for status.
8:37am: Now a tech’s coming over w/replacement system board & memory w/in the hour.
9:30am: Ah yes. Awake for over 24 hours now.
2:35pm: Did a full replacement of the motherboard on the node. Now waiting to receive new licensing for that node.
4:45pm: Crashed server up & running & licensed.
5:40pm: Left work tired and sleepy.
Ouch! In this case, the admin survived a major disaster relatively unscathed, at the price of a marathon overnight struggle. Now imagine instead that the admin had used a ShadowProtect backup image of the server to run a StorageCraft HeadStart Restore job in ImageManager:
7:30pm: One of the servers is down. Heading to office to investigate.
8:00pm: Server is definitely down w/HW issue. Happy I’d run a HeadStart Restore VM for this server. Launched the HSR failover server.
8:37am: At the office refreshed. Tech coming over with new memory & motherboard for dead server.
4:45pm: Crashed server repaired and synced up with my HeadStart Restore VM. Server up & running. Reported to CEO successful failover and restoration with no downtime. Promotion likely.
Pulling an all-nighter like this admin did, fighting the technology while hunting for both the problem and a solution, is a painful reality. Knowing the axe may fall when the boss walks in to find the system down, with no clear idea of when or whether it will be back up, is an experience that should be “so last year.” A better contemporary strategy is to:
- Leverage a backup image to pre-stage a VM with a StorageCraft HeadStart Restore job in case the server goes down.
- Use an automated process to keep the standby VM current with the latest backup images.
- Restore to repaired hardware, new hardware (similar or dissimilar), or virtual environments using the current backup image and StorageCraft Hardware Independent Restore technology.