So many variables can lead to data corruption. The perennial hard drive failures. Earthquakes, hurricanes, and other “Acts of God.” Human error, which often manifests itself in the spread of viruses and other malware. As an IT professional, you’re familiar with all of these potential hazards and have set up firewalls, orchestrated various levels of redundancies, and deployed other mechanisms to curtail these problems as best you can.
Occasionally though, you find yourself faced with corrupted files with no obvious cause. Perhaps your CFO can’t open an Excel spreadsheet, and you can’t access most (if not all) of its backup copies. Or your DBA can’t start a SQL database.
Software-generated data corruption is especially frustrating because it happens under myriad and diffuse conditions, and worse, typically without triggering any alert. Back in 2007, storage guru Robin Harris wrote that “Silent data corruption is common,” and listed several common software bugs that can lead to data corruption:
- New code that fixes a problem and accidentally breaks old code
- Putting the right data in the wrong place.
- Phantom writes that are reported as written but, oops!, aren’t.
- Cache management bugs that munge data, or return correct data to the wrong place.
- … Less common, but sometimes the on-disk ECC miscorrects the data. ECC is software, right? How do you know it always works correctly? You don’t.
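Because these bugs are silent, the only reliable way to notice them is to check your data yourself. A common approach is to record a cryptographic checksum for each file when it is known to be good, then periodically re-verify. The sketch below is a minimal, hypothetical illustration of that idea (the function names and the JSON manifest format are my own, not from any particular product):

```python
import hashlib
import json
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def record_checksums(paths, manifest: Path) -> None:
    """Write a manifest mapping each file to its current checksum."""
    manifest.write_text(json.dumps({str(p): checksum(Path(p)) for p in paths}))

def find_corrupted(manifest: Path):
    """Return files whose current checksum no longer matches the manifest."""
    recorded = json.loads(manifest.read_text())
    return [p for p, digest in recorded.items() if checksum(Path(p)) != digest]
```

Run `record_checksums` right after a verified backup, and `find_corrupted` on a schedule; any mismatch turns a silent corruption into an explicit alert. Note this only detects bit-level changes, so it won't distinguish a legitimate edit from corruption on files that change routinely.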
And 2007-era IT environments are child’s play compared to today’s hybrid IT infrastructures, which typically are distributed over multiple locations and are composed of:
- Legacy hardware
- Virtualized infrastructure
- Hosted applications and services
- Public and private clouds (also BYOC, short for “Bring Your Own Cloud,” if you consider services like Dropbox, Box, and SugarSync, among others)
- BYOD (Bring Your Own Device)
Minimizing the Problem
In a 2009 research article titled Toward Exascale Resilience, Franck Cappello, Co-Director of the INRIA-Illinois Joint Laboratory on PetaScale Computing, explains why software-caused data corruption is especially onerous:
We already mentioned the lack of coordination between software layers with regards to errors and fault management. Currently, when a software layer or component detects a fault it does not inform the other parts of the software running on the system in a consistent manner. As a consequence, fault-handling actions taken by this software component are hidden to the rest of the system [emphasis mine]. …In an ideal wor[l]d, if a software component detects a potential error, then the information should propagate to other components that may be affected by the error or that control resources that may be responsible for the error.
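Cappello's "ideal world," where a component that detects a fault propagates that information to every other component that shares the affected resource, amounts to a publish/subscribe channel for fault events. A minimal sketch of that pattern (the `FaultBus` class and its method names are hypothetical, for illustration only):

```python
from collections import defaultdict
from typing import Callable

class FaultBus:
    """Minimal publish/subscribe channel for fault notifications.

    When one component detects a fault, it publishes the event here so
    that other components using the affected resource can react, instead
    of the fault being handled silently and hidden from the rest of the
    system.
    """

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, resource: str, handler: Callable[[dict], None]) -> None:
        """Register a handler to be called on faults affecting `resource`."""
        self._subscribers[resource].append(handler)

    def report(self, resource: str, detail: str) -> None:
        """Notify every subscriber of `resource` about a detected fault."""
        event = {"resource": resource, "detail": detail}
        for handler in self._subscribers[resource]:
            handler(event)
```

For example, a storage driver that detects a checksum mismatch on `disk0` would call `report("disk0", ...)`, and a cache layer subscribed to `disk0` could invalidate its entries rather than keep serving possibly bad data.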
This degree of complexity makes avoiding data corruption almost impossible, even if you could stamp out every other hardware- and human-based cause. Given the enormity of this problem, you may be tempted to run out of your facility screaming at the top of your lungs. Assuming that isn’t an option, here are a couple of things you can do to keep such corruption from hobbling your business:
1. Make incremental backups of your data, preferably by taking regular (say, every 15 minutes or so) “snapshots.” Not only can these images help you find the most recent uncorrupted copy of your CFO’s Excel file, they can also help you pinpoint when the corruption took place. Moreover, these backups can help you identify patterns or actions that may be leading to this type of corruption so that you can avoid it in the future.
2. Deploy a disaster recovery solution, such as StorageCraft’s ShadowProtect, which can get you up and running even if you’ve lost access to all of your mission-critical data. This case study explains how StorageCraft partner Numa Networks bailed out an important client and saved $800,000 worth of beer with ShadowProtect’s help.
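The snapshot search described in step 1 is simple to reason about: walk the snapshots newest-first, testing each one for validity (for example, that the spreadsheet opens, or that its checksum matches a stored digest). The first valid snapshot you hit is your restore point, and the oldest invalid one just after it brackets when the corruption occurred. A minimal sketch, with a hypothetical `latest_valid_snapshot` helper and caller-supplied validity check:

```python
def latest_valid_snapshot(snapshots, is_valid):
    """Find the newest usable snapshot and bracket the corruption time.

    snapshots: list of (timestamp, path) pairs, in any order.
    is_valid:  callable returning True if a snapshot's contents are usable.

    Returns ((timestamp, path), first_bad_timestamp), where the second
    value is the timestamp of the earliest corrupted snapshot found after
    the good one, or None if nothing newer was corrupted.
    """
    ordered = sorted(snapshots, key=lambda s: s[0], reverse=True)
    first_bad = None
    for ts, path in ordered:
        if is_valid(path):
            return (ts, path), first_bad
        first_bad = ts  # remember the oldest invalid snapshot seen so far
    return None, first_bad  # no valid snapshot exists
```

The corruption must have happened between the returned snapshot's timestamp and `first_bad`, which is exactly the window you'd then inspect (deployments, patches, jobs that ran) to find the cause.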