Dupe, Dedupe

November 6

No, this is not an article about Miley Cyrus’ latest song.  This is an article about data deduplication, often referred to as “dedupe”.  The intent of this article is to briefly discuss what data deduplication is and how it might be employed in your current backup and disaster recovery (BDR) plan.

Data deduplication is a specialized data compression technique.  In its simplest form, the deduplication process compares unique byte patterns in chunks of data intended for storage with an internal index of data already stored.  Whenever a match occurs the redundant chunk of data is replaced with a small reference that points to the previously stored data.
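The chunk-and-index process described above can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes fixed-size chunks and an in-memory dictionary keyed by SHA-256 digests, where real systems use content-defined chunking and persistent indexes.

```python
import hashlib

def dedupe_store(data: bytes, chunk_size: int, index: dict) -> list:
    """Split data into fixed-size chunks and store only chunks not already
    present in the index. Returns a "recipe": a list of chunk hashes that
    references stored data instead of repeating it."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:      # new byte pattern: store the chunk once
            index[digest] = chunk
        recipe.append(digest)        # duplicates cost only a small reference
    return recipe

def restore(recipe: list, index: dict) -> bytes:
    """Rebuild the original data from the recipe and the chunk index."""
    return b"".join(index[d] for d in recipe)

index = {}
original = b"ABCD" * 1000            # highly redundant data: 4000 bytes
recipe = dedupe_store(original, 16, index)
unique_bytes = sum(len(c) for c in index.values())
assert restore(recipe, index) == original
print(len(original), unique_bytes)   # prints "4000 16"
```

Because every 16-byte chunk of this input is identical, the index holds a single 16-byte chunk; the 4000-byte input is represented by that chunk plus 250 small references.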

Another way to classify data deduplication is by where it occurs.  A deduplication process that runs close to where the data is created is referred to as “source deduplication”, whereas a similar process running close to where the data is stored is “target deduplication”.

Data deduplication carries with it many of the same drawbacks and benefits as other compression processes.  For example, whenever data is transformed there is a potential risk of lost or corrupted data.  In addition, the compression process itself adds computational overhead.  Hopefully the benefit of an optimized storage footprint outweighs the risk, and where large amounts of data are concerned, it often does.

However, if we consider the low cost of drive space today, a small business might do well to buy additional storage capacity rather than purchase and implement a deduplication process.  One study using IBM disk manufacturing data suggests that the cost per gigabyte is dropping by roughly 37.5 percent each year.
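Taken at face value, a 37.5 percent annual decline compounds quickly. The arithmetic below illustrates the trend; the $0.10-per-GB starting price is a hypothetical figure, not one from the study.

```python
# Compounding the ~37.5% annual decline cited above:
# cost after t years = starting_cost * (1 - 0.375) ** t
starting_cost = 0.10  # hypothetical cost per GB in dollars (illustrative only)
decline = 0.375

for t in range(6):
    cost = starting_cost * (1 - decline) ** t
    print(f"year {t}: ${cost:.4f}/GB")
# After 5 years the cost has fallen to under a tenth of the starting price.
```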

So, before you “throw down” that pile of cash, you might consider integrating low-cost data storage as a safer and easier alternative to implementing data deduplication processes.  Meanwhile, check back here at StorageCraft often for more backup and data recovery solutions.

  1. Octavian Grecu on


    I’m just wondering if any of you have actually tested this scenario in the end and come to any conclusion since this article was published.

    Thank you!

  2. tommcg on

    I think you are missing the point entirely here. I have a home with 5 PCs all running same Windows OS version and same versions of Office. MOST of the file data on the machines are copies of same files on other machines: the Windows OS files and Office binaries. I want to backup full system snapshot images (not just photos and music) daily to a NAS on my LAN, or even a headless Windows machine acting as a NAS (like the old Windows Home Server product). I want the bandwidth savings of laptops backing up over wifi to notice that those windows files are already stored and not transmit them over wifi. I also want the total NAS storage of all combined backups reduced so that I can copy the NAS storage to either external drive for offsite storage, or more interesting up to the cloud for redundancy. ISP bandwidth caps, limited upstream bandwidth, and cloud storage annual cost per GB mean that deduplicated backup storage is essential. The cost of additional local storage is NOT the only consideration.

    I don’t care about Windows Server’s integrated deduplication. The deduplication has to be part of the backup system itself, especially if you are doing cluster or sector level deduplication, to avoid sending the duplicate data over the wire to the data storage in the first place.

    I’ve been looking at different backup solutions to replace Windows Home Server (a decade-old product that offered deduplication), and your product looked very interesting, but unfortunately the lack of built-in deduplication rules it out for me. I can only imagine how this affects 100-desktop customers when I won’t even consider it for 5-desktop home use.

  3. Steven Snyder on

    Thank you for your comments. We appreciate all points of view on this topic.

    I agree that ISP bandwidth caps, limited upstream bandwidth, and cloud storage cost per GB show how critical it is to minimize data transmissions offsite. I also believe that, much like modems and Betamax tapes, today’s bandwidth limits are giving way to higher-speed access everywhere. For example, Google Fiber is now available to some of my peers at the office. Cellular LTE and satellite technologies are also increasing bandwidth for small business and home offices. At the same time, our data consumption and data creation are increasing at a rate that may outpace this increased supply of bandwidth. Either way, there are ways to work around data transmission limits.

    One way we help with data transmission over slower networks is by incorporating WAN acceleration and bandwidth scheduling technologies into our offsite replication tools. These allow you not only to make the most efficient use of available bandwidth but also to schedule your data replication during off-peak hours. Another way we help with data transmission is through compression. Deduplication is, after all, simply another form of data compression, one which reduces the near-side (source) data before it is transmitted over the wire to the target.

    In your case, you could use our product to store images on a local volume which has deduplication. You could then replicate data over the wire to offsite storage using ImageManager or some other tool. Many of our customers do this very thing.

    Keep in mind that the deduplication process has to occur at some point: either at the source or at the target. If you wanted to deduplicate your 5 PCs, you would be best served by a BDR solution that can read each of those PCs, see the duplicate files on each, and avoid copying those files to storage. In this example, deduplication would occur on your BDR, but you’re still reading data from each PC over the wire to your BDR. In addition, your BDR would control the index for data stored on a separate volume, or perhaps have the storage volume incorporated into the BDR itself. This creates a single point of failure: if your BDR crashes, the backup images for your 5 PCs wouldn’t be recoverable and current backup processes would cease.

    At StorageCraft we focus on recovery. Our philosophy means that we take the smallest, fastest backup images we can, and then we give you ways to automatically test those images for reliability, compress them into daily/weekly/monthly files according to your retention policy, and replicate those images locally and offsite. This gives you a solid foundation from which to recover those images quickly to almost any new environment. I have yet to see a faster, more reliable solution among our competitors.