We know that the amount of data we store is exploding. Not only are we collecting more detailed information about specific things (e.g., servers and digital images), but we are also collecting information about more kinds of things (e.g., refrigerators and automobiles). We are experiencing the “Internet of Things,” where uniquely identifiable objects connected through modern networking technology provide us with an abundance of information. Wouldn’t it be nice to have an easy way to condense and manage this wealth of information? Enter stage right the technology known as data deduplication. Quite simply, this is a technology that reduces duplicate information to a single set of unique data patterns. Please see my other post on deduplication for more details.
In an oversimplified view of the deduplication process, every new data pattern read by a file system is fingerprinted with a hash that is unique for all practical purposes, and that fingerprint is compared against an index of previously recorded data patterns and their associated fingerprints. This process of reading data patterns, fingerprinting them, comparing them with existing patterns, and then storing unique patterns or creating references to non-unique ones requires computational resources. The same is true of the reverse process, when data is reassembled for use. These computational costs may not be trivial.
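To make that concrete, here is a minimal Python sketch of the fingerprint-and-index idea. It assumes fixed-size 4 KiB chunks and SHA-256 fingerprints; real deduplication engines differ (variable-size chunking, collision handling, persistent indexes), so treat it as an illustration rather than an implementation.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed block size for this sketch; many real systems chunk variably

def deduplicate(stream, index):
    """Split a byte stream into chunks, fingerprint each, and store only new patterns.

    `index` maps fingerprint -> chunk data; the returned "recipe" is the list of
    fingerprints needed to reassemble the stream.
    """
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:    # unique pattern: store it once
            index[fingerprint] = chunk
        recipe.append(fingerprint)      # duplicates cost only a reference
    return recipe

def reassemble(recipe, index):
    """The reverse process: rebuild the original stream from stored chunks."""
    return b"".join(index[fp] for fp in recipe)

index = {}
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096  # repeating patterns
recipe = deduplicate(data, index)
assert reassemble(recipe, index) == data
print(f"{len(data)} bytes stored as {len(index)} unique chunks")  # 16384 bytes -> 2 chunks
```

Running it shows 16 KiB of input collapsing to two stored chunks plus a short recipe of references, which is the entire space-saving trick in miniature.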
In the case of source deduplication, the client system can see processor and/or memory load increase by as much as 20%. This can be significant in a virtual environment where several clients share host resources, especially if every client suffers the performance hit at the same time. There may also be a slight delay in data read/write times due to the added processing. This suggests that deduplication is better suited to large collections of data that do not change often and do not require rapid access. It also suggests that deduplication may be better performed at the destination (target) than at the source of read/write operations.
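The exact overhead depends heavily on workload, but a micro-benchmark like the sketch below can give a rough feel for just the hashing portion of the source-side cost on your own hardware. The numbers will vary by CPU, and fingerprinting is only one piece of the full deduplication pipeline.

```python
import hashlib
import os
import time

# Rough measure of client-side fingerprinting cost alone; the index
# lookups, bookkeeping, and I/O of real source dedupe add more on top.
payload = os.urandom(64 * 1024 * 1024)  # 64 MiB of incompressible data
start = time.perf_counter()
for offset in range(0, len(payload), 4096):
    hashlib.sha256(payload[offset:offset + 4096]).digest()
elapsed = time.perf_counter() - start
print(f"Fingerprinted 64 MiB in {elapsed:.2f}s ({64 / elapsed:.0f} MiB/s on one core)")
```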
Another issue to consider is that deduplication relies on finding duplicate data patterns. Technologies like encryption, which by design removes recognizable patterns from a dataset, can drastically reduce deduplication ratios and may even be incompatible with deduplication altogether. Understanding how data deduplication interacts with data security is paramount to storing data effectively.
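Here is a small demonstration of why: encrypt the same block twice and its fingerprints stop matching. This sketch assumes the third-party cryptography package (pip install cryptography) and uses AES-GCM with a fresh random nonce per write; any cipher used correctly will behave the same way.

```python
import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)
block = b"identical 4 KiB pattern".ljust(4096, b"\0")

# Two writes of the SAME plaintext block: dedupe would store it once...
plain_prints = {hashlib.sha256(block).hexdigest() for _ in range(2)}

# ...but each encryption uses a fresh nonce, so the ciphertexts (and
# therefore their fingerprints) differ, and the duplicate disappears.
cipher_prints = {
    hashlib.sha256(aesgcm.encrypt(os.urandom(12), block, None)).hexdigest()
    for _ in range(2)
}

print(len(plain_prints))   # 1 unique fingerprint -> deduplicable
print(len(cipher_prints))  # 2 unique fingerprints -> no duplicates to find
```

This is why systems that need both features typically deduplicate the data before encrypting it.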
With the rapid growth of business data, we need a reliable way to store and retrieve information quickly. Data deduplication can deliver dramatic benefits by reducing data storage requirements. It can also impact IT resources, both in the compute it consumes and in how it changes data. It is an amazing technology, but it is not the solution for every data storage need. It is critical to first understand how data deduplication works and to weigh the benefits against the costs before we can implement it effectively in our business environments.