The Basics of Deduplication: Data Type, Chunk Size, Source/Target, Re-hydration

February 26

Data deduplication is essentially a specialized data compression technology. The name “deduplication” implies that this technology removes duplicate information from a given set of data. The idea is that removing duplicate data patterns will reduce storage space requirements. The technology replaces each additional duplicate chunk of data with a pointer to the “original” data pattern while maintaining an index of these pointers so that the data can be rehydrated at a future time.
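
To make the pointer-and-index idea concrete, here is a minimal sketch in Python. It assumes fixed 8-byte chunks and uses a SHA-256 digest as the pointer; the `deduplicate` function, the `store` and `recipe` names, and the sample data are illustrative choices rather than any particular product's design.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 8):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # digest -> chunk bytes (the single "original" copy)
    recipe = []   # ordered digests: the pointers needed to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # store only the first occurrence
        recipe.append(digest)             # every occurrence gets a pointer
    return store, recipe

data = b"ABCDEFGH" * 3 + b"12345678" + b"ABCDEFGH"
store, recipe = deduplicate(data)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(len(data), "bytes in;", stored_bytes, "bytes of unique chunks;",
      len(recipe), "pointers")
```

The pointers themselves take up space, so the net saving in a real system depends on how large the chunks are relative to the index entries.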

Deduplication and the benefits derived from it depend upon several factors, including the original data type, the size of the data chunk, and the algorithm being used. For example, a typical file system may contain multiple copies of a specific document. Each additional copy beyond the first takes up space on the server. A process that replaces these additional copies with links to one copy would reduce the space used to the size of the original document plus the cumulative size of all the links. In this example, the chunk being deduplicated is the entire document.
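
The arithmetic in the document example can be sketched as follows. The 64-byte link size, the file names, and the `whole_file_savings` helper are assumptions made purely for illustration; actual link sizes depend on the file system.

```python
import hashlib

LINK_SIZE = 64  # assumed bytes per link; real values depend on the file system

def whole_file_savings(files: dict, link_size: int = LINK_SIZE):
    """Return (bytes_before, bytes_after) when duplicate files become links."""
    before = sum(len(content) for content in files.values())
    seen = set()
    after = 0
    for content in files.values():
        digest = hashlib.sha256(content).digest()
        if digest in seen:
            after += link_size       # additional copy is replaced by a link
        else:
            seen.add(digest)
            after += len(content)    # first copy is kept in full
    return before, after

report = b"x" * 10_000
files = {
    "alice/report.doc": report,
    "bob/report.doc": report,
    "carol/report.doc": report,
    "carol/notes.txt": b"y" * 500,
}
before, after = whole_file_savings(files)
print("before:", before, "bytes; after:", after, "bytes")
```

Three copies of the report shrink to one full copy plus two links, while the unique notes file is unaffected.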

A “chunk” is the unit of data that is compared when looking for duplicates. Chunking techniques vary. Some chunks are defined by physical layer constraints (e.g. a fixed block size on a volume), while other methods treat entire files as chunks. Still another chunking method determines chunk boundaries by sliding a reference “window” along the file stream to find patterns that occur inside files. It should be clear that both the type of data and the chunking method affect the number of duplicates that can be found.
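
The difference between fixed-size and sliding-window chunking shows up when a single byte is inserted at the front of a file. The sketch below is a simplified illustration, not a production chunker: it recomputes a CRC-32 over each window rather than using a true rolling hash, and the window size, mask, and random test data are arbitrary assumptions.

```python
import random
import zlib

def fixed_chunks(data: bytes, size: int = 64):
    """Cut the data at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut a chunk wherever the checksum of the trailing window matches a pattern."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        # A production system would use a rolling hash instead of recomputing.
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # whatever is left after the last boundary
    return chunks

def shared(a, b):
    """Count distinct chunks that appear in both chunk lists."""
    return len(set(a) & set(b))

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4000))
edited = b"!" + original   # one byte inserted at the front

shared_fixed = shared(fixed_chunks(original), fixed_chunks(edited))
shared_cdc = shared(content_defined_chunks(original), content_defined_chunks(edited))
print("fixed-size chunks shared:", shared_fixed)
print("content-defined chunks shared:", shared_cdc)
```

Because the boundaries in the second method depend only on the bytes inside the window, they move with the content, so chunks after the insertion point line up again. Fixed offsets shift every chunk that follows the inserted byte.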

Additionally, the deduplication algorithm and its implementation affect both how the duplicate chunks are stored and how they are later restored. Deduplication can occur either in-line at the source where the data is read, or at the target where the data is stored. Source deduplication scans data on the source volume, while target deduplication removes duplicates on the secondary store. Both source and target implementations rely on computational resources at the point where the deduplication algorithm is applied, and each has its own benefits and concerns.
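
One way to picture where the work happens is the toy model below. The `Target` class and both backup functions are hypothetical, the 4-byte chunk size is chosen only to keep the example small, and the cost of exchanging digests between source and target is ignored.

```python
import hashlib

def chunk(data: bytes, size: int = 4):
    return [data[i:i + size] for i in range(0, len(data), size)]

class Target:
    """Hypothetical secondary store holding a digest -> chunk index."""
    def __init__(self):
        self.index = {}

    def missing(self, digests):
        """Report which digests the store does not hold yet."""
        return {d for d in digests if d not in self.index}

    def put(self, chunks):
        """Hash incoming chunks and keep one copy of each."""
        for c in chunks:
            self.index.setdefault(hashlib.sha256(c).digest(), c)

def target_side_backup(data: bytes, target: Target) -> int:
    """Send everything; the target hashes the chunks and drops duplicates."""
    target.put(chunk(data))
    return len(data)                      # bytes sent to the target

def source_side_backup(data: bytes, target: Target) -> int:
    """Hash at the source and send only the chunks the target lacks."""
    by_digest = {hashlib.sha256(c).digest(): c for c in chunk(data)}
    to_send = [by_digest[d] for d in target.missing(by_digest)]
    target.put(to_send)
    return sum(len(c) for c in to_send)   # bytes sent to the target

data = b"AAAABBBBAAAACCCC"
print("target-side, bytes sent:", target_side_backup(data, Target()))
source_target = Target()
print("source-side, first run:", source_side_backup(data, source_target))
print("source-side, second run:", source_side_backup(data, source_target))
```

In this sketch both approaches end up storing the same unique chunks. What differs is which machine computes the hashes and how much data has to travel to the target.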

Data should not be encrypted before it is deduplicated, as the purpose of encryption is to eliminate any discernible patterns in the data. Eliminating data patterns through encryption runs counter to the deduplication process, which depends on finding repeated patterns, and can cause serious data integrity concerns.

At some future time, the data that has been deduplicated and stored will need to be used. In order to use the data, the deduplication process must be reversed and the stored chunks reassembled into the original data pattern. This reassembly process is often referred to as data rehydration, and it requires additional computational resources beyond a simple data read. All of these elements (data type, chunk size, and the algorithm used to deduplicate and rehydrate data) will have an effect on data retrieval and storage processes as well as file storage requirements. We highly recommend that any implementation of deduplication be thoroughly researched and tested before being rolled into your production environment.
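
A minimal sketch of rehydration follows. The `rehydrate` function, the integer pointers, and the optional checksum verification are assumptions added for illustration; the checksum step is one possible safeguard, not something every implementation performs.

```python
import hashlib

def rehydrate(recipe, store, expected_sha256=None):
    """Rebuild the original bytes by following each pointer into the chunk store."""
    try:
        data = b"".join(store[pointer] for pointer in recipe)
    except KeyError as missing:
        raise ValueError(f"chunk {missing} is not in the store") from None
    if expected_sha256 is not None:
        # Optional check that the reassembled data matches what was stored.
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            raise ValueError("rehydrated data does not match the recorded checksum")
    return data

# Two stored chunks; the second is referenced three times by the recipe.
store = {0: b"Dear customer, ", 1: b"thank you. "}
recipe = [0, 1, 1, 1]
original = b"Dear customer, thank you. thank you. thank you. "
checksum = hashlib.sha256(original).hexdigest()

restored = rehydrate(recipe, store, checksum)
print(restored.decode())
```

Every pointer costs a lookup in the chunk store, which is where the extra work beyond a simple sequential read comes from. A missing chunk makes the whole object unrecoverable, which is one reason testing restores matters.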
