Re: [ADSM-L] Data Deduplication

2007-08-27 16:17:02
Subject: Re: [ADSM-L] Data Deduplication
From: Curtis Preston <cpreston AT GLASSHOUSE DOT COM>
Date: Mon, 27 Aug 2007 16:14:42 -0400
>As others have noted, different vendors dedup at different levels of

I think I'd put it slightly differently.  I'd say that they each
approach it differently.  Those different approaches may have advantages
and disadvantages with different data types.

>When I spoke to Diligent at the Gartner conference over
>a year ago, they were very tight-lipped about their actual

The patent was filed.  It's not that secret. ;)  They are quite
different in their approach, and it's a little difficult to grok.  But
based on what I know about their approach, the scenario that started the
discussion may indeed be a limitation.  (Or all the vendors may have
this limitation; I have some questions out to them.)

>The[y] would, however, state that they were able to dedup
>parts of two files that had similar data, but were not
>identical.  I.e., if data was inserted at the beginning of the file,
>some parts of the end of the file could still be deduped.  Neat trick
>if it's true.  

Any de-dupe vendor is able to claim that.  If it wasn't true, they
wouldn't see the de-dupe rates they're seeing.  They can also identify
blocks that are common between a file in the file system and the same
file emailed via Exchange.
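
To make the "data inserted at the beginning of the file" case concrete, here is a minimal sketch of one commonly described subfile technique, content-defined (variable-size) chunking.  It is not claimed to be any particular vendor's algorithm.  The gear-style rolling hash, the 64-byte average chunk size, and every name in it are assumptions picked for the demo.

```python
import hashlib
import random

# Hypothetical demo parameter, not taken from any product:
# a boundary fires with probability 1/64, so chunks average ~64 bytes.
MASK_BITS = 6

# Per-byte pseudo-random values for a "gear" style rolling hash.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def variable_chunks(data):
    """Cut a chunk wherever the rolling hash of the recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit hash, so each boundary
        # decision depends only on roughly the last 32 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - MASK_BITS) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=64):
    """Cut chunks at fixed byte offsets, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    return {hashlib.sha1(c).hexdigest() for c in chunks}


def shared_fraction(chunker, old, new):
    """Fraction of the new file's chunks already present in the old file."""
    old_fp = fingerprints(chunker(old))
    new_fp = fingerprints(chunker(new))
    return len(old_fp & new_fp) / len(new_fp)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = b"INSERTED" + original   # same file with data inserted at the front

cdc_shared = shared_fraction(variable_chunks, original, modified)
fixed_shared = shared_fraction(fixed_chunks, original, modified)
print(f"variable-size chunks shared: {cdc_shared:.0%}")
print(f"fixed-size chunks shared:    {fixed_shared:.0%}")
```

Because the boundaries follow the content rather than the byte offset, the chunks after the insertion line up again and hash to the same fingerprints, while fixed-offset blocks all shift and stop matching.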

>Other vendors dedup at the file or block (or chunk) level.

If a vendor doesn't do subfile de-dupe, then they're not a de-dupe
vendor; they're a CAS vendor.  File-level de-dupe is CAS (e.g. Centera,
Archivas), and the de-dupe is not really pitched as the main feature.
It's about using the signature as a way to provide immutability of data
stored in the CAS array.
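
A toy sketch of that idea, assuming SHA-256 as the signature and an in-memory dict standing in for the array (the class and method names are hypothetical, not any CAS product's API):

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the address IS the signature of the content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so a second copy
        # of the same file costs nothing extra (file-level de-dupe).
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-checking the signature on read is what backs the
        # immutability claim: altered content no longer matches its address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data

    def __len__(self):
        return len(self._objects)


store = ContentStore()
first = store.put(b"quarterly report, version 1")
again = store.put(b"quarterly report, version 1")   # same file stored twice
edited = store.put(b"quarterly report, version 2")  # one character changed
print(first == again, first == edited, len(store))
```

Note that the one-character edit produces a completely new object with nothing shared, which is why whole-file signatures alone don't give backup-style de-dupe ratios.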

>I've not been able to gather much more detail about the specific
>dedup algorithms, but hope to get some more info this fall, as I take a
>closer look at these products.  If anyone has more details, I'd love
>to hear them.

I wrote this article that may help: . I also
blog about de-dupe quite a bit at

