Re: [ADSM-L] Data Deduplication

At 03:40 PM 8/29/2007, Kelly Lipp wrote:

Help me get it because aside from the typical "I gotta have it
because the trade rags tell me I gotta have it", I don't get it!


Kelly,

I think you are correct in that TSM already gives you some of the
benefits that a more traditional backup product would get by using a
dedup VTL.  But TSM only does it at the file level.  I.e., if a file
doesn't change, TSM won't back it up again, whereas other backup
products might.  A dedup VTL goes further, in that it will dedup
more of the data: for example, common files that exist across a bunch
of clients (think about emails, attachments, Windows System Objects),
or things like Oracle database backups.
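
To make the difference concrete, here is a rough sketch in Python.  It
is a toy model, not how TSM or any particular VTL is implemented: the
function names, the 4-byte chunk size, and the sample data are all made
up, and real TSM incrementals decide on file attributes rather than on
content hashes.

```python
import hashlib

CHUNK = 4  # tiny fixed chunk size, for illustration only

def file_level_stored(backups):
    """Bytes stored when a file is skipped only if the SAME client
    already sent identical content under the same name."""
    seen = set()
    stored = 0
    for client, name, data in backups:
        key = (client, name, hashlib.sha256(data).hexdigest())
        if key not in seen:
            seen.add(key)
            stored += len(data)
    return stored

def chunk_level_stored(backups):
    """Bytes stored when identical chunks are kept once, regardless of
    which client or file they came from."""
    seen = set()
    stored = 0
    for _client, _name, data in backups:
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return stored

# Hypothetical backup stream: (client, file name, contents)
backups = [
    ("clientA", "report.doc", b"AAAABBBBCCCC"),
    ("clientB", "report.doc", b"AAAABBBBCCCC"),  # same file on another client
    ("clientA", "report.doc", b"AAAABBBBDDDD"),  # one chunk changed
    ("clientA", "report.doc", b"AAAABBBBDDDD"),  # unchanged, skipped either way
]

print(file_level_stored(backups))   # 36
print(chunk_level_stored(backups))  # 16
```

In this toy stream the file-level approach only saves the last,
unchanged backup, while the chunk-level approach also collapses the
copy on the second client and the unchanged part of the modified file.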

There is still a benefit to using a dedup VTL in a TSM environment,
but not nearly as great as in a traditional backup environment
(grandfather/father/son).  Since you will likely pay some sort of
premium for a dedup VTL, the question is: is the premium worth
it?  Or would you be better off buying a bunch of cheaper storage
(tape or even SATA disk) and storing those extra copies?  The answer,
of course, is "it depends".  But I think dedup VTLs will be a harder
sell in a TSM environment than in other environments.
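
One way to frame the "is the premium worth it" question is as simple
break-even arithmetic.  A rough sketch follows; the prices and function
names are hypothetical, chosen only for illustration, and are not
vendor quotes.

```python
def effective_cost_per_tb(raw_cost_per_tb, dedup_ratio):
    """Cost per TB of backup data protected, given a dedup ratio
    (3.0 means 3 TB of backups fit in 1 TB of physical disk)."""
    if dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be positive")
    return raw_cost_per_tb / dedup_ratio

def breakeven_ratio(dedup_cost_per_tb, plain_cost_per_tb):
    """Dedup ratio at which the dedup VTL costs the same per protected
    TB as plain storage that simply keeps every copy."""
    return dedup_cost_per_tb / plain_cost_per_tb

# Hypothetical prices, for illustration only.
DEDUP_VTL_PER_TB = 3000.0
PLAIN_DISK_PER_TB = 1000.0

needed = breakeven_ratio(DEDUP_VTL_PER_TB, PLAIN_DISK_PER_TB)
print(f"dedup must exceed {needed:.1f}:1 to beat plain disk")
```

Since TSM already avoids re-sending unchanged files, the ratio a dedup
VTL achieves on what is left may be lower than in a
full-backup-every-week shop, which is the "harder sell" in numbers.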

As I've researched this, I'm thinking more about buying a smaller
dedup VTL as an adjunct to our other back-end storage, which would
allow us to target certain types of data that we know will dedup
well, such as Windows System Objects, Exchange server backups, Oracle
backups, etc.  One problem is that the natural way to direct data to
a particular storage target is via TSM management classes, but those
are already overloaded with other settings such as retention and
number of versions.

It might be nice to see TSM introduce some new capabilities to help
support a dedup VTL, or perhaps do some of what Curtis calls
source-deduping.  I know they've been thinking about something along
these lines for a while.

One other point about dedup VTLs:  some do their deduping in-band,
whereas others do it out-of-band.  The in-band ones avoid ever
storing duplicate data, but their throughput can be more limited.
This is only an issue if you need to move more data than they can
handle.  The out-of-band ones store the data first, then dedup it
afterwards.  At least one of these that I know of (Sepaton) can scale
its performance by adding engines.  I believe one vendor now supports
either way of doing this.
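
A minimal sketch of the two approaches, again as a toy model in Python.
The class names and the sample stream are invented for illustration and
do not describe any vendor's design.

```python
import hashlib

class InBandStore:
    """Dedups each chunk as it arrives; duplicates never hit disk."""
    def __init__(self):
        self.chunks = {}
        self.peak_bytes = 0

    def write(self, chunk):
        # Hashing happens in the write path, which is where the
        # throughput limit comes from.
        self.chunks.setdefault(hashlib.sha256(chunk).digest(), chunk)
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

class OutOfBandStore:
    """Lands everything first, then dedups in a later pass."""
    def __init__(self):
        self.landing = []
        self.chunks = {}
        self.peak_bytes = 0

    def write(self, chunk):
        # Fast path: just store the data as received.
        self.landing.append(chunk)
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())

    def dedup_pass(self):
        for chunk in self.landing:
            self.chunks.setdefault(hashlib.sha256(chunk).digest(), chunk)
        self.landing.clear()

    def stored_bytes(self):
        return (sum(len(c) for c in self.landing)
                + sum(len(c) for c in self.chunks.values()))

# Hypothetical stream with repeated chunks.
stream = [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]

inband = InBandStore()
oob = OutOfBandStore()
for chunk in stream:
    inband.write(chunk)
    oob.write(chunk)
oob.dedup_pass()

print(inband.peak_bytes, inband.stored_bytes())  # 8 8
print(oob.peak_bytes, oob.stored_bytes())        # 16 8
```

Both end up holding the same 8 bytes, but the out-of-band store needed
room for all 16 bytes before its dedup pass ran, while the in-band
store paid for the hashing on every write.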

..Paul



--
Paul Zarnowski                            Ph: 607-255-4757
Manager, Storage Services                 Fx: 607-255-8521
719 Rhodes Hall, Ithaca, NY 14853-3801    Em: psz1 AT cornell DOT edu