Re: [ADSM-L] Data Deduplication

2007-08-30 16:53:55
From: Curtis Preston <cpreston AT GLASSHOUSE DOT COM>
Date: Thu, 30 Aug 2007 16:51:54 -0400
Since this message is pretty pro-de-dupe, I want to mention that I don't
sell any of this stuff.  I'm just excited about the technology, have
many customers large and small using it, and want to make sure it's
accurately represented.

>"We don't need tape, because disk is cheap!"
>"We have to save disk!  Buy (and integrate, and manage) a new product!"

I would put that history slightly differently.  I don't know anyone who
knew what they were doing that was saying "we don't need tape!"  What
they were saying is:

"Tape drives are now way too fast!  We have to stage backups to disk to
keep the drives streaming.  Wouldn't it be cool if we could also do away
with tape onsite?  We'd still need it for offsite, though."


"Holy crap!  VTLs are expensive!  Forget the 'store all onsite backups
on disk' part.  Let's just do staging.  That requires a much smaller
amount of disk."


"De-dupe is here.  Using that, we can take the amount of disk that we
would have bought just for staging and store all our onsite backups on
it.  Wow."

>I think a back-end de-dup (de do da da) would still offer advantages
>to TSM: if you've got mumblety-hundred (e.g.) Win2K boxen, then most
>of their system and app space would be identical. This could,
>conceivably, end up as close to one system-images' worth of space on
>the back end.  In a fantasy. :)

This is not a fantasy.  There are products that have been GA for 3+
years that are doing just this.  These products also notice when a file
has been modified multiple times and back up only the blocks that
changed each time.  In addition, they notice users' files that are
duplicated between the filesystem and, for example, Exchange inboxes
and Sent Items folders.  They notice attachments that were sent to
multiple remote offices and have already been backed up.  All of this
is reality, is GA, and is being used by many companies, many of them
very, very large.
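
The core idea behind the block-level behavior described above can be
sketched in a few lines.  This is a minimal illustration, not how any
particular product implements it: the fixed 4 KB block size, the
in-memory dict as the block store, and the `dedupe_store` helper are all
my own illustrative assumptions (real products typically use
variable-length chunking and on-disk indexes).

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real products vary


def dedupe_store(data: bytes, store: dict) -> list:
    """Split data into blocks and keep only blocks not already in the store.

    Returns the list of block hashes that together reference the data,
    so a re-modified file costs only its genuinely new blocks.
    """
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:      # new block: store it once
            store[digest] = block
        refs.append(digest)          # known block: just reference it
    return refs


store = {}
refs_a = dedupe_store(b"A" * 8192, store)                 # two identical blocks
refs_b = dedupe_store(b"A" * 4096 + b"B" * 4096, store)   # one block shared
# Despite 16 KB of input, the store holds only two unique 4 KB blocks.
```

The same mechanism is why a file attached to mail in ten inboxes, or
present in both a mailbox and the filesystem, only gets stored once:
every copy hashes to blocks that are already in the store.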

>However, the server would need to do an awful lot of work to correlate
>all these data.

It's not easy, but it's not as hard as you may think.  The main work
comes from two things: computing a SHA-1 hash on each block of data and
looking up that hash in a big hash table.  The first is only performed
by each client (speaking of source de-dupe) on new or changed files, so
it's not as bad as you might think.  The second can handle quite a few
clients simultaneously without being a bottleneck.  At some point, you
may need multiple hash tables and servers to handle the lookup, but the
workload can be distributed.  For example, install a second lookup
server and each server handles lookups for half of the total list of
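
Splitting the lookup work is straightforward because a hash already
distributes keys uniformly.  A minimal sketch of one way to do it
(partitioning on the leading byte of the digest -- the `shard_for`
helper and two-server setup are my own illustrative assumptions, not
any product's actual scheme):

```python
import hashlib


def shard_for(digest_hex: str, num_shards: int) -> int:
    """Route a block hash to one of several lookup servers by
    partitioning the hash space on the digest's leading byte."""
    return int(digest_hex[:2], 16) % num_shards


# With two lookup servers, each ends up handling roughly half of all
# hashes, because SHA-1 output is effectively uniform.
shards = {0: set(), 1: set()}
for i in range(1000):
    h = hashlib.sha1(str(i).encode()).hexdigest()
    shards[shard_for(h, 2)].add(h)
```

Adding a third server is just `num_shards=3`; the point is only that
the routing decision is cheap and deterministic, so no central
coordinator sits in the data path.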

As to how fast de-dupe backup software is, it's definitely fast enough
to keep up with remote offices and medium-sized datacenters.  Once we
start getting into many TBs of LOCAL data (i.e. a large datacenter),
there are much more efficient ways to back it up.  But if the data is
remote, de-dupe backup software is hard to beat.

(These last few comments were about de-dupe backup software -- not to be
confused with de-dupe VTLs.  Those actually go VERY fast and can handle
the largest of environments.)
