Re: [ADSM-L] Data Deduplication

Subject: Re: [ADSM-L] Data Deduplication
From: Wanda Prather <wprather AT JASI DOT COM>
Date: Wed, 29 Aug 2007 16:11:45 -0500

I have more than 1 customer considering a de-dup VTL product.

It's true that for regular file systems, TSM doesn't re-dump unchanged
files, so people aren't getting AS LARGE a reduction in stored data (of
that type) as would a user of an old-style full dump - incremental -
incremental - full dump product.

OTOH, even with TSM, your DB dumps (Exchange, SQL, most Oracle
implementations) are still for the most part full dumps, followed by
incrementals, then full dumps.  The larger the database, in most cases,
the less the contents change.  And you can't use subfile backup on
anything larger than 2 GB.

I have several customers that have a relatively small number of clients
(say 50 or fewer), but the bulk of their daily backup data is 1 or 2 very
large databases.  And the bulk of the CONTENTS of those databases
doesn't change all that much.  Send that DB full dump to a de-dup VTL that
can identify duplicate "blobs" (I'm using that as a generic term because I
don't mean "block" in the sense of a disk block or sector, and different
vendors can identify larger or smaller duplicate blobs), and you get a
very large impact that TSM can't provide.  The only thing that gets stored
each day is the delta bits.  Even if it's an Exchange/SQL/Oracle full-dump
day, the amount of new data to be stored may be 10% or less of what it
used to be.
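To make the "blob" idea concrete, here's a minimal sketch of
fingerprint-based dedup (mine, not any vendor's actual design), using
fixed-size chunks for simplicity; real appliances typically use
variable-size, content-defined chunking:

```python
import hashlib

def store_dump(dump, store, chunk_size=4096):
    """Split a backup image into fixed-size chunks, keeping only unique ones.

    `store` maps fingerprint -> chunk bytes and is shared across dumps, so a
    second full dump that mostly repeats the first adds almost nothing new.
    Returns the "recipe": the ordered fingerprints needed to rebuild the dump.
    """
    recipe = []
    for i in range(0, len(dump), chunk_size):
        chunk = dump[i:i + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()  # fingerprint of this blob
        store.setdefault(fp, chunk)           # store each unique blob once
        recipe.append(fp)
    return recipe

store = {}
day1 = b"A" * 40960                    # Monday's full dump: 10 identical chunks
day2 = b"A" * 36864 + b"B" * 4096      # Tuesday's full dump: one chunk changed
store_dump(day1, store)
store_dump(day2, store)
# 80 KB of logical full dumps, but only 2 unique chunks (8 KB) actually stored
```

On Tuesday's full-dump day, only the changed chunk lands on disk - the
"delta bits" above - even though the backup application sent a full dump.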

And I have more than 1 customer looking at a de-dup VTL as a way to make
managing their own DR sites practical, because those VTL's can replicate
to EACH OTHER across the WAN.  The huge cost in transmitting your data to
a DR site is the cost of the pipe.  If, however, you can get the amount of
data per day down to 10% of what it used to be by having the VTL compress
and dedup, and you have another corporate location where you can put the
other VTL, it starts looking close to cost-effective in $$ terms.  In
fact, IBM recovery services is offering Data Domain equipment on the floor
in at least 1 of their recovery sites for that purpose.  (The customer
installs a DD box on their site, leases the DD box in the IBM DR site,
and replicates between the two.)
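To see why the pipe math starts to work, here's a back-of-the-envelope
calculation.  The 10:1 reduction is the figure above; the nightly volume
and link speed are my own assumed numbers, purely for illustration:

```python
def transfer_hours(gigabytes, link_mbps):
    """Hours to push a backup volume over a WAN link (1 GB ~ 8000 megabits)."""
    return gigabytes * 8000 / link_mbps / 3600

nightly_gb = 500   # assumed raw nightly backup volume
link_mbps = 45     # assumed T3-class link

raw = transfer_hours(nightly_gb, link_mbps)             # ~24.7 h: won't fit overnight
deduped = transfer_hours(nightly_gb * 0.10, link_mbps)  # ~2.5 h at 10:1 reduction
```

With the raw volume you'd need a link roughly ten times fatter (or a
courier and a tape case); after compress-and-dedup the same pipe finishes
comfortably inside a backup window.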

(Insert disclaimer here:  I'm not necessarily a fan of replicating backup
data, because the problem my customers always have is doing the DB
recovery. I think the first choice should be replicating the real DB using
something like MIMIX, so that it's always ready to go on the recovery end.
 I merely report the bit about replicating backup data because I have
customers considering it.)

Regarding the lost sales opportunities, I think you gotta go back and
consider the features that TSM has that other people don't, dedup or not.
There was a discussion on the list last month comparing TSM to
Legato & others, and there was remarkably little emphasis on management
classes and the ability of TSM to treat different data differently
according to business needs.  I still haven't seen any other product that
has what TSM provides.  (Here not afraid to expose MY ignorance - would
like to know if there is anything else out there -)


> I'd like to steer this around a bit.  Our sales folks are saying they
> are losing TSM opportunities to de-dup vendors.  What specific business
> problem are customers trying to solve with de-dup?
> I'm thinking the following:
> 1. Reduce the amount of disk/tape required to store backups.
> Especially important for an all-disk backup solution.
> 2. Reduce backup times (for source de-dup I would think.  No benefit in
> target de-dup for this).
> 3. Replication of backup data across a wide area network.  Obviously if
> you have less stored you have less to replicate.
> Others?  Relative importance of these?
> Does TSM in and of itself provide similar benefits in its natural state?
> From this discussion adding de-dup at the backend does not necessarily
> provide much though it does for the other traditional backup products.
> Since we don't dup, we don't need to de-dup.
> Help me get it because aside from the typical "I gotta have it because
> the trade rags tell me I gotta have it", I don't get it!
> Thanks, (Once again not afraid to expose my vast pool of ignorance...)
> Kelly J. Lipp
> VP Manufacturing & CTO
> STORServer, Inc.
> 485-B Elkton Drive
> Colorado Springs, CO 80907
> 719-266-8777
> lipp AT storserver DOT com
> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf 
> Of
> Curtis Preston
> Sent: Wednesday, August 29, 2007 1:08 PM
> Subject: Re: [ADSM-L] Data Deduplication
>>As de-dup, from what I have read, compares across all files on a
>>"system" (server, disk storage or whatever), it seems to me that this
>>will be an enormous resource hog
> Exactly.  To make sure everyone understands, the "system," is the
> intelligent disk target, not a host you're backing up.  A de-dupe
> IDT/VTL is able to de-dupe anything against anything else that's been
> sent to it.  This can include, for example, a file in a filesystem and
> the same file inside an Exchange Sent Items folder.
>>The de-dup technology only compares / looks at the files within its
>>specific repository.  Example: We have 8 ProtecTier nodes in one data
>>center, which equates to 8 Virtual Tape Libraries and 8 repositories.
> There are VTL/IDT vendors that offer a multi-head approach to
> de-duplication.  As you need more throughput, you buy more heads, and
> all heads are part of one large appliance that uses a single global
> de-dupe database.  That way you don't have to worry about which
> backups go to which heads.  Diligent's VTL Open is a multi-headed VTL,
> but ProtecTier is not -- yet.  I would ask them their plans for that.
> While this feature is not required for many shops, I think it's a very
> important feature for large shops.
