Veritas-bu

Re: [Veritas-bu] Tapeless backup environments?

2007-10-01 05:53:11
Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "McCammont, Anderson (IT)" <Anderson.Mccammont AT morganstanley DOT com>
To: "Curtis Preston" <cpreston AT glasshouse DOT com>, <bob944 AT attglobal DOT net>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Mon, 1 Oct 2007 10:27:06 +0100
> -----Original Message-----
> From: veritas-bu-bounces AT mailman.eng.auburn DOT edu 
> [mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf 
> Of Curtis Preston
> Sent: 01 October 2007 06:35
> To: bob944 AT attglobal DOT net; veritas-bu AT mailman.eng.auburn DOT edu
> Subject: Re: [Veritas-bu] Tapeless backup environments?
...
> 
> These are odds based on the size of the key space.  If you have 2^160
> odds, you have a 1:2^160 chance of a collision.

By saying that, the implication is that the keyspace is uniform.  It's
not.  The probability of a hash collision is a function of the uniformity
of the keyspace as well as the number of items you've hashed and the
size of the key.  There's lots of research in the crypto field that's
relevant to de-dupe.
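
To put rough numbers on that, here is a minimal sketch of the standard
birthday-bound approximation.  It assumes uniformly distributed hash
output, which is the best case, so treat the result as a floor rather
than a guarantee.  The function name and the example figures are mine,
not any vendor's.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance of at least one collision.

    p ~= 1 - exp(-n(n-1) / 2^(bits+1)), assuming uniform hash output.
    A non-uniform hash can only make the real figure worse.
    """
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)

# Example: 2^40 stored blocks against a 160-bit hash.
print(collision_probability(2 ** 40, 160))
```

For 2^40 blocks and a 160-bit hash this comes out at roughly 4e-25,
far larger than 1 in 2^160, because the number of items hashed matters
as much as the size of the key.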

You also should consider the characteristics of the de-dupe software
when it encounters a hash collision.  Backups are the last line of
defence for many, when all else (personal copies, replication, snapshots
etc.) has failed.  The 'acceptable risk' of a hash collision is of
little comfort when you've got one.  Does it fail silently, throw its
hands in the air and core dump, or handle the situation gracefully and
carry on without missing a beat?  Ask the vendors what they do.  As
Curtis mentioned, not all de-dupe s/ware relies purely on hashes.
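
As an illustration of "handle the situation gracefully", here is a toy
sketch of a block store that treats a hash match as a hint and confirms
it with a byte compare before sharing a block.  This is not how any
particular product works; the class name, the bucket layout and the
digest_bytes knob (there only to force collisions for testing) are all
assumptions of mine.

```python
import hashlib

class VerifyingStore:
    """Toy block store: de-dupes by digest, but confirms with a byte compare."""

    def __init__(self, digest_bytes: int = 20):
        self.digest_bytes = digest_bytes  # truncate the digest to provoke collisions
        self.blocks = {}                  # key -> list of distinct blocks sharing that key

    def _key(self, data: bytes) -> bytes:
        return hashlib.sha1(data).digest()[: self.digest_bytes]

    def put(self, data: bytes):
        """Store a block and return a (key, index) reference to it."""
        key = self._key(data)
        bucket = self.blocks.setdefault(key, [])
        for i, existing in enumerate(bucket):
            if existing == data:          # genuine duplicate: share the stored copy
                return (key, i)
        bucket.append(data)               # new block, or a collision kept separately
        return (key, len(bucket) - 1)

    def get(self, ref):
        key, i = ref
        return self.blocks[key][i]
```

With digest_bytes=1 there are only 256 possible keys, so storing 300
distinct blocks must collide, yet every block still reads back intact.
The cost is an extra read and compare on every hash match.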

Balance this with the /fact/ that there's already a chance of undetected
corruption in the components you buy today, which is why most
technologies that survive impose their own data validation checks
instead of relying purely on the underlying technology in the stack to
have checked it for them.  The multi-layered checks that go on improve
your overall confidence. 

At least one design in the SiS field also accepts that hashing
algorithms will improve over time, and its designers have had the
foresight to let new hashing schemes be dropped in later.
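
One way such a drop-in scheme could look, purely as a sketch: store the
algorithm name alongside each digest so that old and new fingerprints
can coexist.  The "name:hexdigest" format and both function names are
hypothetical, not taken from any product.

```python
import hashlib

def fingerprint(data: bytes, algorithm: str = "sha1") -> str:
    """Return a digest tagged with the algorithm that produced it."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"

def matches(data: bytes, tagged: str) -> bool:
    """Re-hash data with whichever algorithm the stored tag names."""
    algorithm, _, digest = tagged.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == digest

old_style = fingerprint(b"some block", "sha1")
new_style = fingerprint(b"some block", "sha256")
print(matches(b"some block", old_style), matches(b"some block", new_style))
```

Blocks written under the old scheme stay verifiable, while new blocks
can use the stronger hash.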

When picking de-dupe software you should also care about Intellectual
Property.  Who's got what isn't necessarily clear in this space, and the
patent lawyers won't be far away.  Picking the big boys helps here, but
also look at people with a mature view of the marketplace (e.g. some
companies are prepared to talk about licensing deals rather than court
cases when they encounter infringement).

There are lots of other things to consider in picking an algorithm,
including how well it handles patterns that don't fall naturally on
block boundaries (think of the challenges involved in de-duping 'the
quick brown fox' and 'the quicker brown fox'), which will affect de-dupe
ratios, and how that affects performance.  And the solution's not just
about the algorithm.
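
The fox example can be made concrete with a toy comparison of
fixed-size blocks against boundaries chosen from the content itself.
Cutting after each space is only a stand-in I've picked for
illustration; real products would use something more robust, such as a
rolling checksum over a sliding window.  The 4-byte block size is
equally arbitrary.

```python
def fixed_chunks(data: bytes, size: int = 4):
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes, delimiter: bytes = b" "):
    """Split after each delimiter, so boundaries move with the content."""
    chunks, start = [], 0
    for i in range(len(data)):
        if data[i:i + 1] == delimiter:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared(a, b):
    """Chunks the two versions have in common, i.e. what de-dupe can reuse."""
    return set(a) & set(b)

old, new = b"the quick brown fox", b"the quicker brown fox"
print(sorted(shared(fixed_chunks(old), fixed_chunks(new))))
# -> [b'quic', b'the ']
print(sorted(shared(content_chunks(old), content_chunks(new))))
# -> [b'brown ', b'fox', b'the ']
```

With fixed blocks the two inserted bytes shift everything after them,
so nothing past the edit matches.  With content-defined boundaries the
chunks resynchronise straight after the changed word.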

De-dupe is a great advance, and a disruptive technology not just for
backup but also for primary storage.  Look forward to it, but go in with
your eyes open.
