Veritas-bu

Re: [Veritas-bu] Tapeless backup environments?

2007-10-16 00:32:54
Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "bob944" <bob944 AT attglobal DOT net>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Tue, 16 Oct 2007 00:09:30 -0400
cpreston <netbackup-forum AT backupcentral DOT com>:
> As promised, I looked into applying the Birthday Paradox 
> logic to de-duplication.  I blogged about my results here:
> 
> http://www.backupcentral.com/content/view/145/47/
> 
> Long and short of it: If you've got less than 95 Exabytes of 
> data, I think you'll be OK.

One of us still doesn't understand this. :-)

Your blog raises a red herring by misunderstanding or misrepresenting
the applicability of the Birthday Paradox.  The number of possible
values in BP is 366; there is no data reduction in it, no key values.
An algorithm that reduced the 366 possibilities in the same proportion
as hashing 8KB down to 160 bits would yield infinitesimal keys, smaller
than one bit: an absurdity.  That absurdity should show that even if
the reduction stopped at eight bits, one short of the nine bits
required to hold 1-366, there would still be fatal hash
collisions--say, Feb 7, Feb 11 and Jun 30 all represented by the same
code, in which case you can't tell whether people in the room have the
same birthday.

What you must grasp is that it is *impossible* to
represent/re-create/look up all 2^65536 possible values of a 65536-bit
(8KB) block in fewer than 65536 bits--unless you concede that each
checksum/hash/fingerprint will represent many different values of the
original data--any more than you can represent three bits of data with
two.
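
The three-bits-into-two case is small enough to check exhaustively.
The sketch below (illustrative only; the names are mine) tries every
possible assignment of a 2-bit code to each of the eight 3-bit values
and counts how many assignments give every value its own code.

```python
from itertools import product

values = range(8)   # every 3-bit value
codes = range(4)    # every 2-bit code

# Each tuple assigns one code to each value; there are 4**8 such tuples.
# An assignment is collision-free only if all eight codes are distinct.
collision_free = sum(
    1
    for assignment in product(codes, repeat=len(values))
    if len(set(assignment)) == len(values)
)
print("collision-free assignments:", collision_free)
```

With only four codes available, no assignment can use eight distinct
ones, so the count is zero whatever mapping is chosen.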

Hashing is a technique for saving time in certain circumstances.  It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values.  All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that.  What _would_ be a
useful exercise for the reader is to discover how many unique values of
8KB are, on average, represented by a given 160-bit
checksum/hash/fingerprint.
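
For anyone who wants to check the exercise, a minimal sketch of the
counting argument, assuming 8KB means 8192 bytes and averaging over
all 2^160 possible hash values: the average is simply the number of
distinct blocks divided by the number of distinct hashes.

```python
import math

BLOCK_BITS = 8 * 1024 * 8   # 8KB block, assumed to be 8192 bytes = 65536 bits
HASH_BITS = 160             # fingerprint size discussed in this thread

# 2**BLOCK_BITS possible blocks spread over 2**HASH_BITS possible hashes
# gives an average of 2**(BLOCK_BITS - HASH_BITS) blocks per hash value.
avg_log2 = BLOCK_BITS - HASH_BITS

# Size of that average in decimal digits, computed with logarithms so the
# huge integer never has to be converted to a string.
decimal_digits = math.floor(avg_log2 * math.log10(2)) + 1

print(f"average blocks per hash value: 2^{avg_log2}")
print(f"decimal digits in that number: {decimal_digits}")
```

This counts how many distinct blocks could share a fingerprint; it
says nothing about how likely two blocks in a real data set are to do
so.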

