Veritas-bu

Re: [Veritas-bu] Tapeless backup environments?

2007-10-16 19:01:13
Subject: Re: [Veritas-bu] Tapeless backup environments?
From: A Darren Dunham <ddunham AT taos DOT com>
To: veritas-bu AT mailman.eng.auburn DOT edu
Date: Tue, 16 Oct 2007 22:39:23 +0000
On Tue, Oct 16, 2007 at 12:09:30AM -0400, bob944 wrote:
> One of us still doesn't understand this. :-)
> 
> Your blog raises a red herring in misunderstanding or misrepresenting
> the applicability of Birthday Paradox.  The number of possible values in
> BP is 366; there is no data reduction in it, no key values.

The 366 isn't the data space, it's the keyspace.  When we look at a
person's birthday, we're hashing them into that space.  The "paradox"
then is how many people can we hash before the chance of a "collision"
is significant.  

Obviously if 400 people are in a room, the number of values exceeds the
keyspace and the probability of a collision is greater than 1.

> An
> algorithm which reduced the 366 possibilities the same way that hashing
> 8KB down to 160 bits would yield infinitesimal keys smaller than one
> bit, an absurdity.

I'm afraid I don't understand what you mean with that sentence.

> An absurdity which should show that even if it
> stopped at eight bits, one short of the bits required to hold 1-366,
> there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
> 30 all represented by the same code, in which case you can't figure out
> if people in the room have the same birthday.

What is stopping at 8 bits?  Hash collisions can always occur.  The
question is what is the probability.
 
> What you must grasp is that it is *impossible* to
> represent/re-create/look up the values of 2^65536 bits in fewer than
> 2^65536 bits--unless you concede that each checksum/hash/fingerprint
> will represent many different values of the original data--any more than
> you can represent three bits of data with two.

I think everyone aknowledges that as a fact.

> Hashing is a technique for saving time in certain circumstances.  It is
> valueless in re-creating (and a lookup is a re-creation) original data
> when those data can have unlimited arbitrary values.

The argument is that a process does not have to be infallible to be
valuable, much like the electrical and mechanical processes we currently
use.  That if the chance of failure in the algorithm is much less then
the chance of other parts of the system introducing silent data
corruption, then the overall amount of data loss is not significantly
changed. 

-- 
Darren Dunham                                           ddunham AT taos DOT com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu