Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "Curtis Preston" <cpreston AT glasshouse DOT com>
To: <bob944 AT attglobal DOT net>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Thu, 18 Oct 2007 04:06:52 -0400
At the risk of tilting at windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind.  I'll say again that no one is claiming hash collisions can't
happen.  We are simply saying that the odds of one happening are
astronomically lower than the odds of an undetected/uncorrected bit
error on tape.  And I believe that the math I use in my blog post
illustrates this.

I said:
> As promised, I looked into applying the Birthday Paradox 
> logic to de-duplication.  I blogged about my results here:
> 
> http://www.backupcentral.com/content/view/145/47/
> 
> Long and short of it: If you've got less than 95 Exabytes of 
> data, I think you'll be OK.

Bob944 said:
>>One of us still doesn't understand this. :-)

Got that right. :-)

>>Your blog raises a red herring in misunderstanding or misrepresenting
>>the applicability of Birthday Paradox.  

I completely disagree.  If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case.  All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:

* The odds of two people in the same room having the same birthday 
  increase with the number of people in the room.  If there are only
  two people in the room, those odds are roughly 1 in 365, or .27%
  (leap years aside).  If there are 23 people in the room,
  the odds are just over 50%.
  
* The odds of two DIFFERENT blocks having the same hash (i.e. a
  hash collision) increase with the number of blocks in the data set.
  If there are two blocks in the set, the odds are 1 in 2^160.
  If there are fewer than 12.7 quintillion blocks in the data set,
  the odds don't show up in a percentage calculated out to 50 decimal
  places.  As soon as you have more than 12.7 quintillion blocks, the
  odds at least register in 50 decimal places, but are still really 
  small.  And to get 12.7 quintillion blocks, you need to store at
  least 95 Exabytes of data.
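If it helps to see the arithmetic, here's a small Python sketch (my own
illustration, not code from the blog) that computes the exact Birthday
Paradox odds, plus the standard 1 - e^(-n(n-1)/2d) approximation for
when n is too large to loop over:

```python
from math import expm1

def birthday_prob_exact(n, d):
    """Exact odds that at least two of n uniform draws from d values clash."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (d - i) / d
    return 1.0 - p_unique

def birthday_prob_approx(n, d):
    """Approximation 1 - e^(-n(n-1)/2d); accurate when n is far below d."""
    return -expm1(-n * (n - 1) / (2 * d))

# 23 people, 365 possible birthdays: just over a 50% chance of a clash.
print(birthday_prob_exact(23, 365))        # ~0.507

# 12.7 quintillion blocks, 2^160 possible SHA-1 values: still tiny odds.
print(birthday_prob_approx(int(12.7e18), 2**160))
```

The numbers and block counts are the ones from the discussion; the
functions themselves are just textbook Birthday Paradox formulas.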

>>The number of possible values in
>>BP is 366; there is no data reduction in it, no key values.  An
>>algorithm which reduced the 366 possibilities the same way that hashing
>>8KB down to 160 bits would yield infinitesimal keys smaller than one
>>bit, an absurdity.  

Yeah, IMHO, we are talking apples and oranges.  Let me try to put the
hash collision into the birthday world.  Let's say that we want a wall
of photos of everyone who came to our party.  When you show up, we check
your birthday,  and we check it off on a list.  (We'll call your BD the
"hash.")  If we've never seen your birthday before, we take your photo
and put it on the wall.  If your birthday has already been checked off
on the list, though, we don't take your photo.  We assume that since you
have the same birthday, you must be the same person.  So you don't get
your photo taken.  We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back).  Now, if he really is the same guy, that's not a hash/BD
collision.  If he is actually a different person, and we said he was
the same person simply because he had the same birthday, then that
would be a hash/BD collision.

And THIS is the absurdity: thinking you can represent n number of
people at a party with an array of photos selected solely by their
birthday (a key space of only 366).  But it's not out of the realm of
possibility to say that we could represent n number of blocks in our
data center with an array of blocks selected solely by a 160-bit hash
(a key space of 2^160).  Cryptographers have been doing it for years.
We're just adding another application on top of it.
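To make the photo-wall analogy concrete in code, here's a minimal
sketch of hash-based de-dupe (my own illustration, not any vendor's
implementation; names and block contents are made up):

```python
import hashlib

# Keep one copy of each unique block, keyed by its 160-bit SHA-1
# fingerprint -- the "birthday," but with a key space of 2^160, not 366.
store = {}  # fingerprint (hex) -> block bytes

def dedupe_write(block: bytes) -> str:
    fp = hashlib.sha1(block).hexdigest()
    if fp not in store:          # first time we've seen this fingerprint:
        store[fp] = block        # take the "photo" and put it on the wall
    return fp                    # callers keep only the fingerprint

ref1 = dedupe_write(b"an 8KB block of backup data")
ref2 = dedupe_write(b"an 8KB block of backup data")  # duplicate block
assert ref1 == ref2 and len(store) == 1  # stored once, referenced twice
```

A hash collision here would be two DIFFERENT blocks mapping to the same
fingerprint, in which case the second block would silently be treated
as the first -- exactly the "same birthday, assumed same person" case.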

>>An absurdity which should show that even if it
>>stopped at eight bits, one short of the bits required to hold 1-366,
>>there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
>>30 all represented by the same code, in which case you can't figure
>>out if people in the room have the same birthday.

Again, I hope you'll read what I wrote above.  In the analogy, we're
not de-duping birthdays; we're de-duping people BASED on their
birthdays.  (Which would be a dumb idea, because the key space is too
small: 366.)

>>What you must grasp is that it is *impossible* to
>>represent/re-create/look up the values of 2^65536 bits in fewer than
>>2^65536 bits--unless you concede that each checksum/hash/fingerprint
>>will represent many different values of the original data--any more
>>than you can represent three bits of data with two.

I concede, I concede!  The only point I'm trying to make concerns the
odds that two different blocks of data will have the same hash (i.e. a
hash collision) in a given data center.

>>Hashing is a technique for saving time in certain circumstances.  It
>>is valueless in re-creating (and a lookup is a re-creation) original
>>data when those data can have unlimited arbitrary values.  All the
>>blog hand-waving about decimal places, Zetabytes and the specious
>>comparison to undetected write errors will not change that.  

This is the part where I believe you've made up your mind already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are -- you're not going to accept hash-based
de-dupe.  Fine!  That's why there are vendors that don't use hashes to
de-dupe data.  Buy one of those instead.  Some use a weak hint plus a
bit-level verify; some use delta-differencing technologies, which are
bit-to-bit comparisons as well.
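For what it's worth, the "weak hint + bit-level verify" approach is
easy to sketch too (again, my own illustration, not any specific
vendor's design): a cheap checksum narrows down the candidates, and a
full byte-for-byte compare makes the final call, so a checksum
collision can never silently merge two different blocks.

```python
import zlib

candidates = {}  # weak 32-bit CRC hint -> list of stored blocks

def dedupe_verified(block: bytes) -> bool:
    """Return True if the block was a verified duplicate of stored data."""
    hint = zlib.crc32(block)               # cheap hint; collisions expected
    for stored in candidates.setdefault(hint, []):
        if stored == block:                # bit-level verification
            return True                    # true duplicate; store nothing
    candidates[hint].append(block)         # new data, even if the hint matched
    return False

assert dedupe_verified(b"block A") is False  # first sight: stored
assert dedupe_verified(b"block A") is True   # verified duplicate
assert dedupe_verified(b"block B") is False  # different data: stored
```

The trade-off is the extra read and compare on every hint match, which
is the price those vendors pay to avoid trusting the hash alone.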

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu