Veritas-bu

Re: [Veritas-bu] Tapeless backup environments

2007-10-18 15:20:45
Subject: Re: [Veritas-bu] Tapeless backup environments
From: "Eagle, Kent" <KEagle AT wilmingtontrust DOT com>
To: <cpreston AT glasshouse DOT com>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Thu, 18 Oct 2007 14:58:31 -0400
Sorry, but I just can't keep from jumping in at this point.
Not taking either side, but...

Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research? I could place a posting on there that
either concurs with, or totally rejects the position of that posting;
and someone else would come along and claim it as gospel.

I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!

Saying
"This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe.  Fine!  That's why there are vendors that don't use hashes to
de-dupe data.  Buy one of those instead."
is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that. Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."

If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.

I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.

I hope I'm wrong. I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing them up with
empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...

BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them.  ;-)

Take care,

Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS

---------------------------------------------------------------------------
Message: 1
Date: Thu, 18 Oct 2007 04:06:52 -0400
From: "Curtis Preston" <cpreston AT glasshouse DOT com>
Subject: Re: [Veritas-bu] Tapeless backup environments?
To: <bob944 AT attglobal DOT net>, <veritas-bu AT mailman.eng.auburn DOT edu>
Message-ID: <4FBA0941CF3D9347889AA5FF23A809BEF3C673 AT ghmail02.glasshousetech DOT com>
Content-Type: text/plain;       charset="US-ASCII"

At the risk of chasing windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind.  I again say that no one is saying that hash collisions can't
happen.  We are simply saying that the odds of them happening are
astronomically less than having an undetected/uncorrected bit error on
tape.  And I believe that the math that I use in my blog post
illustrates this.

I said:
> As promised, I looked into applying the Birthday Paradox
> logic to de-duplication.  I blogged about my results here:
> 
> http://www.backupcentral.com/content/view/145/47/
> 
> Long and short of it: If you've got less than 95 Exabytes of
> data, I think you'll be OK.

Bob944 said:
>>One of us still doesn't understand this. :-)

Got that right. :-)

>>Your blog raises a red herring in misunderstanding or misrepresenting 
>>the applicability of Birthday Paradox.

I completely disagree.  If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case.  All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:

* The odds of two people in the same room having the same birthday 
  increase with the number of people in the room.  If there are only
  two people in the room, those odds will be roughly 1 in 365, or .27% 
  (leap year aside).  If there are 23 people in the room, 
  the odds are 50%.
  
* The odds of two DIFFERENT blocks having the same hash (i.e. a
  hash collision) increase with the number of blocks in the data set.
  If there are two blocks in the set, the odds are 1 in 2^160.
  If there are fewer than 12.7 quintillion blocks in the data set,
  the odds don't show up in a percentage calculated out to 50 decimal
  places.  As soon as you have more than 12.7 quintillion blocks, the
  odds at least register in 50 decimal places, but are still really 
  small.  And to get 12.7 quintillion blocks, you need to store at
  least 95 Exabytes of data.
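
The two bullets above can be checked with a few lines of Python. This is
only a sketch: it uses the standard birthday-bound approximation
p ~= 1 - exp(-n(n-1)/(2N)) rather than the spreadsheet from the blog
post, the function names are made up for illustration, and the output
for large block counts may not match the figures quoted above.

```python
import math

def exact_birthday_probability(people: int, days: int = 365) -> float:
    """Exact chance that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

def approx_collision_probability(items: int, key_space: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2N)).

    expm1 keeps precision when the probability is extremely small.
    """
    exponent = items * (items - 1) / (2 * key_space)
    return -math.expm1(-exponent)

# Two people and 23 people in a room (leap years ignored).
print(exact_birthday_probability(2))
print(exact_birthday_probability(23))

# Two blocks, then 12.7 quintillion blocks, against a 160-bit hash.
print(approx_collision_probability(2, 2**160))
print(approx_collision_probability(12_700_000_000_000_000_000, 2**160))
```

Running it yourself is the quickest way to see how many decimal places
the odds really need for a given block count.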

>The number of possible values in
>BP is 366; there is no data reduction in it, no key values.  An
>algorithm which reduced the 366 possibilities the same way that hashing
>8KB down to 160 bits would yield infinitesimal keys smaller than one
>bit, an absurdity.

Yeah, IMHO, we are talking apples and oranges.  Let me try to put the
hash collision into the birthday world.  Let's say that we want a wall
of photos of everyone who came to our party.  When you show up, we check
your birthday,  and we check it off on a list.  (We'll call your BD the
"hash.")  If we've never seen your birthday before, we take your photo
and put it on the wall.  If your birthday has already been checked off
on the list, though, we don't take your photo.  We assume that since you
have the same birthday, you must be the same person.  So you don't get
your photo taken.  We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back).  Now, if he is indeed the same guy, that's not a hash/BD
collision.  If he is indeed a different person, and we said he was the
same person simply because he had the same birthday, then that would be
a hash/BD collision.

And THIS would be an absurdity to think you can represent n number of
people in a party with an array of photos selected solely on their
birthday (a key space of only 366).  But it's not out of the realm of
possibility to say that we could represent n number of bits in our data
center with an array of bits selected solely on a 160-bit hash (a
keyspace of 2^160).  Cryptographers have been doing it for years.  We're
just adding another application on it.
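
The photo-wall analogy can be sketched in Python. This is an
illustration only, not how any de-dupe product is implemented: the
`dedupe` helper and the guest list are hypothetical, and SHA-1 is
assumed as the 160-bit hash since the thread does not name one.

```python
import hashlib

def dedupe(items, key_func):
    """Keep the first item seen for each key; record one key per arrival."""
    store = {}   # key -> the one "photo" we kept
    refs = []    # key recorded for every arrival, in order
    for item in items:
        key = key_func(item)
        if key not in store:
            store[key] = item
        refs.append(key)
    return store, refs

def restore(store, refs):
    """Rebuild the arrival list from the stored items."""
    return [store[key] for key in refs]

# (name, birthday) pairs; Alice really does come to the party twice.
guests = [("Alice", "02-07"), ("Bob", "06-30"),
          ("Carol", "02-07"), ("Alice", "02-07")]

# Key space of 366: Carol collides with Alice and her photo is never taken.
by_birthday, birthday_refs = dedupe(guests, key_func=lambda g: g[1])

# Key space of 2^160: each distinct guest gets a distinct key in practice.
by_sha1, sha1_refs = dedupe(
    guests, key_func=lambda g: hashlib.sha1(repr(g).encode()).hexdigest())

print(len(by_birthday), restore(by_birthday, birthday_refs) == guests)
print(len(by_sha1), restore(by_sha1, sha1_refs) == guests)
```

With the birthday key the wall holds two photos and the "restore" puts
Alice where Carol should be; with the 160-bit key it holds three photos
and the guest list comes back intact.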

>>An absurdity which should show that even if it
>>stopped at eight bits, one short of the bits required to hold 1-366,
>>there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
>>30 all represented by the same code, in which case you can't figure out
>>if people in the room have the same birthday.

Again, I hope you read what I wrote above.  In the analogy, we're not
de-duping birthdays; we're de-duping people BASED on their birthdays.
(Which would be a dumb idea because the key space is too small: 366)

>>What you must grasp is that it is *impossible* to
>>represent/re-create/look up the values of 2^65536 bits in fewer than
>>2^65536 bits--unless you concede that each checksum/hash/fingerprint
>>will represent many different values of the original data--any more than
>>you can represent three bits of data with two.

I concede, I concede!  The only point I'm trying to make is what are the
odds that two different blocks of data will have the same hash (i.e. a
hash collision) in a given data center.
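
The point being conceded here is the pigeonhole principle, and the
"three bits into two" case is small enough to enumerate. A toy sketch,
using `value mod 2^bits` as a stand-in hash (an assumption made purely
for illustration; no real product hashes this way):

```python
from collections import defaultdict

def bucket_by_hash(values, hash_bits):
    """Group values by a toy hash: the low `hash_bits` bits of the value."""
    buckets = defaultdict(list)
    for value in values:
        buckets[value % (1 << hash_bits)].append(value)
    return dict(buckets)

# Every 3-bit value (0..7) squeezed into a 2-bit key (0..3).
buckets = bucket_by_hash(range(8), hash_bits=2)
for key, values in sorted(buckets.items()):
    print(key, values)

# Same counting argument for 8 KB blocks (65536 bits) and a 160-bit hash:
# on average 2^(65536 - 160) possible blocks share each hash value.
blocks_per_hash = 2 ** (65536 - 160)
print(blocks_per_hash.bit_length() - 1)
```

Each 2-bit key ends up standing for two different 3-bit values, which
is the concession. The open question in the thread is not whether such
sharing exists but how likely two blocks that share a hash are to both
turn up in one data center.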

>>Hashing is a technique for saving time in certain circumstances.  It is
>>valueless in re-creating (and a lookup is a re-creation) original data
>>when those data can have unlimited arbitrary values.  All the blog
>>hand-waving about decimal places, Zetabytes and the specious comparison
>>to undetected write errors will not change that.

This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe.  Fine!  That's why there are vendors that don't use hashes to
de-dupe data.  Buy one of those instead.  Some use a weak hint +
bit-level verify, some use delta-differencing technologies, which are
bit-to-bit comparisons as well.
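
To make the "weak hint + bit-level verify" idea concrete, here is a
minimal sketch. It is not any vendor's implementation: the
`VerifyingStore` class is hypothetical, and the 8-bit hint taken from
CRC-32 is deliberately weak so that hint collisions are guaranteed and
the byte-for-byte comparison has to do the real work.

```python
import zlib

class VerifyingStore:
    """De-dupe store that never trusts the hint alone."""

    def __init__(self):
        self._buckets = {}  # hint -> list of distinct blocks with that hint

    def put(self, block: bytes):
        """Store a block and return a reference (hint, index)."""
        hint = zlib.crc32(block) & 0xFF  # weak 8-bit hint
        bucket = self._buckets.setdefault(hint, [])
        for index, existing in enumerate(bucket):
            if existing == block:        # full byte-for-byte verify
                return (hint, index)     # true duplicate, reuse it
        bucket.append(block)             # same hint, different data
        return (hint, len(bucket) - 1)

    def get(self, ref):
        hint, index = ref
        return self._buckets[hint][index]

    def unique_blocks(self) -> int:
        return sum(len(bucket) for bucket in self._buckets.values())

store = VerifyingStore()
blocks = [i.to_bytes(4, "big") for i in range(1000)]  # 1000 distinct blocks
refs = [store.put(block) for block in blocks]
duplicate_ref = store.put(blocks[0])                  # re-send the first one

print(store.unique_blocks())
print(all(store.get(ref) == block for ref, block in zip(refs, blocks)))
print(duplicate_ref == refs[0])
```

With only 256 possible hints and 1000 distinct blocks, hint collisions
are unavoidable, yet every block comes back intact because a match is
only declared after the bytes compare equal. The price is reading the
stored block for every candidate match.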
