Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "Curtis Preston" <cpreston AT glasshouse DOT com>
To: <bob944 AT attglobal DOT net>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Wed, 26 Sep 2007 11:03:43 -0400
Pls read my other post about the odds of this happening.  With a decent
key space, the odds of a hash collision with a 160-bit hash are so
small that any statistician would call them zero:  1 in 2^160.  Do you
know how big that number is?  It's a whole lot bigger than it looks.
And those odds are significantly better than the odds that you would
write a bad block of data to a regular disk drive and never know it.
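
For scale, here is a quick back-of-the-envelope sketch of the two rates
being compared:  the per-pair collision odds are the 1 in 2^160 above,
and the 1-in-100-trillion unrecoverable-read-error figure is the one
quoted later in this thread.

    # Rough arithmetic only: 160-bit hash collision odds for two chunks
    # versus a typical disk's quoted unrecoverable read error rate.
    pair_collision = 2.0**-160     # 1 in 2^160, two specific chunks colliding
    disk_read_error = 1e-14        # ~1 unrecoverable error per 100 trillion bits

    print(f"hash collision, two chunks : {pair_collision:.3e}")
    print(f"disk unrecoverable read    : {disk_read_error:.3e}")
    print(f"ratio (disk / collision)   : {disk_read_error / pair_collision:.3e}")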

---
W. Curtis Preston
Backup Blog @ www.backupcentral.com
VP Data Protection, GlassHouse Technologies 

-----Original Message-----
From: bob944 [mailto:bob944 AT attglobal DOT net] 
Sent: Wednesday, September 26, 2007 4:03 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Cc: Curtis Preston
Subject: RE: [Veritas-bu] Tapeless backup environments?

cpreston:
> >Simplistically, it checksums the "block" and looks in a table of
> >checksums-of-"blocks"-that-it-already-stores to see if the identical
> ><ahem, anyone see a hole here?> data already lives there.  
> 
> To what hole do you refer? 

The idea that N bits of data can unambiguously be represented by fewer
than N bits.  Anyone who claims to the contrary might as well knock out
perpetual motion, antigravity and faster-than-light travel while they're
on a roll.

> I see one in your simplistic example, but
> not in what actually happens (which requires a much longer technical
> explanation).

Hence my introduction that began with "[s]implistically."  But throw in
all the "much longer technical explanation" you like, any process which
compares a reduction-of-data to another reduction-of-data will sooner or
later return "foo" when what was originally stored was "bar."


cpreston:
> There are no products in the market that rely solely on a checksum to
> identify redundant data.  There are a few that rely solely on 
> a 160-bit
> hash, which is significantly larger than a checksum (typically 12-16

It doesn't matter.  The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.
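
A rough birthday-bound sketch (the chunk counts are illustrative
assumptions) makes the same point:  a 16-bit checksum collides almost
immediately, a 160-bit hash collides astronomically rarely, and
neither probability is ever zero.

    # Birthday bound: expected collision probability as a function of
    # fingerprint width.  Chunk counts here are illustrative assumptions.
    def collision_probability(n_chunks: int, bits: int) -> float:
        """Approximate P(any two of n chunks share a fingerprint of this width)."""
        return min(n_chunks * (n_chunks - 1) / 2 / 2**bits, 1.0)

    for bits in (16, 32, 128, 160):
        for n in (1_000, 1_000_000, 10**9):
            print(f"{bits:3d}-bit fingerprint, {n:>13,d} chunks: "
                  f"P(collision) ~ {collision_probability(n, bits):.3e}")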

> [...] The ability to forcibly create a hash collision means 
> absolutely nothing in the context of deduplication.

Of course it does.  Most examples in the literature concern storing
crafted-data-pattern-A ("pay me one dollar") in order for the data to be
read later as something different ("pay me one million dollars").  It
can't have escaped your attention that every day, some yahoo crafts
another buffer-or-stack overflow exploit; some of them are brilliant.
The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.

> What matters is the chance that two
> random chunks would have a hash collision. With a 128-bit and 160-bit
> key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
> 2160 with SHA-1. That's 1038 and 1048, respectively. If you 

Grasshopper, the wisdom is not in the numbers, it is in remembering that
HTML will not paste into ASCII well.  But I suspect you mean "one in
2^128" or similar.

Those are impressive, and dare I guess, vendor-supplied, numbers.  And
they're meaningless.  We do not care about the odds that a particular
block "the quick brown fox jumps over the lazy dog"
checksums/hashes/fingerprints to the same value as another particular
block "now is the time for all good men to come to the aid of their
party."  Of _course_ that will be astronomically unlikely, and with
sufficient hand-waving (to quote your article:  "the odds of a hash
collision with two random chunks are roughly
1,461,501,637,330,900,000,000,000,000 times greater than the number of
bytes in the known computing universe") these totally meaningless
numbers can seem important.

They're not.  What _is_ important?  To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint.  I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database.  "Probably, it won't happen" is not acceptable.

> Let's compare those odds with the odds of an unrecoverable 
> read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison.  In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

> If you want to talk about the odds of something bad happening and not
> knowing it, keep using tape. Everyone who has worked with tape for any
> length of time has experienced a tape drive writing something that it
> then couldn't read.

That's not news, and it's why we've been making copies of data for, oh, 50
years or so.

> Compare that to successful deduplication disk
> restores. According to Avamar Technologies Inc. (recently acquired by
> EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu