Re: [Veritas-bu] Tapeless backup environments?

It's interesting that the probability of any 2 randomly selected hashes 
being the same is quoted, rather than the probability that at least 2 
out of a whole group are the same. That's probably because the minutely 
small chance becomes rather bigger when you consider many hashes. This 
will still be small, but I suspect not as reassuringly small.

To illustrate this, consider the 'birthday paradox'. How many people do 
you need in a room to have at least a 50% chance that 2 of them have the 
same birthday? The chance of any 2 randomly chosen people sharing the 
same birthday is 1/365 (neglecting leap years). That's quite small, so we 
need a lot of people to get a 50% chance, right? Wrong. You need 23 
people. Google for 'birthday paradox' for the simple maths.
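
For anyone who would rather run the numbers than Google them, here is a 
minimal sketch of the arithmetic in Python. The function names are my own, 
and the hash version assumes hash values are uniformly random and 
independent, which real data (and certainly crafted data) need not obey. 
It uses the standard birthday approximation 1 - exp(-n(n-1)/(2*2^bits)), 
so treat it as an estimate, not a statement about any particular product.

```python
import math


def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Exact chance that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group_for(threshold: float, days: int = 365) -> int:
    """Smallest group size whose collision chance reaches `threshold`."""
    n = 1
    while birthday_collision_probability(n, days) < threshold:
        n += 1
    return n


def hash_collision_probability(chunks: int, bits: int) -> float:
    """Approximate chance that at least two of `chunks` hashes collide.

    Assumption: hashes are uniform over 2**bits values (birthday bound).
    """
    pairs = chunks * (chunks - 1) // 2      # number of distinct pairs
    expected = pairs / 2**bits              # expected colliding pairs
    return -math.expm1(-expected)           # 1 - exp(-expected), accurately


if __name__ == "__main__":
    print(smallest_group_for(0.5))          # prints 23
    # One pair versus a whole group, for a 160-bit hash:
    print(hash_collision_probability(2, 160))
    print(hash_collision_probability(10**12, 160))
```

The last two lines put the quoted "any 2 hashes" figure next to the 
figure for a whole store of chunks; substitute your own chunk count and 
hash width.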

For our data I would certainly not use de-duping, even if it did work 
well on image data.


bob944 wrote:
> cpreston:
>>> Simplistically, it checksums the "block" and looks in a table of
>>> checksums-of-"blocks"-that-it-already-stores to see if the identical
>>> <ahem, anyone see a hole here?> data already lives there.  
>> To what hole do you refer? 
> 
> The idea that N bits of data can unambiguously be represented by fewer
> than N bits.  Anyone who claims to the contrary might as well knock out
> perpetual motion, antigravity and faster-than-light travel while they're
> on a roll.
> 
>> I see one in your simplistic example, but
>> not in what actually happens (which requires a much longer technical
>> explanation).
> 
> Hence my introduction that began with "[s]implistically."  But throw in
> all the "much longer technical explanation" you like, any process which
> compares a reduction-of-data to another reduction-of-data will sooner or
> later return "foo" when what was originally stored was "bar."
> 
> 
> cpreston:
>> There are no products in the market that rely solely on a checksum to
>> identify redundant data.  There are a few that rely solely on a 160-bit
>> hash, which is significantly larger than a checksum (typically 12-16
> 
> No importa.  The length of the checksum/hash/fingerprint and the
> sophistication of its algorithm only affect how frequently--not
> whether--the incorrect answer is generated.
> 
>> [...] The ability to forcibly create a hash collision means 
>> absolutely nothing in the context of deduplication.
> 
> Of course it does.  Most examples in the literature concern storing
> crafted-data-pattern-A ("pay me one dollar") in order for the data to be
> read later as something different ("pay me one million dollars").  It
> can't have escaped your attention that every day, some yahoo crafts
> another buffer-or-stack overflow exploit; some of them are brilliant.
> The notion that the bad guys will never figure out a way to plant a
> silent data-change based on checksum/hash/fingerprint collisions is,
> IMO, naive.
> 
>> What matters is the chance that two
>> random chunks would have a hash collision. With a 128-bit and 160-bit
>> key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
>> 2160 with SHA-1. That's 1038 and 1048, respectively. If you 
> 
> Grasshopper, the wisdom is not in the numbers, it is in remembering that
> HTML will not paste into ASCII well.  But I suspect you mean "one in
> 2^128" or similar.
> 
> Those are impressive, and dare I guess, vendor-supplied, numbers.  And
> they're meaningless.  We do not care about the odds that a particular
> block "the quick brown fox jumps over the lazy dog"
> checksums/hashes/fingerprints to the same value as another particular
> block "now is the time for all good men to come to the aid of their
> party."  Of _course_ that will be astronomically unlikely, and with
> sufficient hand-waving (to quote your article:  "the odds of a hash
> collision with two random chunks are roughly
> 1,461,501,637,330,900,000,000,000,000 times greater than the number of
> bytes in the known computing universe") these totally meaningless
> numbers can seem important.
> 
> They're not.  What _is_ important?  To me, it's important that if I read
> back any of the N terabytes of data I might store this week, I get the
> same data that was written, not a silently changed version because the
> checksum/hash/fingerprint of one block that I wrote collides with
> another checksum/hash/fingerprint.  I can NOT have that happen to any
> block--in a file clerk's .pst, a directory inode or the finance
> database.  "Probably, it won't happen" is not acceptable.
> 
>> Let's compare those odds with the odds of an unrecoverable 
>> read error on a typical disk--approximately 1 in 100 trillion
> 
> Bogus comparison.  In this straw man, that 1/100,000,000,000,000 read
> error a) probably doesn't affect anything because of the higher-level
> RAID array it's in and b) if it does, there's an error, a
> we-could-not-read-this-data, you-can't-proceed, stop, fail,
> get-it-from-another-source error--NOT a silent changing of the data from
> foo to bar on every read with no indication that it isn't the data that
> was written.
> 
>> If you want to talk about the odds of something bad happening and not
>> knowing it, keep using tape. Everyone who has worked with tape for any
>> length of time has experienced a tape drive writing something that it
>> then couldn't read.
> 
> That's not news, and why we've been making copies of data for, oh, 50
> years or so.
> 
>> Compare that to successful deduplication disk
>> restores. According to Avamar Technologies Inc. (recently acquired by
>> EMC Corp.), none of its customers has ever had a failed restore.
> 
> Now _there's_ an unbiased source.
> 
> 
> 

-- 
Do you want a picture of your brain - volunteer for a brain scan!
http://www.fil.ion.ucl.ac.uk/Volunteers/

Computer systems go wrong - even backup systems
Be paranoid!

Chris Freemantle, Data Manager
Wellcome Trust Centre for Neuroimaging
+44 (0)207 833 7496
www.fil.ion.ucl.ac.uk
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu