Veritas-bu

Re: [Veritas-bu] Tapeless backup environments?

2007-09-26 10:40:21
From: "Jeff Lightner" <jlightner AT water DOT com>
To: <bob944 AT attglobal DOT net>, <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Wed, 26 Sep 2007 09:58:12 -0400
Most of this, while well documented, seems to boil down to the same
alarmist notion that had people trying to ban cell phones in gas
stations.  The possibility that something untoward COULD happen does NOT
mean it WILL happen.  To date I don't know of a single gas pump
explosion or car fire that was traced to cell phone usage at the pump.
Oddly enough, though, no one monitors gas pumps to be sure users aren't
re-entering their vehicles, and fires HAVE been traced to static
electricity caused by that.

If odds are so important, it seems it would be important to worry about
the odds that your data center, your offsite storage location and your
Disaster Recovery site will all be taken out at the same time.

I also suggest the argument is flawed because it seems to imply that
only the cksum is stored and not the actual data.  It is the original
compressed data AND the cksum that result in the restore, not the cksum
alone.

-----Original Message-----
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of bob944
Sent: Wednesday, September 26, 2007 4:03 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

cpreston:
> >Simplistically, it checksums the "block" and looks in a table of
> >checksums-of-"blocks"-that-it-already-stores to see if the identical
> ><ahem, anyone see a hole here?> data already lives there.  
> 
> To what hole do you refer? 

The idea that N bits of data can unambiguously be represented by fewer
than N bits.  Anyone who claims to the contrary might as well knock out
perpetual motion, antigravity and faster-than-light travel while they're
on a roll.
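
The counting argument above can be sketched in a few lines of Python.
This is only an illustration: the 8-bit fingerprint (the first byte of a
SHA-1 digest) and the names tiny_fingerprint and first_collision are
hypothetical, chosen so that a collision is guaranteed within 257
distinct blocks.  Real products use far longer hashes.

```python
import hashlib

def tiny_fingerprint(block: bytes) -> int:
    """Hypothetical 8-bit fingerprint: first byte of the block's SHA-1 digest."""
    return hashlib.sha1(block).digest()[0]

def first_collision(blocks):
    """Return the first pair of distinct blocks sharing a fingerprint, or None."""
    seen = {}  # fingerprint -> first block seen with that fingerprint
    for block in blocks:
        fp = tiny_fingerprint(block)
        if fp in seen and seen[fp] != block:
            return seen[fp], block
        seen[fp] = block
    return None

# 257 distinct blocks but only 256 possible fingerprints:
# at least two blocks must share one (pigeonhole principle).
blocks = [f"block-{i}".encode() for i in range(257)]
pair = first_collision(blocks)
print(pair)
```

A longer fingerprint pushes the guaranteed collision out to 2^bits + 1
distinct blocks; it does not remove it.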

> I see one in your simplistic example, but
> not in what actually happens (which requires a much longer technical
> explanation).

Hence my introduction that began with "[s]implistically."  But throw in
all the "much longer technical explanation" you like, any process which
compares a reduction-of-data to another reduction-of-data will sooner or
later return "foo" when what was originally stored was "bar."
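
A minimal sketch of the "foo comes back as bar" failure, assuming a toy
store that deduplicates on the fingerprint alone.  ToyDedupStore, its
fingerprint_bytes parameter and find_silent_corruption are hypothetical
names for illustration and do not describe any vendor's design.  The
fingerprint is cut to one byte so the effect shows up immediately.

```python
import hashlib

class ToyDedupStore:
    """Toy store that trusts the fingerprint alone (illustrative only)."""

    def __init__(self, fingerprint_bytes: int = 1):
        self.fingerprint_bytes = fingerprint_bytes
        self.chunks = {}  # fingerprint -> first chunk stored under it

    def _fingerprint(self, chunk: bytes) -> bytes:
        # Truncated SHA-1; the truncation length is an assumption of this sketch.
        return hashlib.sha1(chunk).digest()[: self.fingerprint_bytes]

    def write(self, chunk: bytes) -> bytes:
        fp = self._fingerprint(chunk)
        # Fingerprint already in the table: assume identical data, store nothing.
        self.chunks.setdefault(fp, chunk)
        return fp

    def read(self, fp: bytes) -> bytes:
        return self.chunks[fp]

def find_silent_corruption(store, count=1000):
    """Write distinct chunks; return (written, read_back) for the first mismatch."""
    for i in range(count):
        chunk = f"chunk-{i}".encode()
        fp = store.write(chunk)
        read_back = store.read(fp)
        if read_back != chunk:
            return chunk, read_back
    return None

mismatch = find_silent_corruption(ToyDedupStore(fingerprint_bytes=1))
print(mismatch)
```

The read succeeds and raises no error; it simply returns a different
chunk from the one that was written.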


cpreston:
> There are no products in the market that rely solely on a checksum to
> identify redundant data.  There are a few that rely solely on 
> a 160-bit
> hash, which is significantly larger than a checksum (typically 12-16

No importa.  The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.

> [...] The ability to forcibly create a hash collision means 
> absolutely nothing in the context of deduplication.

Of course it does.  Most examples in the literature concern storing
crafted-data-pattern-A ("pay me one dollar") in order for the data to be
read later as something different ("pay me one million dollars").  It
can't have escaped your attention that every day, some yahoo crafts
another buffer-or-stack overflow exploit; some of them are brilliant.
The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.

> What matters is the chance that two
> random chunks would have a hash collision. With a 128-bit and 160-bit
> key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
> 2160 with SHA-1. That's 1038 and 1048, respectively. If you 

Grasshopper, the wisdom is not in the numbers, it is in remembering that
HTML will not paste into ASCII well.  But I suspect you mean "one in
2^128" or similar.

Those are impressive, and dare I guess, vendor-supplied, numbers.  And
they're meaningless.  We do not care about the odds that a particular
block "the quick brown fox jumps over the lazy dog"
checksums/hashes/fingerprints to the same value as another particular
block "now is the time for all good men to come to the aid of their
party."  Of _course_ that will be astronomically unlikely, and with
sufficient hand-waving (to quote your article:  "the odds of a hash
collision with two random chunks are roughly
1,461,501,637,330,900,000,000,000,000 times greater than the number of
bytes in the known computing universe") these totally meaningless
numbers can seem important.

They're not.  What _is_ important?  To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint.  I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database.  "Probably, it won't happen" is not acceptable.
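
The quantity described above is the chance that any two chunks in the
whole store collide, not one particular pair.  A rough way to estimate
it is the standard birthday approximation, sketched below.  The 100 TB
volume and 8 KiB chunk size are made-up example figures, and the formula
assumes hash outputs behave as uniformly random values, so it says
nothing about deliberately crafted collisions.

```python
import math

def any_collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday approximation: P(any collision) ~= 1 - exp(-n(n-1) / 2^(b+1))."""
    exponent = num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-exponent)

def chunks_for(total_bytes: int, chunk_bytes: int) -> int:
    """Number of whole chunks in a given volume of data."""
    return total_bytes // chunk_bytes

# Assumed example: 100 TB of data split into 8 KiB chunks.
n = chunks_for(100 * 10**12, 8 * 1024)
for bits in (16, 128, 160):
    print(bits, any_collision_probability(n, bits))
```

The result grows roughly with the square of the number of chunks stored,
so it is always larger than the single-pair figure of 1 in 2^bits, and
it is never exactly zero for more than one chunk.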

> Let's compare those odds with the odds of an unrecoverable 
> read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison.  In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

> If you want to talk about the odds of something bad happening and not
> knowing it, keep using tape. Everyone who has worked with tape for any
> length of time has experienced a tape drive writing something that it
> then couldn't read.

That's not news, and why we've been making copies of data for, oh, 50
years or so.

> Compare that to successful deduplication disk
> restores. According to Avamar Technologies Inc. (recently acquired by
> EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu