Re: [Veritas-bu] Tapeless backup environments?

2007-09-26 16:11:31
Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "bob944" <bob944 AT attglobal DOT net>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Wed, 26 Sep 2007 15:51:52 -0400
> Most of this while well documented seems to boil down to the same
> alarmist notion that had people trying to ban cell phones in gas
> stations.  The possibility that something untoward COULD happen does
> NOT mean it WILL happen.  To date I don't know of a single gas pump

I can't speak for car fires, but I can speak for
checksums/hashes/fingerprints mapping to more than one set of data.
It's been demonstrated.  It happens.  It _has_ to happen:  a hash is
smaller than the data it stands for, so there are more possible inputs
than possible hashes.  That's the way these data reductions work, and
the reason it's more convenient to refer to small hashes of data rather
than the full data for many uses--this has been a programming
commonplace since the '50s.  But
programmers know it's not a two-way street:  a set of data generates
only one checksum/hash/fingerprint, but one checksum/hash/fingerprint
maps to more than one set of data.  And that's fine, for a program that
takes this into account (either because it doesn't matter to the
program's logic or because a secondary step checks the data).  As a
trivial example, reducing three-bit data to a two-bit checksum leaves
eight possible values sharing four checksums, two per checksum if they
spread evenly, so trying to go backwards will retrieve the wrong
three-bit data 50% of the time.
Bigger hashes and more sophisticated algorithms reduce the number of
times you get the wrong data; they don't eliminate it.
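
To make the three-bit example concrete, here is a minimal Python sketch.
The checksum2() function is a made-up checksum (it just keeps the low
two bits) and reverse() is an assumed naive "go backwards" lookup;
neither comes from any real product.

```python
from collections import defaultdict


def checksum2(value: int) -> int:
    """Hypothetical 2-bit checksum: keep the low two bits of a 3-bit value."""
    return value & 0b11


# Group all eight 3-bit values by the 2-bit checksum they produce.
preimages = defaultdict(list)
for value in range(8):
    preimages[checksum2(value)].append(value)


def reverse(checksum: int) -> int:
    """Go backwards naively: return the first value known for this checksum."""
    return preimages[checksum][0]


# Count how many of the eight values do NOT survive the round trip.
wrong = sum(1 for value in range(8) if reverse(checksum2(value)) != value)
wrong_fraction = wrong / 8

print(dict(preimages))
print(wrong_fraction)
```

Each checksum ends up owning two values, and the reverse lookup can only
ever hand back one of them.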

> If odds are so important it seems it would be important to worry about
> the odds that your data center, your offsite storage location and your
> Disaster Recovery site will all be taken out at the same time.

And if it's not important that the data you read may not be what was
written, don't let me stop you.  _The odds are_ that it'll be okay.  

> I also suggest the argument is flawed because it seems to imply that
> only the cksum is stored and not the actual data - it is the original
> compressed data AND the cksum that result in the restore - not the
> cksum alone.

If I get your meaning, you have an incorrect understanding of the
argument--nobody is talking about generating the original data from a
checksum.  As I said in what you quoted (trimmed here), every unique (as
determined by the implementation) "block" of data gets stored, once.  A
data stream is stored as a list of pointers or
checksums/hashes/fingerprints which refer to those common-storage
"blocks".  Any number of data streams will point to the same "block"
when they have it in common, and as many times as that "block" occurs in
their data stream.  To read the data stream later, the list of pointers
tells the implementation which "blocks" to retrieve and send back to the
file reader.  Now, if "foo" and "bar" both reduced to the same
checksum/hash/fingerprint when stored, only one of them was actually
kept, and somebody is going to receive the wrong data when the stream(s)
that had those data are read.  So sorry about that corrupted payroll
master file...
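
The store-and-read path described above can be sketched roughly as
follows.  Everything here is hypothetical:  DedupStore,
weak_fingerprint() and the verify flag are illustrative names, and the
16-bit additive checksum is deliberately weak so that a collision is
easy to produce.  Real products use far larger hashes, which makes a
collision much rarer but does not rule it out.

```python
def weak_fingerprint(block: bytes) -> int:
    """Deliberately weak 16-bit additive checksum (illustration only)."""
    return sum(block) % 65536


class DedupStore:
    """Toy single-instance store: one stored block per fingerprint."""

    def __init__(self, verify: bool = False):
        self.blocks = {}      # fingerprint -> the one block stored for it
        self.verify = verify  # if True, compare data when fingerprints match

    def write(self, stream):
        """Store each unique block once; return the stream as fingerprints."""
        pointers = []
        for block in stream:
            fp = weak_fingerprint(block)
            existing = self.blocks.get(fp)
            if existing is None:
                self.blocks[fp] = block
            elif self.verify and existing != block:
                # Secondary check: same fingerprint, different data.
                raise ValueError("fingerprint collision")
            pointers.append(fp)
        return pointers

    def read(self, pointers):
        """Rebuild a stream by following its list of fingerprints."""
        return [self.blocks[fp] for fp in pointers]


store = DedupStore()
first = store.write([b"payroll=1200"])
second = store.write([b"payroll=2100"])  # same bytes reordered, same sum

restored = store.read(second)
print(restored)  # the block written by the FIRST stream comes back
```

With verify=True the second write raises instead of silently sharing
the block, which is the "secondary step checks the data" case.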


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu