Subject: Re: [Veritas-bu] Tapeless backup environments?
From: "bob944" <bob944 AT attglobal DOT net>
To: <veritas-bu AT mailman.eng.auburn DOT edu>
Date: Wed, 26 Sep 2007 17:15:08 -0400
> On Wed, Sep 26, 2007 at 04:02:49AM -0400, bob944 wrote:
> > Bogus comparison.  In this straw man, that 
> > 1/100,000,000,000,000 read error a) probably doesn't
> > affect anything because of the higher-level RAID array
> > it's in and b) if it does, there's an error, a
> > we-could-not-read-this-data, you-can't-proceed, stop,
> > fail, get-it-from-another-source error--NOT a silent
> > changing of the data from foo to bar on every read
> > with no indication that it isn't the data that
> > was written.
> 
> While I find the "compare only based on hash" a bit annoying
> for other reasons, the argument above doesn't convince me.
> 
> Disks, controllers, and yes RAID arrays can fail silently in
> all sorts of ways by either acknowledging a write that is not
> done, writing to the wrong location, reading from the wrong
> location, or reading blocks where only some of the data came
> from the correct location.  Most RAID systems do not verify
> data on read to protect against silent data errors on the
> storage, only against obvious failures.

Perhaps anything can have a failure mode in which it doesn't alert--but
in a previous lifetime in hardware and some design work, I saw exactly
one undetected data transformation that did not crash something or
otherwise cause obvious problems (an intermittent gate in a mainframe
adder that didn't affect any instructions the OS used).

I don't remember a disk that didn't maintain, compare, and _use for
error detection_ the cylinder, head, and sector numbers in its format.
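
As a minimal sketch of that mechanism in Python (invented names, a toy
model rather than any drive's firmware): record the target address in
the on-disk format and check it on every read, so a misdirected read
becomes a hard error instead of silently returned wrong data.

    class MisdirectedReadError(Exception):
        pass

    class Disk:
        """Toy model: each sector carries its own address, as the old
        cylinder/head/sector format fields did."""
        def __init__(self):
            self._media = {}  # physical location -> (stored address, payload)

        def write(self, address, payload):
            # The format records the target address alongside the data.
            self._media[address] = (address, payload)

        def read(self, address, actual_location=None):
            # actual_location simulates the heads landing in the wrong place.
            location = address if actual_location is None else actual_location
            stored_address, payload = self._media[location]
            if stored_address != address:
                # The address compare turns a misdirected read into a
                # we-could-not-read-this-data error, not wrong bytes.
                raise MisdirectedReadError(
                    "wanted %d, media holds %d" % (address, stored_address))
            return payload

    disk = Disk()
    disk.write(7, b"foo")
    disk.write(8, b"bar")
    assert disk.read(7) == b"foo"
    try:
        disk.read(7, actual_location=8)   # seek went to the wrong sector
    except MisdirectedReadError as err:
        print("caught:", err)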

The write frailties mentioned, if they occur, will fail on the next
read.  And the read frailties mentioned will generally (homage paid to
the mainframe example I cited as the _only_ exception I ever saw) cause
enough mayhem that apps, data, or systems go belly-up in a big way,
fast.

These events, like double-bit parity errors or EDAC failures, require
1.  that something break in the first place,
2.  that the failure not be reported, and
3.  that the effects be so subtle that they go unnoticed (the app or
system doesn't crash, the data aren't wildly corrupted, ...).
All three have to hold at once, so the odds multiply; a
back-of-the-envelope sketch follows this list.
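
To put the shape of that argument in code (the rates below are invented
purely for illustration, not measured failure rates): the conditions
are compounding, so the probability of silent corruption shrinks as
their product.

    # Back-of-the-envelope only; made-up numbers to show the shape of
    # the argument, not measurements.
    p_break      = 1e-9   # 1. something breaks in the first place
    p_unreported = 1e-3   # 2. ...and the failure is not reported
    p_subtle     = 1e-2   # 3. ...and the effects are too subtle to notice
    print(p_break * p_unreported * p_subtle)   # 1e-14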

The problem with checksumming/hashing/fingerprinting is that the errors
are designed into the methodology: any fixed-size hash must map many
different blocks to the same fingerprint.  An implementation with no
add-on logic to prevent or detect those collisions (a byte-for-byte
compare on every fingerprint match, say) will silently corrupt data.
That's totally different, IMO.
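
To make that concrete, here is a minimal sketch in Python--a
hypothetical store, not NetBackup's or any vendor's internals--of the
difference between trusting the fingerprint alone and adding a byte
compare.  The digest is deliberately truncated to 16 bits so the
designed-in collision shows up in seconds; a full-length digest only
lowers the odds, it doesn't change the failure mode.

    import hashlib

    def fingerprint(block):
        # Truncated on purpose to make a collision easy to produce.
        return hashlib.sha256(block).hexdigest()[:4]

    class HashOnlyStore:
        """Dedup store that trusts the fingerprint alone."""
        def __init__(self):
            self._blocks = {}

        def put(self, block):
            fp = fingerprint(block)
            # If the fingerprint is already known, the new block is
            # assumed identical and thrown away -- no error, no warning.
            self._blocks.setdefault(fp, block)
            return fp

        def get(self, fp):
            return self._blocks[fp]

    class VerifyingStore(HashOnlyStore):
        """Same store, plus the add-on logic: byte compare on a match."""
        def put(self, block):
            fp = fingerprint(block)
            existing = self._blocks.get(fp)
            if existing is not None and existing != block:
                # The designed-in error becomes a loud, detectable one.
                raise ValueError("fingerprint collision on " + fp)
            self._blocks[fp] = block
            return fp

    # Brute-force two distinct blocks with the same truncated fingerprint.
    seen, pair, i = {}, None, 0
    while pair is None:
        blk = ("block-%d" % i).encode()
        fp = fingerprint(blk)
        if fp in seen:
            pair = (seen[fp], blk)
        seen[fp] = blk
        i += 1

    a, b = pair
    store = HashOnlyStore()
    fp = store.put(a)
    store.put(b)                # silently deduplicated against a
    assert store.get(fp) == a   # whoever wrote b reads back a's bytes:
                                # foo became bar, with no indication

    v = VerifyingStore()
    v.put(a)
    try:
        v.put(b)
    except ValueError as err:
        print("caught:", err)   # the corruption is detected instead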


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu