BackupPC-users

Re: [BackupPC-users] Woe is me

2010-05-01 11:19:40
From: John Rouillard <rouilj-backuppc AT renesys DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sat, 1 May 2010 15:17:56 +0000
On Sat, May 01, 2010 at 09:45:03AM -0500, Les Mikesell wrote:
> Andrew Schulman wrote:
> >> I just logged in 
> >> to do my Sysadmin checks and found 3 bloody disks have failed totally 
> >> ruining my BackupPC filesystem.
> > 
> > The odds against that are overwhelming.  Power surge?  Are all
> > three disks of the same age and lot?
> 
> Things like that aren't as unusual as you might think.  What often
> happens is that unused parts of some of the disks go bad and aren't
> noticed immediately - then when a used area on one does have a
> problem it tries to rebuild on a hot spare but in the process of
> building parity across all the sectors it hits the bad spots on the
> other drives.  Or, if it is really a controller problem it can
> affect several disks at once.

Yup. I just had to do a rebuild and tell it to ignore ECC errors on
the remaining drives to get a "successful" rebuild. This is with a
3ware RAID controller with scheduled scrubbing of the disks, which
should catch this sort of stuff. Sadly, RAID X != backups.

The one time I had a data loss experience, it was due to a failed disk
followed by a failed rebuild that wasn't detected by the monitoring
software. So when the second drive failed (on a RAID 5), we were done.

However, one thing to check when the rebuild is done is that the disks
come from different build lots (usually you can tell the lot from the
first few digits of the serial numbers of the disks). I once got a
CLARiiON RAID unit with all (48??) disks from the same lot. We had 6
failures within 1 month because all of the disks shared the same
failure curve. Fortunately it was configured as multiple RAID 5 units,
so we managed to survive multiple concurrent disk failures, but it was
troubling to say the least.
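
If you want to automate the lot check, something like the following
rough Python sketch might do. It assumes the first 4 characters of the
serial number identify the lot, which is only a guess; the right
prefix length varies by vendor. The function names, the 50% threshold,
and the serial numbers are all made up for illustration.

```python
from collections import defaultdict


def group_by_lot(serials, prefix_len=4):
    """Group drive serial numbers by an ASSUMED lot prefix.

    prefix_len is a guess; check your vendor's serial number scheme.
    """
    lots = defaultdict(list)
    for serial in serials:
        prefix = serial.strip().upper()[:prefix_len]
        lots[prefix].append(serial)
    return dict(lots)


def crowded_lots(serials, prefix_len=4, max_share=0.5):
    """Return the prefixes that hold more than max_share of all drives."""
    lots = group_by_lot(serials, prefix_len)
    total = len(serials)
    return {
        prefix: members
        for prefix, members in lots.items()
        if total and len(members) / total > max_share
    }


if __name__ == "__main__":
    # Hypothetical serial numbers, not from any real unit.
    example = ["AB12X001", "AB12X002", "AB12X003", "CD34Y001"]
    for prefix, members in crowded_lots(example).items():
        print(f"{prefix}: {len(members)} of {len(example)} drives")
```

Feed it the serials for one array at a time (smartctl -i reports a
"Serial Number" line for each disk), and treat any prefix it prints as
a hint to go look at the drives, not as proof that they share a lot.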

I know one site that runs multiple RAID servers, and they take any new
RAID box and swap its drives into existing units. That scrambles
newer drives into every one of their RAID units and puts a mix of
older drives into the new unit. That's one way of cross-pollinating
the drives, I guess.

--
                                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

