Re: [BackupPC-users] Stop TrachClean / Return directories backup list

On 21/03/13 12:28, Phil Kennedy wrote:
> On 3/20/2013 9:12 PM, Holger Parplies wrote:
>> I've had that happen (except that I noticed before a drive broke) at least
>> once, and I remember that Les has also. From what I remember of his
>> explanation (please correct me if I'm wrong), two physical disks concurrently
>> positioning their heads can disturb each other (through vibration) in such a
>> way that one of them returns a read or write error and is kicked out of the
>> array without the drive actually being in any way defective. I *would*
>> consider this a shortcoming of Linux software RAID-1.
>>
>> As Adam wrote, you can easily monitor that. It still is a nuisance, though.
> As an aside, i've seen drives in other backuppc / software RAID 
> instances fail for no good reason, to the point that they pass long 
> smartctl test, yet mdadm is still convinced that the drive is bad. 
> Perhaps the vibration issue you've described was the culprit then?

This is perhaps getting a little off-topic for this list, but if you are
interested in these issues, I would suggest the linux-raid list has a
lot of very knowledgeable people with a lot to say about these sorts of
problems.
As just one possible explanation, you are using "cheap" drives without
properly configuring them.
That is, if the drive has a problem reading a sector, it will keep
retrying the read internally (it tries really hard). What usually
happens is that the controller or Linux driver times out while waiting
and asks the drive to reset, and so on. Eventually it decides the drive
is not responding (because the drive is still trying to read the sector
it had a problem with), and so the drive is kicked from the array as
failed.

There are two ways to resolve this: either tell the drive to give up
much more quickly (usually about 7 seconds or less), or tell Linux not
to be so impatient and to wait much longer for the drive to return the
failed read (a number of minutes). From memory, the first option only
works if the drive supports ERC (SCT Error Recovery Control, which some
vendors call TLER). On "RAID" or "Enterprise" drives, the default is
usually to time out a failed read within a few seconds, because the RAID
can then simply read that data from another drive.

Linux software RAID will notice the read failure and attempt to rewrite
the failed sector using data from the other drives. The rewrite will
either succeed in place, or the sector will be transparently relocated
by the drive. If the write fails, then the drive is kicked from the
array.
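
To make the two options concrete, here is a rough sketch (not something
from the original discussion) that just builds the shell command for
either approach. The function name and the default values of 70
deciseconds and 180 seconds are my own illustrative assumptions;
"smartctl -l scterc" and the /sys/block/<dev>/device/timeout file are the
usual smartmontools and Linux interfaces for this, but check them against
your own system before relying on them.

```python
def timeout_commands(device, supports_scterc,
                     erc_deciseconds=70, kernel_timeout_s=180):
    """Return the shell command(s) for one of the two fixes described above.

    device            -- block device path, e.g. "/dev/sda"
    supports_scterc   -- True if the drive accepts SCT ERC settings
    erc_deciseconds   -- drive-side limit in tenths of a second (assumed 70 = 7s)
    kernel_timeout_s  -- kernel-side command timeout in seconds (assumed 180)
    """
    name = device.rsplit("/", 1)[-1]  # "/dev/sda" -> "sda"
    if supports_scterc:
        # Option 1: tell the drive to give up on a bad sector quickly,
        # for both reads and writes.
        return [f"smartctl -l scterc,{erc_deciseconds},{erc_deciseconds} {device}"]
    # Option 2: the drive cannot be told to hurry, so make Linux wait longer.
    return [f"echo {kernel_timeout_s} > /sys/block/{name}/device/timeout"]


if __name__ == "__main__":
    for cmd in timeout_commands("/dev/sda", supports_scterc=True):
        print(cmd)
    for cmd in timeout_commands("/dev/sdb", supports_scterc=False):
        print(cmd)
```

The commands are only printed, not executed, so this is safe to run
anywhere. Neither setting is necessarily persistent, so in practice it
would probably need to be reapplied at boot.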

Search for keywords like URE (Unrecoverable Read Error) and ERC/TLER, or
just check the linux-raid mailing list; this issue comes up there
frequently.

I've *never* had drives being randomly kicked from an array except where
either the above was happening or there were SATA driver issues. In any
case, with proper monitoring, this is almost a non-event.
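
As an illustration of what such monitoring could look like, here is a
minimal sketch that scans the text of /proc/mdstat for arrays with a
missing member (an "_" in the "[UU]" status field). The function name and
the sample text are hypothetical, and in practice "mdadm --monitor" with
mail alerts is the more usual route; this is only meant to show how little
is needed.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for md arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            continue
        # The status field looks like [UU] (healthy) or [U_] (one member gone).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


# Hypothetical sample in the usual /proc/mdstat layout.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      1953382336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real system you would pass in the contents of /proc/mdstat and send
a mail (or similar) whenever the result is not empty.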

I'm not suggesting this was your issue, nor anybody else's, just
suggesting that this appears to be a much more common cause of perfectly
good drives being randomly kicked from a RAID array than "vibration"
issues.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au

