Subject: Re: [BackupPC-users] RAID and offsite
From: Les Mikesell <lesmikesell AT gmail DOT com>
To: backuppc-users AT lists.sourceforge DOT net
Date: Thu, 28 Apr 2011 23:15:52 -0500
On 4/28/11 9:50 PM, Holger Parplies wrote:
>
>>> [...]
>>> But, note that even though you don't technically have to stop/unmount
>>> the raid while doing the sync, realistically it doesn't perform well
>>> enough to do backups at the same time. I use a cron job to start the
>>> sync very early in the morning so it will complete before backups would
>>> start.
>
> How do you schedule the sync? (Or are you just talking about hot-adding the
> disk via cron?)

I have trayless hot-swap SATA bays and physically put the disk in the day
before, then have an 'mdadm --add ...' command in cron at about 3 am when the
backups are predictably complete.  The disk is recognized automatically when
inserted but isn't used until the mdadm command adds it.  Normally I break the
raid and remove it at the end of the day, but it doesn't really hurt to leave
it in as long as the sync completes before the nightly runs start.
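
The cron entry itself is nothing fancy - something along these lines, where
/dev/md0 and /dev/sdc1 stand in for whatever your array and offsite partition
are actually called:

  # /etc/cron.d/offsite-sync (illustrative)
  # re-add the offsite partition at 3 am, after the nightly backups
  # have finished; md resyncs it in the background
  0 3 * * *  root  /sbin/mdadm /dev/md0 --add /dev/sdc1

And the reverse at the end of the day is just a fail-and-remove before
pulling the disk from the bay:

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1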

>> Take off the sdb drive, attach offsite one in its place
>
> Assuming your kernel/SATA-driver/SATA-chipset can handle hotswapping ...
> otherwise you'd need to reboot here.

Most do - although I do have a Promise card that doesn't.
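
Where the hardware does cope, the kernel side of a swap can be driven through
sysfs - roughly like this, with sdb and host1 standing in for whatever disk
and port your bay is on:

  # tell the kernel to release the outgoing disk before pulling it
  echo 1 > /sys/block/sdb/device/delete
  # most ahci setups notice the new disk by themselves on insert;
  # if not, force a rescan of the port
  echo '- - -' > /sys/class/scsi_host/host1/scan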

>> Use mdadm to add sdb1 to md0 and reconstruct
>>
>> Maybe cycle through whether I remove sda or sdb so all drives get used
>> about the same amount over time.
>
> I'm sure that's a point where we'll all disagree with each other :-).
>
> Personally, I wouldn't use a common set of disks for normal backup operation
> and offsite backups. BackupPC puts considerable wear on its pool disks. At
> some point in time, you'll either have failing disks or proactively want to
> replace disks before they start failing. Are you sure you want to think about
> failing pool disks and failing offsite backup disks at the same time (i.e.
> correlated)? I assume, failing pool disks are one of the things you want to
> protect against with offsite backups. So why use backup media that are likely
> to begin failing just when you'll need them?

I don't think there is anything predictable about disk failure.  Physically
handling them is probably what's bad for them.  Normal (even heavy) use
doesn't seem to matter unless maybe they overheat.

>> My main concerns were: can I remount and use md0 while it is rebuilding and
>> that there is no danger of the array rebuilding to the state of the newly
>> attached drive (I'm very paranoid).
>
> I can understand that. I used RAID 1 in one of my computers (root FS, system,
> data) for a time simply for the purpose of gaining experience with RAID 1. I
> didn't notice much (except for the noise of the additional disk) until one
> disk had some sort of problem. I don't remember the details, but I recall that
> I had expected the computer to boot unattended (well, the 'reboot' was
> manual ... or was it actually a crash that triggered the problem?), which it
> didn't. I think it brought up the *wrong* (i.e. faulty) disk of the mirror and
> failed on an fsck. Physically removing the faulty disk "corrected" the problem.
> Somewhat disappointing. What's more, *both* disks are now working flawlessly
> in separate computers, so I'm really clueless what the problem was in the
> first place. Sounds like a software error, much like in Jeffrey's case.

Grub doesn't know about raid and just happens to work with raid1 because it
treats the disk as a single drive.  What happens when booting the 2nd member
depends on how your bios treats the drive, whether bios and grub agree on the
device identifier after booting, and whether it was what you expected when you
installed grub on it (when you still had a working primary drive).  And back in
IDE days, a drive failure usually locked the controller, which might have had
another drive on the same cable.
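
The usual workaround is to install grub into the MBR of *both* members so
either disk can boot on its own.  With grub-legacy that's roughly this (sda
and sdb standing in for the two members):

  # each disk is mapped to (hd0) in turn, because that's what it will
  # be if it ends up as the only disk the bios sees
  grub --batch <<'EOF'
  device (hd0) /dev/sda
  root (hd0,0)
  setup (hd0)
  device (hd0) /dev/sdb
  root (hd0,0)
  setup (hd0)
  EOF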

> On the other hand, on the computers where it matters (servers, BackupPC),
> RAID 1 has been running for years without a real problem (I *have* seen RAID
> members dropped from an array without understandable reasons, but, mostly,
> re-adding them simply worked; more importantly, there was no interruption of
> service).

I've seen that too.  I think retries are much more aggressive on single disks
or the last one left in a raid than on a member that still has a mirror to
fall back on.
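
When a member does get dropped, putting it back is usually a one-liner
(example names again):

  # re-add the dropped partition; with a write-intent bitmap on the
  # array, only the blocks written since the drop get resynced
  mdadm /dev/md0 --re-add /dev/sdb1
  # watch the resync progress
  cat /proc/mdstat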

> I guess that simply means: test it before you rely on it working. Many people
> are using Linux RAID 1 in production environments, so it appears to work well
> enough, but there are no guarantees your specific
> software/kernel/driver/hardware combination will not trigger some unknown (or
> unfixed ;-) bug.

I had a machine with a couple of 4-year uptime runs (a Red Hat 7.3 box) where
several of the scsi drives failed and were hot-swapped and re-synced with no
surprises.  So unless something has broken in the software recently, I mostly
trust it.
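
If you'd rather keep verifying than trusting, md has a built-in scrub you can
run from cron (md0 as an example):

  # start a background consistency check; any mismatches found are
  # counted in /sys/block/md0/md/mismatch_cnt when it completes
  echo check > /sys/block/md0/md/sync_action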

> It *would* help to understand how RAID event counts and the Linux RAID
> implementation in general work. Has anyone got any pointers to good
> documentation?

I've never seen it get this wrong when auto-assembling at reboot (and I move 
disks around frequently and sometimes clone machines by splitting the mirrors 
into different machines), but it shouldn't matter in the BPC scenario because 
you are always manually telling it which partition to add to an already running 
array.
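
You can inspect the event counts yourself, for what it's worth - md compares
them at assembly time to decide which member is most current:

  # dump a member's raid superblock; the "Events" line is the counter
  # bumped on every array state change
  mdadm --examine /dev/sda1 | grep -i events
  # a member whose count is behind the others is stale and will be
  # resynced rather than trusted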

-- 
    Les Mikesell
     lesmikesell AT gmail DOT com

