Subject: Re: [BackupPC-users] RAID and offsite
From: Holger Parplies <wbppc AT parplies DOT de>
To: Michael Conner <mdc1952 AT gmail DOT com>
Date: Fri, 29 Apr 2011 04:50:21 +0200

Hi,

Michael Conner wrote on 2011-04-27 10:27:18 -0500 [Re: [BackupPC-users] RAID 
and offsite]:
> On Apr 26, 2011, at 12:08 PM, Les Mikesell wrote:
> > On 4/26/2011 11:38 AM, Michael Conner wrote:
> >> [...]
> >> Someone used a RAID 1 setup but only put in the second disk periodically,
> >> then removed it for offsite storage. I have three 2T drives, so was
> >> considering something similar where I would keep a normal 2-disk RAID 1
> >> setup but periodically remove one disk and replace it with a prior
> >> offsite disk.

just to summarize what has been posted so far:

1.) Having an *additional* disk (i.e. 3-disk RAID 1 with 2 permanent and 1
    "offsite" member) protects you against single disk failures during rebuild.
    Other failures (software, hardware, controller, lightning, etc.) can still
    do harm, so it is still not perfect, but I think there is no disagreement
    that the additional RAID member adds protection against one very real
    failure scenario.

2.) You really need more than one "offsite" disk, if you are taking "offsite"
    seriously. I.e. bringing the disk on-site, failing one RAID member, adding
    the previous offsite disk, and then taking the new offsite disk off-site
    will temporarily have all disks on-site. That may or may not be of concern
    for you, but it is worth emphasizing.
    On the other hand, first failing one RAID member, taking it off-site, then
    bringing in the other disk and adding it, will leave you with a degraded
    RAID for a considerable amount of time (and may not work for you, depending
    on how often you want to resync).

With just four disks, you can have both a permanently intact 2-way RAID 1 (a
3-member array with two disks always connected and the third slot used only by
whichever disk is being resynced) and one copy that is always offsite. Normally,
you keep both "offsite" disks offsite and bring them in alternately to resync.

> > [...]
> > But, note that even though you don't technically have to stop/unmount 
> > the raid while doing the sync, realistically it doesn't perform well 
> > enough to do backups at the same time. I use a cron job to start the 
> > sync very early in the morning so it will complete before backups would 
> > start.

How do you schedule the sync? (Or are you just talking about hot-adding the
disk via cron?)
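The crude version I can imagine is a cron entry along these lines (device name
and schedule purely illustrative):

    # /etc/cron.d/offsite-resync -- re-add the rotated-in disk early enough
    # that the resync finishes before backups start
    0 3 * * *  root  mdadm /dev/md0 --add /dev/sdb1

but I'd be interested in how you actually do it.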

> All my sata drives are "external" internals. That is, they are connected to
> PCI sata controller but since there are no bays to install them in the
> computer chassis, I just run the cables outside through a PCI slot bar.
> Still have to figure out a long-term housing solution. At least they
> are easy to access.

I don't think eSATA has any real disadvantages over SATA performance-wise.
Sure, you have external cabling and one or more separate power supplies as
additional points of failure. But if you have that anyway, you might as well
use standard cables that somewhat facilitate handling. Or buy a computer
chassis that will accommodate your drives (and use eSATA for the offsite
drive(s)).

> So I would be ok doing something like this:
> Stop BPC process
> Unmount raid array (md0 made up of sda1 and sdb1)
> Use mdadm to remove sdb1 from the array

Assuming you want to remount your file system and restart BackupPC, you can do
so at this point (or later). As Les said, your performance may vary :).

> Take off the sdb drive, attach offsite one in its place

Assuming your kernel/SATA-driver/SATA-chipset can handle hotswapping ...
otherwise you'd need to reboot here.

> Use mdadm to add sdb1 to md0 and reconstruct
> 
> Maybe cycle through whether I remove sda or sdb so all drives get used
> about the same amount over time.

I'm sure that's a point where we'll all disagree with each other :-).
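
Going back to the sequence itself: spelled out, it would be roughly the
following (a sketch only; the service name and mount point depend on your
installation):

    /etc/init.d/backuppc stop
    umount /var/lib/backuppc            # or wherever md0 is mounted
    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1

    # physically swap sdb for the offsite disk (hotswap or reboot, see above)

    mdadm /dev/md0 --add /dev/sdb1      # rebuild onto the new disk starts here
    mount /var/lib/backuppc             # remount (and restart BackupPC) whenever
    /etc/init.d/backuppc start          # you like, performance permitting
    cat /proc/mdstat                    # watch the rebuild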

Personally, I wouldn't use a common set of disks for normal backup operation
and offsite backups. BackupPC puts considerable wear on its pool disks. At
some point in time, you'll either have failing disks or proactively want to
replace disks before they start failing. Are you sure you want to think about
failing pool disks and failing offsite backup disks at the same time (i.e.
correlated)? I assume failing pool disks are one of the things you want to
protect against with offsite backups. So why use backup media that are likely
to begin failing just when you'll need them?

> My main concerns were: can I remount and use md0 while it is rebuilding and
> that there is no danger of the array rebuilding to the state of the newly
> attached drive (I'm very paranoid).

I can understand that. I used RAID 1 in one of my computers (root FS, system,
data) for a time simply for the purpose of gaining experience with RAID 1. I
didn't notice much (except for the noise of the additional disk) until one
disk had some sort of problem. I don't remember the details, but I recall that
I had expected the computer to boot unattended (well, the 'reboot' was
manual ... or was it actually a crash that triggered the problem?), which it
didn't. I think it brought up the *wrong* (i.e. faulty) disk of the mirror and
failed on an fsck. Physically removing the faulty disk "corrected" the problem.
Somewhat disappointing. What's more, *both* disks are now working flawlessly
in separate computers, so I'm really clueless what the problem was in the
first place. Sounds like a software error, much like in Jeffrey's case.

On the other hand, on the computers where it matters (servers, BackupPC), RAID
1 has been running for years without a real problem (I *have* seen RAID members
dropped from an array for no discernible reason, but, mostly, re-adding
them simply worked; more importantly, there was no interruption of service).
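For reference, re-adding a dropped member looks like this (device name just an
example):

    cat /proc/mdstat                    # see which member was dropped
    mdadm --detail /dev/md0             # array state and event counts
    mdadm /dev/md0 --re-add /dev/sdb1   # if --re-add refuses, a plain --add
                                        # forces a full resync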

I guess that simply means: test it before you rely on it working. Many people
are using Linux RAID 1 in production environments, so it appears to work well
enough, but there are no guarantees your specific
software/kernel/driver/hardware combination will not trigger some unknown (or
unfixed ;-) bug.

It *would* help to understand how RAID event counts and the Linux RAID
implementation in general work. Has anyone got any pointers to good
documentation?
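The little I do know: each member's superblock carries an event count, which
you can at least inspect, e.g.:

    mdadm --examine /dev/sda1 | grep -i events
    mdadm --examine /dev/sdb1 | grep -i events
    # the member with the lower count is the stale one and should become
    # the resync target (but that is precisely the part I'd like to see
    # documented properly)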

> I assume that as long as I use mdadm to remove and add sdb, it will use sda
> as the base (or vice versa).

I feel at least *that* assumption should be safe :-). I just wouldn't want to
wait for a 2TB resync to complete ... or think too much about what happens if
it doesn't. For me, resyncs and interrupted resyncs seem to work, but then,
that is for disks that *are* normally synchronized, and where I probably
wouldn't notice if something was copied in the wrong direction, unless it
happened to be FS metadata that would lead to a corrupt FS.
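For what it's worth, you can at least ask md whether the two halves of a mirror
currently agree (which won't tell you whether the right direction was chosen
for a resync, only whether the copies match now):

    echo check > /sys/block/md0/md/sync_action   # read and compare both mirrors
    cat /sys/block/md0/md/mismatch_cnt           # non-zero means they differ(ed)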

So, there's good reason to believe it will "simply work", but it's also a
complex enough matter to justify a fair amount of scepticism.

Hope that helps.

Regards,
Holger
