Subject: Re: [BackupPC-users] Centralized storage with multiple hard drives
From: Holger Parplies <wbppc AT parplies DOT de>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Thu, 20 Mar 2014 00:26:02 +0100

Hi,

Les Mikesell wrote on 2014-03-19 11:25:38 -0500 [Re: [BackupPC-users] 
Centralized storage with multiple hard drives]:
> On Wed, Mar 19, 2014 at 5:53 AM, thorvald
> [...]
> With rsync/rsyncd xfers, you would at least get hardlinks to identical
> files in different runs on the same target.

make that "same files" in the sense of the XferLOG message - files that are
unchanged in comparison to the reference backup won't need extra space. If
you have identical content within one backup (say 100 copies of CVS/Root),
you'll still have individual storage for each instance of the content.
You get exactly that from 'rsync --link-dest' by the way.
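
For the archives, here is what that behaviour looks like with plain rsync -
an illustrative sketch, all paths invented:

    # Two consecutive runs against the same target (made-up paths).
    rsync -a src/ backups/run1/
    rsync -a --link-dest=../run1 src/ backups/run2/
    # A file unchanged between the runs now has link count 2, i.e.
    # shared storage ('stat -c' assumes GNU coreutils):
    stat -c '%h %n' backups/run1/somefile backups/run2/somefile
    # Two identical files *within* run2 still have link count 1 each.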

Les Mikesell wrote on 2014-03-19 14:35:17 -0500 [Re: [BackupPC-users] 
Centralized storage with multiple hard drives]:
> On Wed, Mar 19, 2014 at 1:48 PM, Timothy J Massey <tmassey AT obscorp DOT 
> com> wrote:
> >
> > > Let's say that the storage is not a problem for me and I can have as
> > > many TB or PT as I need. However the main assumption is that every
> > > box has got a separate "disk" to be backed up to. So now I faced the
> > > problem with BackupPC which does use pool or cpool to store files
> > > within :/. I don't need any compression or deduplication. Is there
> > > any way to backup files directly to pc/HOST/ instead ?
> >
> >
> > I am going to give a flat "no" to this.  You may be able to break things
> > within BackupPC to accomplish this (never run the link, for example), but
> > you are *breaking* things.  Don't do that if you expect *anyone* to be
> > able to help you.
> 
> I don't know about that

Well, I do. You'd be running modified code that behaves contrary to what
this list expects, and that isn't what you want it to be: tested and proven.
You'd get misleading help, because we wouldn't know what you modified and
how. You'd waste your time and ours sorting that out.

The point here is that BackupPC may just not be the right tool. You could
probably modify Apache to serve TFTP or DNS requests, but why on earth would
you do that if those are the only things you're going to use it for? Other
tools already do those jobs without modification. BackupPC is about
deduplication. It puts a lot of effort into it (in terms of code path, CPU
and disk utilisation). If all you really need is a smart rsync invocation
and some expiration logic, then why incur that unneeded overhead, literally
hundreds of times over?
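
For comparison, such a "smart rsync invocation with expiration logic" could
be as small as the following sketch. This is an illustration, not a
recommendation; the host name, paths and retention count are invented, and
it assumes a GNU userland:

    #!/bin/sh
    # Illustrative sketch: one dated snapshot per run, hardlinked
    # against the previous one, keeping only the newest $KEEP snapshots.
    HOST=somehost                  # invented example host
    DEST=/backups/$HOST            # invented destination
    KEEP=14                        # invented retention count
    NEW=$DEST/$(date +%Y-%m-%d)
    LAST=$(ls -1d "$DEST"/????-??-?? 2>/dev/null | tail -n 1)

    mkdir -p "$DEST"
    rsync -a ${LAST:+--link-dest="$LAST"} "$HOST:/data/" "$NEW/"

    # Expiration: remove all but the newest $KEEP snapshots
    # ('head -n -N' and 'xargs -r' are GNU extensions).
    ls -1d "$DEST"/????-??-?? | head -n -"$KEEP" | xargs -r rm -rf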

> - people on the list report fairly often that they are using too much disk
> space and it turns out that links have been failing for one reason or
> another - but they still have working backups.

Right. And the first thing we tell them to do is: fix linking. That's not
interesting. The interesting question is what we tell them when they *don't*
have working backups: "Find out why linking isn't working; your problem is
probably related to whatever is causing that. If not, start fresh with
working linking. If your problem persists, then come back."
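
Incidentally, a quick way to spot failed linking - assuming the stock layout
under $TopDir (the path below is the Debian default; adjust it for your
installation):

    # Mangled file names under pc/ start with 'f'; after a successful
    # BackupPC_link run they are hardlinked into the pool, so a link
    # count of 1 means the file was never linked.
    find /var/lib/backuppc/pc -type f -name 'f*' -links 1 | wc -l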

> > > I'm not going to backup couple of hundreds servers using one
> > > BackupPC instance of course but I want to back up at least 100
> > > servers per BackupPC instance.
> > >
> > > Is there something you could advise me ?
> >
> > Sure:  use virtualization.  Create your huge datastore (or multiple
> > datastores) and create a VM for each unit that needs its own pool.

That doesn't seem to fit "at least 100 servers per BackupPC instance" (and
"separate disk for each host").

> Interesting concept, but it seems like it would add a horrible amount
> of overhead in terms of setup and maintenance - even just tracking
> which VM does which backup.   Although - maybe it would mesh with
> whatever is driving the idea of keeping the backups separate.

Maybe. We're just guessing.

> You'd have scheduling issues that a single server would sort out, though.

Right, and you probably need to coordinate backups for the sake of your
network bandwidth, even if not for the disks.
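
That coordination is exactly what a single instance gives you through
config.pl; the values below are illustrative, not recommendations:

    # config.pl excerpt - one BackupPC instance throttles itself:
    $Conf{MaxBackups}     = 4;    # at most 4 backups running at once
    $Conf{MaxUserBackups} = 2;    # plus up to 2 user-requested backups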

> > There are other things that you'll have to worry about, no matter whether
> > it's a single instance or multiple VM's.  The screamingly obvious one is
> > disk performance. [...]
> 
> Throwing RAM at a disk performance problem usually helps.

You've used BackupPC before, Les, right? ;-)
BackupPC prefers pool reads over writes when possible, and it typically
accesses large amounts of data almost randomly. Caching metadata will help;
caching data most likely won't. The benefit of cached metadata is mainly in
the {c,}pool structure, I would expect - and that structure, along with a
lot of the disk reads, goes away if you have no (working) pool. The part of
the problem you are trying to fix with RAM most probably vanishes.
On the other hand, eliminating pooling means that each file in your backup
set is stored independently and accessed in roughly the same order on each
run. Unless you can cache the complete data of a backup (and keep it cached
until the next backup runs), you gain nothing from caching file data.
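
If you do throw RAM at it anyway, the one knob that follows from this
reasoning on a Linux host would be something like the following - my
suggestion, not something from BackupPC itself:

    # Bias the kernel toward keeping dentry/inode caches (metadata)
    # rather than file data; the default is 100, lower values favour
    # retaining metadata.
    sysctl vm.vfs_cache_pressure=50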

> As does not using raid5.

One disk per client host sort of precludes raid5 ;-).

Regards,
Holger
