Subject: Re: [BackupPC-users] BackupPC and MooseFS?
From: Holger Parplies <wbppc AT parplies DOT de>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Mon, 23 May 2011 21:04:09 +0200
Hi,

Mike wrote on 2011-05-20 09:05:11 -0300 [Re: [BackupPC-users] BackupPC and 
MooseFS?]:
> [...]
> being able to say "I want to have 2 copies of anything in this directory 
> and 3 copies of anything in this directory" is very nice. [...]

Les Mikesell wrote on 2011-05-23 11:12:18 -0500 [Re: [BackupPC-users] BackupPC 
and MooseFS?]:
> On 5/21/2011 7:24 PM, Scott wrote:
> > But does moosefs basically duplicate the data, so if you have 2tb of
> > backuppc data, you need a moosefs with 2tb of storage to duplicate the
> > whole thing?
> 
> Yes, it gives the effect of raid1 mirrors -

from what Mike wrote, shouldn't you be able to say "I want to have only one
copy of anything in <some directories> and two copies of everything else"?
With BackupPC, that doesn't seem to make any sense (but: see below) - why
would you want to replicate only part of the pool, and why only files that
happen to have a partial file md5sum starting with certain letters? You could
limit log files to a single copy, but is that enough data to even worry about?
This does bring up a question, though: how does it handle hard links if you
determine the number of copies by directory? That is, how many copies do you
get if a file is linked into one directory where you want three copies and
into another where you chose two?
If the answer is "five copies", it won't work with BackupPC ;-).

Mike, can you test what it does with hard links, e.g. by creating a large file
with several links?
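Something along these lines should show it (untested Python sketch; it
assumes a MooseFS mount at /mnt/mfs and the standard mfssetgoal, mfsgetgoal
and mfscheckfile tools on the PATH - adjust paths to your setup):

    import os
    import subprocess

    MFS = "/mnt/mfs"                       # assumed MooseFS mount point
    dir3 = os.path.join(MFS, "goal3")      # want 3 copies in here
    dir2 = os.path.join(MFS, "goal2")      # want 2 copies in here

    for d, goal in ((dir3, "3"), (dir2, "2")):
        os.makedirs(d)                     # fails if it already exists
        subprocess.check_call(["mfssetgoal", goal, d])

    big = os.path.join(dir3, "bigfile")
    with open(big, "wb") as f:             # write real data so chunks exist
        for _ in range(64):                # ~64 MB, i.e. one full chunk
            f.write(os.urandom(1024 * 1024))

    os.link(big, os.path.join(dir2, "bigfile"))   # second hard link

    # same inode, two directories with different goals - which one wins?
    for p in (big, os.path.join(dir2, "bigfile")):
        subprocess.check_call(["mfsgetgoal", p])
    # mfscheckfile reports how many copies each chunk actually has
    subprocess.check_call(["mfscheckfile", big])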

I'm just asking, because with normal UNIX file system usage patterns, you
could probably get away with cheating (and creating five copies) without
anyone complaining (or even noticing). Then again, the mechanism might be
totally different, like putting the number of copies in the inode (and
inheriting from the parent directory on file creation; presuming it *has* its
own inode and doesn't just use a different FS for local storage). If that is
the case, you could even conceivably have some hosts' data replicated X
times and other hosts' data Y times (e.g. Y=1) by tagging the appropriate pc/
directories accordingly. Only problem: *shared* data (file contents appearing
on hosts in both sets) would 'randomly' have X or Y copies, depending on which
set of hosts happened to contain the file first (but you could probably adjust
that later and "watch all the unmet goals get resolved" ;-).
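If per-directory goals do survive hard links in some sane way, the tagging
itself would be trivial - something like this untested sketch (assuming
$TopDir lives on a MooseFS mount at /mnt/mfs/backuppc; the host names are
made up, and I'm taking mfssetgoal -r's documented recursive behaviour on
faith):

    import subprocess

    TOPDIR = "/mnt/mfs/backuppc"           # assumed $TopDir on MooseFS
    x_hosts = ["mailserver", "fileserver"] # made-up names, X = 3 copies
    y_hosts = ["testbox"]                  # made-up names, Y = 1 copy

    for host in x_hosts:
        subprocess.check_call(["mfssetgoal", "-r", "3",
                               TOPDIR + "/pc/" + host])
    for host in y_hosts:
        subprocess.check_call(["mfssetgoal", "-r", "1",
                               TOPDIR + "/pc/" + host])
    # shared pool files hard-linked from both sets would end up with
    # whichever goal "won" for the inode - the ambiguity mentioned above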

> but if I understand it correctly the contents can be distributed across
> several machines instead of needing space for a full copy of even a single
> instance of the whole filesystem on any single machine or drive.

The way BackupPC works (heavily relying on fast read performance), I would
expect it to be important for performance to have a full copy of the file
system locally on the BackupPC server. Is there a way to enforce that?

Another consideration: how well does it handle a large backlog of
unmet goals? If you're replicating over a comparatively slow connection, you
might need to spread out updates to the "mirror(s)" over more time than your
backup window contains. Does a large backlog of unmet goals deplete system
memory needed for caching?
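If someone wants to watch that backlog, something like the following
untested sketch might do - note that I'm guessing at mfscheckfile's output
format ("chunks with N copies: M"), so treat that as an assumption, and the
sample path is made up:

    import re
    import subprocess

    def undergoal_chunks(path, goal=2):
        """Count chunks of `path` that have fewer than `goal` copies."""
        out = subprocess.check_output(["mfscheckfile", path]).decode()
        total = 0
        # assumed output format: " chunks with N copies: M"
        for copies, count in re.findall(
                r"chunks with (\d+) cop(?:y|ies):\s*(\d+)", out):
            if int(copies) < goal:
                total += int(count)
        return total

    # run this over a sample of pool files now and again later; if the
    # sum doesn't shrink, the backlog isn't being worked off
    print(undergoal_chunks("/mnt/mfs/backuppc/pool/1/2/3/somefile"))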

Mike wrote:
> I haven't tried backuppc on it yet, but storing mail in maildir folder
> works well, and virtual machine images work well.

Unfortunately, neither of these examples resembles BackupPC's disk usage.
Virtual machine images mostly mean high-bandwidth operations on single large
files; maildir folders use many small files, but the bandwidth is
probably severely limited by your internet connection and MTA processing
(DNSBL lookups, sender verification, Spamassassin, ...). Reading mail is
limited by your POP or IMAP server's processing speed (well, or NFS). And
all of that only happens if there is actually incoming mail or users
checking their mailbox, which you probably don't have at a sustained high
rate for longer periods of time.

While BackupPC's performance may also be limited by link bandwidth or
client speed, from what I read on this list, server disk performance seems
to be the most important limiting factor.

So, while your results are encouraging, we still simply need to try it out,
unless we can establish a reason why it won't work. For any meaningful
results, it would be best to have an alternate BackupPC server with
"conventional" storage (and comparable hardware) backing up the same clients
(but not at the same time) to compare backup performance with.
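For the timing itself, something as simple as this would probably suffice,
run as the backuppc user on each server (untested; the path to BackupPC_dump
varies by distribution, and "client1" is a made-up host name):

    import subprocess
    import time

    def time_full_dump(host):
        """Wall-clock seconds for a forced full backup of `host`."""
        start = time.time()
        # -f forces a full backup; path is Debian's, adjust for your install
        subprocess.check_call(
            ["/usr/share/backuppc/bin/BackupPC_dump", "-v", "-f", host])
        return time.time() - start

    print("full of client1 took %.0f s" % time_full_dump("client1"))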

Regards,
Holger
