Subject: Re: [BackupPC-users] Newbie setup questions
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 11 Mar 2011 15:13:21 -0500
Cesar Kawar wrote at about 18:27:34 +0100 on Friday, March 11, 2011:
 > 
 > On 11/03/2011, at 14:59, Jeffrey J. Kosowsky wrote:
 > 
 > > Cesar Kawar wrote at about 10:08:10 +0100 on Friday, March 11, 2011:

 > > I think rsync uses little if any cpu -- after all, it doesn't do much
 > > other than delta file comparisons and some md4/md5
 > > checksums. All of that is much more rate-limited by network bandwidth
 > > and disk i/o.
 > 

 > Not at all. Essentially, rsync was designed for exactly the
 > opposite goals of the ones you mentioned. rsync is bandwidth
 > friendly, but it is very cpu-expensive.

Of course it is bandwidth friendly, but we are talking about the
hard-link case, where memory typically seems to be the rate-limiting
factor. Also, even without any hard links, I find that just the disk i/o
& network bandwidth needed to transmit file listings, stat the files,
and send the block checksums is limiting even on underpowered machines.

 > The amount of memory needed is much less important than the cpu
 > needed. Again, from the rsync FAQ page:
 > 
 >      "Rsync needs about 100 bytes to store all the relevant information
 >      for one file, so (for example) a run with 800,000 files would
 >      consume about 80M of memory. -H and --delete increase the memory
 >      usage further."
 > 

You need to re-read that CRITICAL last sentence. Rsyncing without hard
links scales very nicely and indeed uses little memory and minimal
cpu. Rsyncing with pool hard links uses *tons* of memory. Been there,
done that!
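
To make the scaling concrete, here is a rough back-of-envelope in
Python. The ~100 bytes per entry is the figure from the rsync FAQ
quoted above; the pool size, backup count, and files-per-backup numbers
are made-up assumptions, not measurements:

    # Rough back-of-envelope only.  BYTES_PER_ENTRY comes from the rsync FAQ
    # quoted above; the other numbers are made-up assumptions.
    BYTES_PER_ENTRY  = 100        # rsync FAQ figure; -H adds more on top of this
    pool_files       = 800_000    # assumed number of unique pool files
    backups          = 30         # assumed number of retained backups under pc/
    files_per_backup = 500_000    # assumed files per backup tree (all hard links)

    # rsync -H on TopDir must list the pool *and* every hard-linked copy in
    # every pc/ tree, so the file list dwarfs the pool itself:
    total_entries = pool_files + backups * files_per_backup
    print(total_entries)                                  # 15,800,000 entries
    print(total_entries * BYTES_PER_ENTRY / 1e9, "GB")    # ~1.6 GB before -H overhead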


 > My firefox requires about double that memory just to open
 > www.google.com. I know that is "only" to process 800,000 files, but
 > with version 3.0.0 and later, it doesn't load all the files at
 > once. With a 512 MB computer you'll be fine, but in the particular
 > installation I was talking about before, 1 TB of data comprising 1 year
 > of historical data (that means a really big number of hard links per
 > file), the syncing process takes almost 100% CPU on an Intel Xeon
 > Quad Core for about 2 hours.

Have you ever *actually* tried rsyncing a pool of 800,000 files on a
computer with 512 MB of memory?
I tried rsyncing a pool of 300,000 files with only maybe a couple
dozen backups and it took days on a computer with 2 GB. Again, the CPU
was not a problem.

I'm surprised you could even rsync 1 TB of massively linked files in 2
hours, unless you have just a small number of large files.


 > rsync is a really cpu-expensive process. You can always use caching
 > for the md5 checksum process, but I wouldn't recommend that on an
 > off-site replicated backup. Caching introduces a small probability
 > of losing data, and that technique is already used when doing a
 > normal BackupPC backup with rsync transfer, so, if you then resync
 > that data to another drive, disk or filesystem of any kind, your
 > probability of losing data is a power of the original one.

First, the cpu consumption (for BackupPC archives) is *not* in the
md5sums but in the hard linking (you can verify this by doing an
rsync on the pool alone, or by rsyncing TopDir without the -H
flag). Moreover, the cpu requirements for the rolling checksums are
actually much lower for BackupPC archives than for normal files,
since you rarely need to do the "rolling" part, which is the
cpu-intensive part. This is because rolling is only needed when
files change, and pool files only change in the relatively rare event
of chain renumbering, plus, in the case of the rsync method with checksum
caching, in the one-time-only event when digests are added (but this
only affects the first and last blocks).
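
For concreteness, the "rolling" part is the weak checksum from the
rsync algorithm that has to be slid across a changed file one byte at a
time. Here is a minimal Python sketch of that scheme, as an
illustration of the cost structure only, not rsync's actual code:

    # Minimal sketch of an rsync-style rolling weak checksum (the scheme from
    # the rsync technical report).  Illustration only, not rsync's source.
    M = 1 << 16

    def weak_checksum(block):
        """Compute the weak checksum of a whole block from scratch: O(len)."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, old_byte, new_byte, block_len):
        """Slide the window one byte: O(1), but done once per byte of the file."""
        a = (a - old_byte + new_byte) % M
        b = (b - block_len * old_byte + a) % M
        return a, b

    data = bytes(range(64)) * 64          # toy "changed file"
    blk  = 700
    a, b = weak_checksum(data[:blk])
    for k in range(len(data) - blk):      # this per-byte loop is the cpu cost
        a, b = roll(a, b, data[k], data[k + blk], blk)
        assert (a, b) == weak_checksum(data[k + 1:k + 1 + blk])

For unchanged pool files the receiver's block checksums simply match,
so this loop essentially never runs.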

So, to the extent that you are cpu-limited, the problem is not the
md5sums but the hard links, which require both memory to store the
hard-link list (which is the limiting factor on many machines) and some
cpu time to search that list -- specifically, for each hard-linked file
(which for BackupPC is *every* file), rsync has to do a binary search
of the hard-link list. Also, I imagine that rsync was not optimized for
the extreme edge case represented by BackupPC archives, where (just
about) *every* non-zero-length file is hard linked.

The bottom line is that checksum caching is unlikely to have any
significant effect.
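
To picture the bookkeeping described above, here is a toy Python model
of a -H style pass: one record per link held in a sorted list (the
memory cost) plus a binary search for every file (the cpu cost). This
is a simplified model of the idea, not rsync's actual data structures:

    # Toy model only: one (dev, inode, path) record per directory entry, plus
    # a binary search per file.  Not rsync's implementation.
    import bisect, os

    def build_link_list(paths):
        """One record per *link* -- in a BackupPC TopDir that is nearly every file."""
        recs = []
        for p in paths:
            st = os.lstat(p)
            if st.st_nlink > 1:
                recs.append((st.st_dev, st.st_ino, p))
        recs.sort()
        return recs

    def find_first_link(link_list, path):
        """Binary-search the sorted list for this file's inode: O(log n) per file."""
        st = os.lstat(path)
        i = bisect.bisect_left(link_list, (st.st_dev, st.st_ino, ""))
        if i < len(link_list) and link_list[i][:2] == (st.st_dev, st.st_ino):
            return link_list[i][2]
        return None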

Second, regarding your concern about compounding checksum errors, a power
of a small error is still small. However, that is not even really the
case here, since the only thing one would need to worry about is
the false negative of having matching checksums but corrupted file
data. But this error is not directly compounded by the BackupPC
checksumming, since it is an error in the data itself. (Note that the
other potential false negative, an md5sum collision in the block data,
is vanishingly unlikely, particularly given both block checksums and
file checksums.) False positives at worst cause an extra rsync copy
operation.
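
As a toy illustration of "a power of a small error is still small" (the
numbers below are made-up assumptions, not measured error rates):

    # Toy arithmetic only; p is a made-up assumption, not a measured error rate.
    p       = 1e-9        # assumed chance one pass lets a corrupted file slip through
    n_files = 800_000     # pool size from the rsync FAQ example quoted earlier

    print(p, p**2)                        # 1e-09 vs 1e-18: a power of a small error
    print(n_files * p, n_files * p**2)    # ~8e-04 vs ~8e-13 across the whole pool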

More generally, if you are truly worried about the compounding of
small errors, then by extension you should never be backing up
backup archives at all. Any backup has some probability of error (due
to disk errors, ram errors, etc.), so a backup of a backup then has a
power of that original error.

 > Not recommended, I think.  I prefer to spend a little more money on
 > the machine once and not have surprises later on when the big boss
 > asks you to recover his files....

If you worry about compounding of errors in backups, then it is probably
better to have two parallel BackupPC servers rather than backing up a
backup -- since all errors compound and, as above, I think a faulty
checksum is not your most likely error source.

 > I don't have graphs, but the amount of memory available to any
 > recent computer is more than enough for rsync. Disk I/O is somewhat
 > important, and disk bandwidth is a constraint, but cpu speed is
 > the most important thing in my tests.

Interesting; based on my experience and on most reports on this mailing
list, memory is the main problem encountered. But perhaps, if you have
enough memory, the repeated binary search of the hard-link list becomes
the issue. Maybe rsync could be written better for this case, e.g. by
presorting the file list by inode number or something like that -- see
the sketch below.
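
One way to read that suggestion in Python (a sketch of the idea only,
not a patch to rsync): bucket the file list by (device, inode) in a
single pass with a hash table, so there is no per-file binary search at
all:

    # Sketch of the idea only -- not rsync code.  Group the file list by
    # (device, inode) in one pass; lookup is O(1) average instead of O(log n).
    import os
    from collections import defaultdict

    def group_links(paths):
        groups = defaultdict(list)
        for p in paths:
            st = os.lstat(p)
            groups[(st.st_dev, st.st_ino)].append(p)
        # Each value is one hard-link group: transfer the first entry as a real
        # file and recreate the rest as links on the receiver.
        return groups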

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/