Subject: Re: [BackupPC-users] Newbie setup questions
From: Cesar Kawar <kawarmc AT gmail DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 11 Mar 2011 23:07:53 +0100
On 11/03/2011, at 21:13, Jeffrey J. Kosowsky wrote:

> Cesar Kawar wrote at about 18:27:34 +0100 on Friday, March 11, 2011:
>> 
>> On 11/03/2011, at 14:59, Jeffrey J. Kosowsky wrote:
>> 
>>> Cesar Kawar wrote at about 10:08:10 +0100 on Friday, March 11, 2011:
> 
>>> I think rsync uses little if any CPU -- after all, it doesn't do much
>>> other than delta file comparisons and some md4/md5 checksums. All of
>>> that is much more rate-limited by network bandwidth and disk I/O.
>> 
> 
>> Not at all. Essentially, rsync was designed with exactly the opposite
>> goals from the ones you mentioned: rsync is bandwidth friendly, but it
>> is very CPU expensive.
> 
> Of course it is bandwidth friendly, but we are talking about the
> hard-link case, where memory typically seems to be the rate-limiting
> factor. Also, even without any hard links, I find that just the disk I/O and
> network bandwidth to transmit file listings, stat files, and send the block
> checksums is limiting even on underpowered machines.

A 100 Mbps NIC on a P4 is not going to be the problem here. In the case 
presented, the second filesystem was plugged into a USB interface. 
I've always used SATA hard drives and never had I/O constraints.

Someone ran this test on his own... please check the link: 
http://lwn.net/Articles/400489/

> 
>> The amount of memory needed is much less important than the CPU
>> needed. Again, from the rsync FAQ page:
>> 
>>      "Rsync needs about 100 bytes to store all the relevant information
>>      for one file, so (for example) a run with 800,000 files would
>>      consume about 80M of memory. -H and --delete increase the memory
>>      usage further."
>> 
> 
> You need to re-read that CRITICAL last sentence. Rsyncing without hard
> links scales very nicely and indeed uses little memory and minimal
> CPU. Rsyncing with pool hard links uses *tons* of memory. Been there,
> done that!

At most, you'll need another 100 bytes per hard link. Even if you have 
10 hard links per file (which effectively means 10 versions of the same file), 
that is 8,000,000 entries to process, or about 800 MB of memory. 
Still not an issue (at least, it hasn't been an issue for me).
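
Just as a back-of-the-envelope check (the ~100 bytes per entry figure is the 
one from the rsync FAQ quoted above; the file and link counts are example 
numbers, adjust them to your own pool), a quick Python sketch:

    # Rough estimate of rsync -H memory use for a BackupPC-style pool.
    # Assumption: ~100 bytes of bookkeeping per file-list entry (rsync FAQ),
    # plus one entry per extra hard link when -H is used.
    BYTES_PER_ENTRY = 100      # order of magnitude only
    pool_files = 800000        # unique pool files (example number)
    links_per_file = 10        # pc/ tree links per pool file (example number)

    entries = pool_files * links_per_file
    print("%d entries -> about %.0f MB of rsync memory"
          % (entries, entries * BYTES_PER_ENTRY / 1e6))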

> 
>> My Firefox requires about double that memory just to open
>> www.google.com. I know that is "only" to process 800,000 files, but
>> with rsync 3.0.0 and later, it doesn't load the whole file list at
>> once. With a 512 MB computer you'll be fine, but in the particular
>> installation I was talking about before, with 1 TB of data comprising
>> one year of historical data (which means a really large number of hard
>> links per file), the syncing process takes almost 100% CPU on an Intel
>> Xeon quad core for about 2 hours.
> 
> Have you ever *actually* tried rsyncing a pool of 800,000 files on a
> computer with 512 MB of memory?
> I tried rsyncing a pool of 300,000 files with only maybe a couple
> dozen backups and it took days on a computer with 2 GB. Again, the CPU
> was not the problem.
> 
> I'm surprised you could even rsync 1 TB of massively linked files in 2
> hours, unless you have just a small number of large files.

Not 800,000. At that company we had over 2 billion files on our fileserver. 
Most of them were .doc, .xls and the like.

BackupPC was running on a four-core Xeon Dell PowerEdge 2900 II, with two 
500 GB SATA hard drives in software RAID-1 and 4 GB of RAM.

And when replicating the pools, CPU usage was close to 100%.


> 
> 
>> rsync is a really CPU-expensive process. You can always use caching
>> for the md5 checksum step, but I wouldn't recommend that on an
>> off-site replicated backup. Caching introduces a small probability
>> of losing data, and that technique is already used when doing a
>> normal BackupPC backup with the rsync transfer method, so if you then
>> resync that data to another drive, disk or filesystem of any kind, your
>> probability of losing data is a power of the original one.
> 
> First, the CPU consumption (for BackupPC archives) is *not* in the
> md5sums but in the hard linking (you can verify this by doing an
> rsync on the pool alone or rsyncing TopDir without the -H
> flag). Moreover, the CPU requirements for the rolling md5sum checksums
> are actually much lower for BackupPC archives than for normal files,
> since you rarely need to do the "rolling" part, which is the
> actual CPU-intensive part. This is because you only do rolling when
> files change, and pool files only change in the relatively rare event
> of chain renumbering, plus, in the case of the rsync method with checksum
> caching, in the one-time-only event when digests are added (but this
> only affects the first and last blocks).

I was not talking about what BackupPC does internally. I was just saying that 
replicating a BackupPC pool to another filesystem is a very CPU-intensive task.
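
Concretely, what I mean by replicating the pool is a plain rsync -H copy of the 
whole TopDir to a second disk. A minimal sketch (the paths are examples only, 
adjust them to your layout; -H is precisely the flag that makes it expensive):

    # Minimal sketch: replicate a BackupPC TopDir to a second disk with rsync.
    # The paths are examples; -aH preserves the pool hard links, which is what
    # makes the run memory- and CPU-hungry on a big pool.
    import subprocess

    SRC = "/var/lib/backuppc/"           # example TopDir (trailing slash matters)
    DST = "/mnt/offsite-disk/backuppc/"  # example destination

    subprocess.run(["rsync", "-aH", "--delete", "--numeric-ids", SRC, DST],
                   check=True)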

> 
> So, to the extent that you are CPU-limited, the problem is not with
> md5sums but with hard links, which require both memory to store the
> hard link list (which is limiting on many machines) plus some CPU
> work to search that list -- specifically, for each hard-linked file
> (which for BackupPC is *every* file), rsync needs to do a binary
> search of the hard link list (which in BackupPC covers every file).
> Also, I imagine that rsync was not optimized for the extreme edge
> case represented by BackupPC archives where (just about) *every*
> non-zero length file is hard linked.
> 
> The bottom line is that checksum caching is unlikely to have any
> significant effect.

So, if checksum caching has no impact, or only a small impact, on performance, 
what's the reason to use it? 
If you are right, I will never use checksum caching again.
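
(As an aside, for anyone following along: the "rolling" checksum mentioned 
above is the weak checksum of the rsync delta algorithm, which can be slid 
along a file one byte at a time instead of being recomputed from scratch at 
every offset. A rough Python sketch of the idea, simplified and not rsync's 
actual C code:)

    # Rough sketch of an rsync-style weak rolling checksum (simplified).
    # The interesting part is roll(): moving the window by one byte is O(1),
    # but on a changed file this still has to run over the whole file, which
    # is where the CPU time goes.
    M = 1 << 16

    def weak_checksum(block):
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        # Slide the window one byte: drop out_byte on the left, add in_byte.
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b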

> 
> Second, regarding your concern about compounding checksum errors: a power
> of a small error is still small. However, that is not even really the
> case here, since the only thing one would need to worry about is
> the false negative of having matching checksums but corrupted file
> data. But this error is not directly compounded by the BackupPC
> checksumming since it is an error in the data itself. (Note that the other
> potential false negative, md5sum collisions in the block data, is
> vanishingly unlikely, particularly given both block checksums and file
> checksums.) False positives at worst cause an extra rsync copy
> operation.
> 
> More generally, if you are truly worried about the compounding of
> small errors, then by extension you should never be backing up
> backup archives at all. Any backup has some probability of error (due
> to disk errors, RAM errors, etc.), so a backup of a backup has a
> power of that original error.
> 
>> Not recommended, I think. I prefer to spend a little more money on
>> the machine once and not have surprises later on when the big boss
>> asks you to recover his files....
> 
> If you worry about compounding of errors in backups then probably
> better to have two parallel BackupPC servers rather than backing up a
> backup -- since all errors compound and as above, I think a faulty
> checksum is not your most likely error source.

Well, it would be nice to be able to double every single hardware resource in 
every company, but most of the time you have a budget and a boss...
Of course the safest option is to have two backup systems. It is even safer to 
have two totally different approaches to backing up your data. BackupPC is 
great, but there could be a bug that destroys your pool while you are sleeping 
at home, and the next morning your boss needs to recover a file and... voilà, 
there's no backup!

You'd better have two backup solutions if you can... but in the meantime, it's 
better to back up your backup than to have nothing, don't you think? And if I 
can do it with a lower probability of failure, I will, especially if that only 
implies buying an extra $50 hard drive.

> 
>> I don't have graphs, but the amount of memory available to any
>> recent computer is more than enough for rsync. Disk I/O is somewhat
>> important, and disk bandwidth is a constraint, but CPU speed is
>> the most important factor in my tests.
> 
> Interesting; based on my experience and on most reports on this
> mailing list, memory is the main problem encountered. But
> perhaps if you have enough memory, the repeated binary search of the
> hard link list is the issue. Maybe rsync could be written better for
> this case, to presort the file list by inode number or something like that.

OK, hard-link sorting and searching is important, very important, and takes a 
lot of CPU time; I will not argue with that. In earlier versions of rsync, 
before 3.0.0, rsync would even break during the process.
But in my experience, that didn't happen again after we moved to 3.0.2.
Maybe we were lucky, I don't know, but the truth is that over the last year 
(2010), I used this technique to maintain two offsite copies of the backup.
I haven't been working for that company since January; now I have my own 
company, so I don't have access to those systems to gather any metrics on them. 
I wish I could.
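
If anyone wants to get a feel for how big that hard-link table is on their own 
pool, here is a rough sketch that walks TopDir and groups files by inode, which 
is roughly the bookkeeping rsync -H has to hold in memory (the path is an 
example only, and on a real pool the scan itself takes a while):

    # Rough sketch: count hard-linked entries under a BackupPC TopDir by
    # grouping on (device, inode). The path is an example only.
    import os
    from collections import Counter

    TOPDIR = "/var/lib/backuppc"    # example path

    groups = Counter()              # (st_dev, st_ino) -> number of paths seen
    for root, dirs, files in os.walk(TOPDIR):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            if st.st_nlink > 1:     # only multiply-linked files matter for -H
                groups[(st.st_dev, st.st_ino)] += 1

    entries = sum(groups.values())
    print("%d hard-link groups, %d linked paths" % (len(groups), entries))
    print("~%.0f MB at ~100 bytes per entry" % (entries * 100 / 1e6))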

The bottom line is: I don't care whether it is because of the checksums or 
because of the hard links, but rsync is a really CPU-intensive task.



_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/