Subject: Re: [BackupPC-users] Newbie setup questions
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sat, 12 Mar 2011 20:10:04 -0500
Cesar Kawar wrote at about 23:07:53 +0100 on Friday, March 11, 2011:
 > 
 > On 11/03/2011, at 21:13, Jeffrey J. Kosowsky wrote:
 > 
 > > Cesar Kawar wrote at about 18:27:34 +0100 on Friday, March 11, 2011:
 > >> 
 > >> On 11/03/2011, at 14:59, Jeffrey J. Kosowsky wrote:
 > >> 
 > >>> Cesar Kawar wrote at about 10:08:10 +0100 on Friday, March 11, 2011:
 > > 

 > A 100Mbps NIC on a P4 is not going to be the problem here. In the case 
 > presented, the second filesystem was plugged into a USB interface. 
 > I've always used SATA hard drives, and never had I/O constraints.
 > 
 > Someone did the test on his own... please, check the link: 
 > http://lwn.net/Articles/400489/

Very interesting article -- though there they are talking about
slowdown (and cpu consumption) that has nothing to do with hard links
and seems to be caused by the block checksums used to verify that the
copy is accurate. Even so, after the author optimized some things, the
penalty of rsync (without hard links) over a straight cp was about
50%, and he was able to get copy speeds of 85MB/sec.

However, in the BackupPC case we are talking about effective copy
speeds that are several orders of magnitude slower, so the source of
the problem must be something in the hard link tracking.
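
For reference, the bookkeeping that rsync -H has to do looks roughly
like the sketch below (an illustration of the idea, not rsync's actual
implementation): every file with more than one link has to be
remembered by (device, inode) so that later occurrences can be
recreated as links instead of copies. In a BackupPC TopDir that is
essentially every non-zero-length file, so the table grows with the
total file count.

    #!/usr/bin/env python3
    # Minimal sketch of the hard-link bookkeeping a tool like rsync -H
    # must do. NOT rsync's code -- just an illustration of why memory
    # grows with the number of linked files in a BackupPC archive.
    import os
    import sys

    def walk_hardlinks(topdir):
        seen = {}      # (st_dev, st_ino) -> first path seen with that inode
        extra_links = 0
        for root, dirs, files in os.walk(topdir):
            for name in files:
                path = os.path.join(root, name)
                st = os.lstat(path)
                if st.st_nlink > 1:                # candidate hard link
                    key = (st.st_dev, st.st_ino)
                    if key in seen:
                        extra_links += 1           # would be recreated as a link
                    else:
                        seen[key] = path           # must be kept for the whole run
        return seen, extra_links

    if __name__ == "__main__":
        seen, extra_links = walk_hardlinks(sys.argv[1])
        print(f"{len(seen)} distinct linked inodes, {extra_links} extra link targets")

Since (nearly) every pool file has multiple links, the 'seen' table
ends up holding an entry for just about every file in the archive,
which is why memory usage scales with the total number of files and
not just with the pool size.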

Honestly, I am a bit confused because your ability to rsync a 1TB
BackupPC archive in 2 hours seems to be at odds with the experience of
just about everyone else that talks about rsyncs taking days or
crashing on pools of just a few hundred gigabytes. And everybody else
has talked about memory issues. Indeed, if a 1TB archive of 1 year of
BackupPC data could be rsynced in 2 hours, I can almost guarantee that
we never would have had hundreds of threads looking for better ways to
backup a BackupPC archive. I would really love to understand why your
experience seems to be so different from others.

 > >> The amount of memory needed is much less important than the cpu
 > >> needed. Again, from rsync FAQ page:
 > >> 
 > >>   "Rsync needs about 100 bytes to store all the relevant information
 > >>   for one file, so (for example) a run with 800,000 files would
 > >>   consume about 80M of memory. -H and --delete increase the memory
 > >>   usage further."
 > >> 
 > > 
 > > You need to re-read that CRITICAL last sentence. Rsyncing without hard
 > > links scales very nicely and indeed uses little memory and minimal
 > > cpu. Rsyncing with pool hard links uses *tons* of memory. Been there,
 > > done that!
 > 
 > At most, you'll need another 100 bytes per hard link. Even if you
 > have 10 hardlinks per file (that actually means 10 versions of the
 > same file), it would be 8,000,000 files to process, which makes
 > about 800 MB of memory. Still not an issue (at least, it hasn't
 > been an issue for me).

I admit I don't understand how to reconcile your experience with
others and with how rsync works. I mean people with several gigs of
memory have run into problems with archives of a couple hundred gigs.

On the other hand, 800,000 pool files with 10 links per file is a
pretty small pool and represents also a pretty small set of
backups. After just backing up a few home machines for a few weeks I
had closer to a million pool files and 16 million pc files.

When you do incrementals once a day and store backups going back
several months (even with some type of exponential paring), you can
quite quickly have dozens of copies of each file per machine, and if
each machine has O(500,000) files then you are quickly talking about
tens of millions of total pc files with even a small number of
machines. Even at 100 bytes per file, you quickly get to a point
where even 8GB of memory is not enough.
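
To put rough numbers on it (a back-of-the-envelope sketch; the 100
bytes/file comes from the rsync FAQ quoted below, and the extra
per-hard-link cost is my own assumption for illustration):

    # Rough memory estimate for rsync'ing a BackupPC TopDir with -H.
    BYTES_PER_FILE = 100              # figure from the rsync FAQ
    EXTRA_PER_HARDLINK = 100          # assumed extra bookkeeping for -H

    machines = 10
    files_per_machine = 500_000       # O(500,000) files per host, as above
    backups_kept = 30                 # dailies going back about a month

    pc_entries = machines * files_per_machine * backups_kept
    estimate = pc_entries * (BYTES_PER_FILE + EXTRA_PER_HARDLINK)

    print(f"{pc_entries:,} pc-tree entries -> ~{estimate / 2**30:.1f} GiB of rsync memory")
    # 150,000,000 entries -> ~27.9 GiB, well past 8GB of RAM.

Even if the per-entry overhead is half of that, you are still well
beyond what most backup servers have installed.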


 > > I'm surprised you could even rsync 1Tb of massively linked files in 2
 > > hours. Unless you have just a small number of large files.
 > 
 > No, not 800,000. We had at that company over 2 billion files on our
 > fileserver. Most of them were .doc, .xls and the like.

Again, your experience seems so different from everybody else's.

Was this a BackupPC archive? Because if so, even using the math of 100
bytes per file, copying 2 billion files would require 200GB of RAM to
create the hard link list and I can't imagine any normal server has
that much RAM. 

And if you really rsynced a 1TB archive containing 2 billion files in
2 hours then your experience is truly 100% different from other
people's. I mean, that represents a raw speed of 138MB/sec, which is
orders of magnitude faster than what other people have seen.

On the other hand, if you are just talking about rsyncing 2 billion
non-linked files, then your experience is believable, since without
hard links there are no memory issues, and I can believe that the cpu
might be the rate-limiting step given the need to calculate rolling
checksums, particularly if the data hasn't changed much, so that most
of the time is spent checking checksums rather than transferring data.

 > BackupPC was running on a 4-core Xeon Dell PowerEdge 2900 II, with
 > two 500GB SATA hard drives in software RAID-1 and 4GB of RAM.
 > 
 > And when replicating the pools, the CPU was almost 100% used.

Are you saying that you rsynced a BackupPC archive of 2 billion
files in 2 hours with only 4GB of RAM??? And before you said 1TB, now
it looks like your disk is only 500GB?

Again, your experience seems to be 100% at odds with, and orders of
magnitude better than, anybody else's.

 > >> rsync is a really cpu-expensive process. You can always use caching
 > >> for the md5 checksum process, but I wouldn't recommend that on an
 > >> off-site replicated backup. Caching introduces a small probability
 > >> of losing data, and that technique is already used when doing a
 > >> normal BackupPC backup with rsync transfer, so, if you then resync
 > >> that data to another drive, disk or filesystem of any kind, your
 > >> probability of losing data is a power of the original one.
 > > 
 > > First, the cpu consumption (for BackupPC archives) is *not* in the
 > > md5sums but is in the hard linking (you can verify this by doing an
 > > rsync on the pool alone or rsyncing TopDir without the -H
 > > flag). Moreover, the cpu requirements for the rolling md5sum checksums
 > > are actually much less for BackupPC archives than for normal files
 > > since you actually rarely need to do the "rolling" part which is the
 > > actual cpu-intensive part. This is because you only do rolling when
 > > files change and pool files only change in the relatively rare event
 > > of chain renumbering plus in the case of the rsync method with checksum
 > > caching in the one-time-only event when digests are added (but this
 > > only affects the first and last blocks).
 > 

 > I did not talk about what BackupPC does. I was just saying that
 > replicating a BackupPC pool to another filesystem is a very CPU
 > intensive task.

I wasn't talking about what BackupPC does either. I was saying that by
the nature of the structure of BackupPC archives, rsync should require
a lot less cpu power to copy them than if they were files that changed
a lot. Much of the cpu consumption of (at least non-hard-link) rsync is
due to aligning rolling checksums, but if the files don't change then
there is no need for that cpu power, and if the timestamps, perms,
size, etc. don't change then even regular checksums aren't done.
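
To illustrate: rsync's default "quick check" looks only at size and
modification time, and only files that fail it get the block-checksum
and rolling-match treatment. A simplified sketch of that decision (not
rsync's actual code):

    import os

    def needs_checksums(src_path, dst_path):
        # Simplified version of rsync's default "quick check": if size
        # and mtime match, the file is assumed unchanged and no
        # checksums (rolling or otherwise) are computed for it.
        try:
            dst = os.lstat(dst_path)
        except FileNotFoundError:
            return True                      # no destination copy yet
        src = os.lstat(src_path)
        if src.st_size != dst.st_size:
            return True
        if int(src.st_mtime) != int(dst.st_mtime):
            return True
        return False                         # skip checksumming entirely

Since pool and pc files essentially never change in place, almost
every file should pass the quick check on a repeat replication run,
which is why I would not expect rolling checksums to be where the cpu
is going.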


 > > So, to the extent that you are cpu-limited, the problem is not with
 > > md5sums but with hard links which requires both memory to store the
 > > hard link list (which is limiting on many machines) plus some cpu
 > > intensity to search the list -- specifically rsync requires that for
 > > each hard linked file (which for BackupPC is *every* file), you need
 > > to do a binary search of the hard link list (which in BackupPC is
 > > every file). Also, I imagine that rsync was not optimized for the
 > > extreme edge case represented by BackupPC archives where (just about)
 > > *every* non-zero length file is hard linked. 
 > > 
 > > The bottom line is that checksum caching is unlikely to have any
 > > significant effect.

 > So, if checksum caching does not impact performance, or has only a
 > small impact, what's the reason to use it?
 > If you are right I will never use checksum caching again.

Are you talking about checksum caching for BackupPC or checksum
caching for rsync itself (which I think requires a patched version of
rsync and is non-standard)?

If you are talking about checksum caching within BackupPC itself, it
definitely can have significant benefits for full backups, where the
actual files would otherwise have to be compared, which would require
decompressing cpool files (slow) and then calculating block checksums
to compare the new file against the pool. So, checksum caching should
be quite beneficial since you then avoid decompressing or even reading
the full pool file and instead just need to read the appended
digest. I have not, however, seen any specific benchmarks. It would
seem that this benefit really applies only to files that are unchanged,
because if the file changes then you will have to decompress it anyway
to find & align the changes and calculate the deltas (plus the block
size might very well be different if the file size changes).
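
Conceptually the win is just that of any digest cache: pay for one
full read (and decompression) of the pool file once, then reuse the
stored digests on later fulls. A generic sketch of the idea (this is
not BackupPC's actual mechanism, which appends the cached rsync
digests to the compressed pool file itself; the cache layout here is
made up for illustration):

    import hashlib, os

    _digest_cache = {}   # hypothetical in-memory cache, keyed by inode/size/mtime

    def cached_digest(path):
        st = os.lstat(path)
        key = (st.st_ino, st.st_size, int(st.st_mtime))
        if key in _digest_cache:
            return _digest_cache[key]        # no read, no decompression
        h = hashlib.md5()
        with open(path, "rb") as f:          # the expensive part, done once
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        _digest_cache[key] = h.hexdigest()
        return _digest_cache[key]

The same logic explains why it only helps unchanged files: once a file
differs, you have to read (and decompress) it anyway to compute the
deltas.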

If you are talking about using checksum caching when manually rsyncing
a BackupPC archive, then I don't think it will have much of an effect
since the rate-limiting step is tracking and resolving hard links (at
least in the experience of seemingly everybody else).

 > > Second, regarding your concern of compounding checksum errors, a power
 > > of a small error is still small.  However, that is not even really the
 > > case here since the only thing one would need to worry about here is
 > > the false negative of having matching checksums but corrupted file
 > > data. But this error is not directly compounded by the BackupPC
 > > checksumming since it is an error in the data itself. (Note that the
 > > other potential false negative of md5sum collisions in the block data
 > > is vanishingly small, particularly given both block checksums and
 > > file checksums.) False positives at worst only cause an extra rsync
 > > copy operation.
 > > 
 > > More generally, if you are truly worried about the compounding of
 > > small errors then by extension you should never be backing up
 > > archive backups. I mean any backup has some probability of error (due
 > > to disk errors, ram errors, etc.) so a backup of a backup then has a
 > > power of that original error.
 > > 
 > >> Not recommended, I think. I prefer to spend a little more money on
 > >> the machine once and not have surprises later on when the big boss
 > >> asks you to recover his files....
 > > 
 > > If you worry about compounding of errors in backups then probably
 > > better to have two parallel BackupPC servers rather than backing up a
 > > backup -- since all errors compound and as above, I think a faulty
 > > checksum is not your most likely error source.
 > 

 > Well, it would be nice to be able to double every single hardware
 > resource in every company, but most of the time you have a budget
 > and a boss...  Of course the safest option is to have 2 backup
 > systems. It is even safer to have two totally different approaches
 > to backing up your data. BackupPC is great, but there could be a bug
 > that destroyed your pool while you were sleeping at home, and next
 > morning, your boss needs to recover a file and... voilà, there's no
 > backup!

All I was saying is that the compounding of checksum errors is no
different from the compounded errors inherent in taking backups of
backups. So, as long as the probability of error introduced by
checksums is of the same order of magnitude as (or less than) that of
other errors, there is no reason to single out checksums as being an
issue.
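
To put illustrative (made-up) numbers on the "power of a small error"
point: if the original false-match probability were p = 10^-6, then
even its square, p^2 = 10^-12, is still vanishingly small and well
below the rate of ordinary disk and RAM errors. Raising an already
tiny probability to a power only makes it tinier.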

 > You'd better have 2 backup solutions, if you can... but in the
 > meantime, it's better to back up your backup than nothing, don't
 > you think so? And if I can do it with a lower probability of
 > failure, I will, even if that just implies buying an extra $50 hard
 > drive.

I'm not sure how big your setup is, but on my SOHO setup, I use a $50
external hard drive and a $100 plug computer as my second backup. Now
that is way low end. But I'm sure for a few hundred total dollars you
could get a second parallel backup going using a bare-bones computer
and a cheap drive. Frankly, you could use just about any old obsolete
Pentium-class PC. It doesn't need to be powerful since, other than
compressing and computing some md5sums, it doesn't really require much
cpu power. And even if it is slower than your primary, you can run
your secondary backups just once or twice a week if needed. So you
could probably set that up for free with just scavenged hardware.

 > >> I don't have graphs, but the amount of memory available to any
 > >> recent computer is more than enough for rsync. Disk I/O is somewhat
 > >> important, and disk bandwidth is a constraint, but, cpu speed is
 > >> the more important thing in my tests.
 > > 
 > > Interesting, based on my experience and the experience of most reports
 > > on this mailing list, memory is the main problem encountered. But
 > > perhaps if you have enough memory, the repeated binary search of the
 > > hard link list is the issue. Maybe rsync could be written better for
 > > this case to presort the file list by inode number or something like that.
 > 

 > Ok, hard link sorting and searching is important, very important, and
 > takes a lot of cpu time, I will not argue this. In earlier versions
 > of rsync, before 3.0.0, rsync even broke in the process.  But, in
 > my experience, it didn't happen again after we moved to 3.0.2.
 > Maybe we were lucky, I don't know, but the truth is that over the
 > last year (2010), I used this technique to maintain 2 offsite
 > copies of the backup.  I'm not working for that company anymore
 > since January, now I have my own company, so I don't have access to
 > those systems to make any metrics on them, I wish I could.
 > 
 > As a bottom line, I don't care if it is because of checksums or
 > because of hardlinks, but rsync is a really CPU-intensive task.

My only interest is in figuring out how you have been successful with
just a couple of GB of memory in situations where others have
floundered. I would love it if we could all rsync BackupPC archives of
2 billion files in 2 hours. That would be awesome!

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/