Subject: Re: [BackupPC-users] Really slow with rsync, even on incrementals
From: John Goerzen <jgoerzen AT complete DOT org>
To: backuppc-users AT lists.sourceforge DOT net
Date: Thu, 24 Feb 2011 01:42:56 +0000 (UTC)
Les Mikesell <lesmikesell <at> gmail.com> writes:

> This part doesn't make sense to me. You mentioned that you saw it 
> opening existing files with strace.  I don't think that should happen if 
> (a) the file exists in the previous full backup and (b) the file's 
> timestamp and length have not changed.  Are you sure these two things 
> are true for the files in question?  Note that large slowly growing 

Looking at it more closely, you're right.  It's mostly opening directories and
attrib files.

I am running a full backup every two weeks with incremental levels [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14].  My last full was BackupPC backup number 7,
and I'm watching incremental backup #14 run right now.
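
For reference, the relevant bits of my config.pl look more or less like this
(from memory, so the exact period values may be slightly off):

  # fulls roughly every two weeks
  $Conf{FullPeriod} = 13.97;
  # incrementals every night
  $Conf{IncrPeriod} = 0.97;
  # one incremental level per night between fulls
  $Conf{IncrLevels} = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14];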

For each directory, it opens

/var/lib/backuppc/pc/hostname/13/...path/dir
/var/lib/backuppc/pc/hostname/13/...path/dir/attrib
/var/lib/backuppc/pc/hostname/12/...path/dir
/var/lib/backuppc/pc/hostname/12/...path/dir/attrib

on down to 7.
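
(In case anyone wants to reproduce this, I'm watching it with something along
these lines, where <pid> is the BackupPC_dump process for this host:

  strace -f -e trace=open -o /tmp/bpc-dump.trace -p <pid>
  # let it run for a while, interrupt it, then count the attrib opens:
  grep -c '/attrib' /tmp/bpc-dump.trace

Not exact, but close enough to see the pattern.)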

I back up a number of git repositories and the like, which may be moderately
pathological for this code.  Even so, it still takes 1.5 hours to back up /usr
and 1.25 hours to back up the selected bits of /ntfs.  My current estimate is
that the majority of that time is spent opening directories and attrib files on
the backup server, since neither of those filesystems changes much.

> files or databases with small changes can be problematic, since the 
> rsync mechanism will duplicate the target file with a combination of 
> uncompressing the old and merging changed blocks from the new - which is 
> often slower than just copying the whole thing over again.

True.  I was a bit surprised at the nearly 1GB changeset each night, and
discovered a 350MB mail index database was being backed up.  I have excluded
that, which may help some.  The filesystem it is on was taking about an hour to
back up.
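
The exclude itself is just the usual thing in config.pl, something like this
(the share name and path here are placeholders, not my real ones):

  $Conf{BackupFilesExclude} = {
      '/home' => ['/someuser/mail/index.db'],   # placeholder path
  };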

/var backs up in 2 minutes.

Some tidbits from the host being backed up:

find /usr -type d -print | wc -l
23143
find /var -type d -print | wc -l
491
find /home -type d -print | wc -l
24880
find /ntfs -type d -print | wc -l
20748
find /other -type d -print | wc -l
22776
find / -xdev -type d -print | wc -l
1455

Putting it all together, that's 93,493 directories.  If it is accessing each one
(7 * 2 [one open for the directory, one for its attrib file] == 14) times, then
that's 1,308,902 file/directory reads just to do the incremental.  That's a rate
of at least 66 opens+reads+closes per second, which, considering these files and
directories are likely spread out across the disk, is not bad at all.
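
For anyone who wants to check the arithmetic:

  echo $(( (23143 + 491 + 24880 + 20748 + 22776 + 1455) * 14 ))   # 1308902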

rdiff-backup, which I used before, stored this information in monolithic
metadata files, which may partially explain the difference.  Plain rsync also
only scans a directory tree once; while it calls stat() a lot, it isn't calling
opendir() on each directory 7 times or re-reading attrib file contents either.

There seems to be a strong correlation between lots of directories and
incremental backup times.  This correlation may not necessarily be present for
full backups.

I am thinking it may help to make my incrementals more "shallow", say [1, 2, 3],
even if I still only do fulls once every 14 days.  Of course, if I create some
25GB files, that may still run into other issues.
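
If I go that route, I think the change is just this in config.pl (untested on my
end; if I'm reading the docs right, once the list runs out the remaining
incrementals repeat the last level until the next full):

  # fulls still roughly every two weeks
  $Conf{FullPeriod} = 13.97;
  # nightly incrementals at levels 1, 2, 3, 3, 3, ... until the next full
  $Conf{IncrLevels} = [1, 2, 3];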

> One other thing is that rsync will transfer the entire directory listing 
> and hold it in RAM while doing the comparisons.  This might be a problem 
> with a very large number of files and a small amount of RAM.
> 

I have been monitoring this and have not observed any swap usage during BackupPC
runs.

Thanks everyone for the tips and suggestions.

Does the above analysis look accurate?  And if it does, then what should I do
about it?  I'm pretty lost at that point.

Thanks,

-- John


