Subject: Re: [BackupPC-users] High Repeated Data Transfer Volumes During Incremental Backup
From: John Rouillard <rouilj-backuppc AT renesys DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 1 Apr 2011 15:36:18 +0000
On Fri, Apr 01, 2011 at 10:59:17AM -0400, nhoeller AT sinet DOT ca wrote:
> On 2011-03-31 15:47 John Rouillard wrote:
> > The only way the March 30th backup wouldn't have transferred the
> > files was if the March 29th backup was a level 1 incremental and the
> > March 30th was a level 2 incremental. In that case the March 30th
> > incremental's reference tree would have been the March 29 backup,
> > which already had the files in the new (moved) location, and it
> > would have been able to determine that the files were identical.
> 
> > Transfer decisions are based on the file names under the pc
> > directory. If a file doesn't exist in the comparison tree (which for
> > an incremental is taken from the most recent lower-level backup,
> > IIRC) it is transferred. A different name/path results in the file
> > being transferred again.
> 
> > Pooling decisions are based on the checksums of the files that were
> > transferred. Newly transferred files are checksummed and compared to
> > files in the pool. So after the transfer occurred, pooling should
> > have happened and those newly transferred files would have been
> > hardlinked to the pooled copies.
> 
> John, I am having trouble reconciling what I see with your description. As 
> far as I know, I do only one level of incremental backups with a full 
> backup once a week.

Gotcha. If I understand correctly, every incremental will then
transfer any file that is not present (at the same path) in the last
full backup.
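
Roughly, the reference for an incremental of level N is the most
recent previous backup of a lower level. A minimal sketch of the idea
(written for this mail, not BackupPC's actual code):

  use strict;
  use warnings;

  # Sketch only: pick the reference backup for an incremental of level $level.
  # @backups is oldest..newest; each entry is { num => ..., level => ... }.
  sub pick_reference {
      my ($level, @backups) = @_;
      my ($ref) = grep { $_->{level} < $level } reverse @backups;
      return $ref;   # with only fulls (level 0) and level 1 incrementals,
                     # this is always the last full
  }

With a weekly full and only level 1 incrementals that reference stays
the last full all week, which is why, in this model, a file that
appears (or moves) mid-week keeps getting transferred until the next
full.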

> I see where backuppc is checking date stamps of the 
> incrementals against the last full backup.  However, the backup data 
> transfer volumes suggest that the entire file is only transferred once. My 
> typical daily Internet bandwidth is around 300-500MB.  The two 350MB files 
> were uploaded March 20th.  The incremental backup on the next day bumped 
> my Internet usage to 1,480MB.  The next day was also an incremental backup 
> but my Internet usage was only 480MB. 

Hmm, from your original description:

  On March 24, backuppc did an incremental backup that picked up two
  350MB files which had been uploaded to my web server. On March 25,
  backuppc did a full backup and indicated that the files were 'same'

I thought the timeline was:

  files uploaded
  incremental (high bw) (March 24)
  full (lower bw because the March 24 incremental tree is the
        reference tree) (March 25)
  incremental(s) (low bw because the March 25 full tree is the
        reference tree)
  files moved
  incremental (high bw: uses the March 25 full for reference) (March 29)
  incremental (high bw: uses the March 25 full for reference) (March 30)
  full backup (lower bw: uses the March 30 (last) incremental for
        reference) (March 30)
  incrementals (low bw: use the March 30 full as the reference tree,
        which has the files in the proper (new) location)
 
> The incremental backup on March 29th bumped up my Internet usage to 990MB. 
> Even if backuppc decided it had to download the entire files because they 
> were in a different path, I would have expected that the incremental 
> backup on March 30th would have noticed that the files were already in the 
> pool.

The pool has no path information in it. Only the reference tree
does. Your reference tree for *both* incrementals after the move was
the full that you ran on the 25th, which did not have the files in
the new (moved) location.

Only after a transfer is done is the pool checked for identical
files. Identical files are then hard linked to save disk space. That
is an entirely different mechanism: it has no impact on transferred
data, only on stored data.
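
If it helps to see the two decisions side by side, here is a rough
sketch (written for this mail, not BackupPC source; the data
structures are made up):

  use strict;
  use warnings;

  # Decision 1: transfer. Purely name/path based, against the reference tree.
  sub needs_transfer {
      my ($ref_tree, $path) = @_;          # $ref_tree: set of paths in the reference backup
      return !exists $ref_tree->{$path};   # new or moved path => whole file is transferred
  }

  # Decision 2: pooling. Purely content based, and it only runs after the transfer.
  sub pool_file {
      my ($pool, $digest, $new_copy) = @_; # $pool: digest => path of the pooled copy
      if (my $pooled = $pool->{$digest}) {
          unlink $new_copy;                            # drop the freshly written copy...
          link $pooled, $new_copy or warn "link: $!";  # ...and hardlink it to the pool
      } else {
          $pool->{$digest} = $new_copy;                # first copy of this content seeds the pool
      }
  }

A moved file fails decision 1 (its new path isn't in the reference
tree), gets transferred, and then immediately passes decision 2, which
is why your disk usage barely grows even though the bandwidth does.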

> However, the Internet usage on March 30th was over 700MB when I
> checked early in the morning.

Yup, because it was using the prior full (March 25) as the reference
tree, and the files were in their original (un-moved) location in
that tree.

> The full backup later that day 'got it right' and only backed up 40MB. 

Right, because the full used the last (March 30) incremental as its
reference tree, and the files had already been moved in the March 30
incremental.

> I run a bunch of MediaWiki sites, all of which used the same code base but 
> each site installed the code in its own directory structure.  My 
> recollection is that backuppc only physically transferred one set of code 
> files.  The additional sites did not result in the same files being 
> transferred again, even though they were in different paths.

I claim all the copies of the code in each location were copied over
on the first backup. Pooling would turn all those copies into
hardlinks to a single file in the pool, but they would still have
been transferred.
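
You can check that after the fact: pooled copies share an inode.
Something like this (the paths are only examples of BackupPC's mangled
pc-tree names; adjust for your topdir and hosts):

  use strict;
  use warnings;

  # Two copies of the same MediaWiki file from different sites in one backup.
  my @copies = (
      '/var/lib/backuppc/pc/web1/123/f%2fvar%2fwww/fsite1/findex.php',
      '/var/lib/backuppc/pc/web1/123/f%2fvar%2fwww/fsite2/findex.php',
  );

  for my $f (@copies) {
      my (undef, $ino, undef, $nlink) = stat $f or die "stat $f: $!";
      printf "%s  inode=%d  links=%d\n", $f, $ino, $nlink;
  }
  # Same inode number => one physical copy on disk, even though each copy
  # was transferred over the wire when it first showed up at that path.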

We have a lot of subversion checkouts in our backup sets. When
somebody checks out a new tree, I see the files being transferred
(usually adding an hour or two to the backup). Later I can also see
that the transferred files were linked to existing copies in the
pool, but I still get the large transfers when the developers create
a new checkout. The transfers only go away if:

 * the files are backed up by a full, or
 * the level of the current backup is higher than a backup that has
   the files (e.g. if I have the files in a level 2 backup, a level 3
   backup won't transfer the files, but a subsequent level 1 backup
   will; see the config excerpt after this list), or
 * I play games and move files around under backuppc to create a
   reference backup with the files in the correct locations (not
   recommended, use at your own risk, YMMV, here be dragons).
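
For completeness, the second bullet needs multi-level incrementals,
which in BackupPC 3.x is the $Conf{IncrLevels} setting in config.pl.
An illustrative excerpt (example values, not a recommendation):

  $Conf{FullPeriod} = 6.97;                 # one full roughly every week
  $Conf{IncrPeriod} = 0.97;                 # incrementals roughly daily
  $Conf{IncrLevels} = [1, 2, 3, 4, 5, 6];   # deepening levels between fulls
  # With deepening levels each incremental's reference tree is the most
  # recent lower-level backup rather than the last full, so a file picked
  # up on Tuesday is not transferred again on Wednesday.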

> My suspicion is that backuppc gets confused if files were backed 
> up/excluded/unexcluded or backed up/moved.  I will need to test out 
> various scenarios with tracing enabled, but won't get a chance for while.

I think excludes behave the same as though the file just wasn't there,
but I am not positive about that.

If you find out otherwise, I would be interested in seeing your
results. What I described above has explained all the file transfer
triggers I have seen to date.

-- 
                                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
