Subject: Re: [BackupPC-users] Why does backuppc transfer files already in the pool
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sat, 28 Aug 2010 21:58:44 -0400
martin f krafft wrote at about 19:44:53 +0200 on Saturday, August 28, 2010:
 > 
 > I think that one of two things should happen instead:
 > 
 > 1. If the dump process has access to the following information: (a)
 >    checksum of the 1st and last/8th 128k block of the file, (b) the
 >    size of the client's file, and it considered those data reliable
 >    enough to identify an existing file, it should terminate the
 >    transfer and move on.

You are ignoring pool collisions, which are very real and not
altogether infrequent. For example, a constant-length log or database
file could easily have the same 1st and 8th 128k blocks but still be
different elsewhere.

The choice of partial-file checksums is a trade-off between the speed
of computing the checksum to rule out non-matches and the rate of pool
collisions. Using a full-file checksum would make collisions
statistically almost impossible (given a big enough checksum) but
would require always reading in the entire file. Craig chose a
balance.
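
To make the trade-off concrete, here is a rough Python sketch of the
partial-file digest idea (BackupPC itself is Perl; the exact block
boundaries and ordering below are my approximation, not the real
pool-hash code):

import hashlib
import os

BLOCK = 128 * 1024  # 128K

def partial_digest(path):
    # Hash only the file size plus the 1st and last/8th 128K blocks,
    # so a huge file never has to be read in full.
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    md5.update(str(size).encode())        # file size is part of the hash
    with open(path, "rb") as f:
        md5.update(f.read(BLOCK))         # 1st 128K block
        if size > 2 * BLOCK:
            # for bigger files, also hash the last (or 8th) 128K block
            f.seek(min(size, 8 * BLOCK) - BLOCK)
            md5.update(f.read(BLOCK))
    return md5.hexdigest()

The speed comes from reading at most 256K per candidate file; the
collisions come from everything between those two blocks being
invisible to the hash.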

 > 
 > 2. Assuming that the two 128k block checksums and the file size are
 >    not collision-free (they probably aren't), backuppc should really
 >    uncompress the pool file and employ rsync's rolling checksum to
 >    update the file (in memory). If there were any changes, then it
 >    should write out the NewFile to disk; in the absence of changes,
 >    it should create the hardlink.

While I don't understand all the details of the rsync checksums, you
seem to be missing the fact that when using rsync with the cpool, the
actual block and full-file rsync checksums are appended to the end of
the cpool file. It is therefore not necessary to always uncompress the
file; it is sufficient to just read out the stored checksums (though
with checksum caching you can choose to have a predetermined fraction
of the files verified each time). Note that I may not be describing
this totally accurately, but hopefully you get the point.
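
As an illustration of that "verify a predetermined fraction" policy,
here is a hypothetical sketch (the helper names are mine, not BackupPC
internals): trust the cached checksums most of the time, and only
occasionally recompute them from the uncompressed data as a guard
against corruption.

import random

VERIFY_PROB = 0.01  # e.g. verify roughly 1% of cached checksums per run

def file_digest(path, read_cached_digest, recompute_digest):
    cached = read_cached_digest(path)     # cheap: read the appended checksums
    if cached is not None and random.random() >= VERIFY_PROB:
        return cached                     # usual case: no decompression needed
    fresh = recompute_digest(path)        # slow path: uncompress and re-hash
    if cached is not None and fresh != cached:
        raise ValueError("cached checksum mismatch for %s" % path)
    return fresh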


 > After writing this, it seems to me that (2.) is what's currently
 > happening. Can anyone confirm this?
 > 
 > Are size + 2×128k checksums not enough to identify a pool file?

Of course not - consider any two files with the same size and the same
1st and 8th 128k blocks but different content in between.
My relatively small backup system has hundreds of such collisions.
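
A toy demonstration, using the same approximation of the pool hash as
in the sketch above (again, illustration only, not BackupPC's code):

import hashlib

BLOCK = 128 * 1024

def partial_digest_bytes(data):
    # same idea as before, on in-memory data: size + 1st 128K + last 128K
    md5 = hashlib.md5()
    md5.update(str(len(data)).encode())
    md5.update(data[:BLOCK])
    md5.update(data[-BLOCK:])
    return md5.hexdigest()

# two equal-length "log files" that differ only in the middle
log_a = b"A" * BLOCK + b"entry 0001\n" * 100 + b"Z" * BLOCK
log_b = b"A" * BLOCK + b"entry 0002\n" * 100 + b"Z" * BLOCK

assert log_a != log_b
assert partial_digest_bytes(log_a) == partial_digest_bytes(log_b)  # collision
print("different files, same pool hash")

Both byte strings hash to the same pool entry even though their
contents differ.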
