Subject: Re: [BackupPC-users] Why does backuppc transfer files already in the pool
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: Craig Barratt <cbarratt AT users.sourceforge DOT net>
Date: Sun, 29 Aug 2010 22:35:05 -0400
Craig Barratt wrote at about 18:45:05 -0700 on Sunday, August 29, 2010:
 > Although I probably won't support it initially in 4.0, my plan would be
 > to use the --checksum option to pre-match potential files in the pool
 > if there isn't an existing file with the same path (ie: for new or
 > renamed files).

Sounds like a good idea.
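
Just to make sure I follow the idea, here's a rough Python sketch of
the pre-match (the pool root and the digest-addressed layout below are
my assumptions for illustration, not the actual 4.x code):

import hashlib, os

POOL = "/var/lib/backuppc/pool"   # assumed pool root, purely illustrative

def file_md5(path):
    # Full-file MD5, streamed in 1 MB chunks so large files are cheap
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def pool_candidate(path):
    # Assume a digest-addressed layout like pool/ab/cd/abcdef...;
    # the real 4.x layout may differ.
    d = file_md5(path)
    cand = os.path.join(POOL, d[:2], d[2:4], d)
    return cand if os.path.exists(cand) else None

If pool_candidate() returns a hit, the file never needs to cross the
wire at all.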

 > In 4.x, --checksum would allow an efficient transfer of any new file
 > that was already in the pool.  I would plan to use "--checksum" for
 > full backups (it's too expensive on the client for incrementals). I'll
 > probably make it a user-configured option whether a "full" does this
 > shortcut based only on the full-file MD5 matching (ie: skips block
 > digest checking), or whether a full also requires block digest matching
 > too even if the full-file MD5 matches; it could be a probability so
 > that any corruption or digest collisions (very unlikely with full-file
 > MD5, although examples are now well known) are slowly fixed.

If you also compared the file size, a non-maliciously constructed
collision would be even more unlikely, since I would imagine the size
and the MD5 are "relatively" independent checks.
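
In sketch form, the pooling key would just become the (size, MD5)
pair (illustrative Python, not BackupPC code):

import hashlib, os

def match_key(path):
    # (length, full-file MD5) as the pooling key; an accidental
    # collision must now also agree on size.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return (os.path.getsize(path), h.hexdigest())

# Two files are pooled together only if match_key(a) == match_key(b)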


 > 
 > If you are comfortable with a full backup just comparing full-file
 > MD5 digests (and all file attributes too), then there is a massive
 > reduction in server load, since the MD5 digest is now stored in the
 > attribute file (since it's the path to the pool file; no hardlinks,
 > remember) - it's essentially no more effort to compare MD5 digests
 > than it is to compare the other file attributes.  Basically the
 > client does most of the work for a full, since it needs to read
 > every file to compute the full-file MD5 digests.  But the server has
 > no more work to do than an incremental if files haven't changed.
 > 
Sounds great.
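
If I understand correctly, the unchanged-file test on the server then
reduces to comparing stored attribute records, something like this
sketch (the field names here are made up):

def unchanged(old_attr, new_attr):
    # old_attr/new_attr: dicts like {"size": ..., "mtime": ...,
    # "mode": ..., "md5": ...} read from the attribute files.
    # With the digest already in the record, checking it costs no
    # more than checking size or mtime - no server file I/O at all.
    keys = ("size", "mtime", "mode", "md5")
    return all(old_attr[k] == new_attr[k] for k in keys)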

 > If you are more cautious you could increase the "block-digest-check"
 > probability to, eg, 1%, 10%, or 100%.  The last case would make it
 > behave like 3.x - every file in a full does block digest checking
 > (and consequently full file digest checking too).  However, the
 > client load will be higher since each file will be read twice
 > in this case.

I guess if you are really paranoid, there is a vanishingly small
chance that the block checksums could also collide...
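
For what it's worth, the knob you describe could be as simple as this
sketch (the constant name and the 1% value are mine, for
illustration):

import random

BLOCK_CHECK_PROB = 0.01   # 1%; 1.0 would mimic the 3.x behaviour

def needs_block_check():
    # Even when the full-file MD5 matches, re-verify block digests
    # with probability p, so corruption or a collision in the pool
    # is eventually caught and fixed.
    return random.random() < BLOCK_CHECK_PROB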

I think for the average user, the real question is whether, setting
aside maliciously-created collisions, the real-world probability of a
collision is large enough to worry about relative to the
order-of-magnitude probabilities of other failure points.

The back-of-the-envelope calculations I did last year seem to suggest
that the chance of a random collision is vanishingly small for any
reasonable number of files.
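
For the archives, the back-of-the-envelope version is the birthday
bound: n random 128-bit digests collide with probability on the order
of n^2 / 2^129. For example:

# Birthday bound for 128-bit MD5: P(collision) ~ n**2 / 2**129
n = 10**9           # an (unreasonably large) pool of a billion files
p = n**2 / 2**129
print(p)            # ~1.5e-21, i.e. vanishingly small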

And again, if the file size could also be checked, I would imagine
that the chance of a collision would go down by another couple of
orders of magnitude (assuming MD5 collisions are not highly correlated
with file size).

