BackupPC-users

Re: [BackupPC-users] Why does backuppc transfer files already in the pool

2010-08-29 21:47:49
Subject: Re: [BackupPC-users] Why does backuppc transfer files already in the pool
From: Craig Barratt <cbarratt AT users.sourceforge DOT net>
To: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
Date: Sun, 29 Aug 2010 18:45:05 -0700
Jeffrey writes:

> True - I haven't seen any mention in the documentation of any 'flag'
> that would send checksums.

There is an rsync option --checksum that will compute and send a
full-file MD5 digest from the client for every file as part of the
initial file list.  It is there as an alternative to attribute
checking (ie: mtime and size) to see if a file should be skipped
or inspected further (ie: for incremental backups).
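
As a rough illustration of the two skip tests described above (this
is not rsync's code, just a minimal sketch), the decision could look
like the following.  The names full_file_md5 and can_skip, and the
layout of the "known" record, are assumptions made for this example.

```python
import hashlib
import os


def full_file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def can_skip(path, known, use_checksum=False):
    """Decide whether a file can be skipped.

    known is a hypothetical record from the previous backup:
    {"size": int, "mtime": int seconds, "md5": hex string}.
    """
    st = os.stat(path)
    if st.st_size != known["size"]:
        return False  # a size change always means the file differs
    if use_checksum:
        # checksum mode: ignore mtime, compare the full-file digest
        return full_file_md5(path) == known["md5"]
    # attribute mode: same size and same mtime is treated as unchanged
    return int(st.st_mtime) == known["mtime"]
```

Note that checksum mode has to read every file in full, which is why
it costs so much more on the client than the attribute check.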

In 4.x I have implemented full file MD5 digests (as you proposed :)),
to match rsync 3.x.

Although I probably won't support it initially in 4.0, my plan would be
to use the --checksum option to pre-match potential files in the pool
if there isn't an existing file with the same path (ie: for new or
renamed files).

As several people have pointed out, this isn't possible in BackupPC
3.x. The only time rsync can do an efficient transfer in 3.x is if
there is an existing file for that host/share with the same path
already backed up.

In 4.x, --checksum would allow an efficient transfer of any new file
that is already in the pool.  I would plan to use --checksum for
full backups (it's too expensive on the client for incrementals).  I'll
probably make it a user-configured option whether a full takes this
shortcut based only on the full-file MD5 matching (i.e., skips block
digest checking), or whether it also requires block digest matching
even when the full-file MD5 matches.  It could be a probability, so
that any corruption or digest collisions (very unlikely with full-file
MD5, although examples are now well known) are slowly fixed.
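
To make the pre-matching idea concrete, here is a minimal sketch of
how a server might classify each entry in the client's file list.
It is not BackupPC code: plan_transfer, the action labels, and the
in-memory dict/set standing in for the previous backup and the pool
are all assumptions made for this example.

```python
def plan_transfer(file_list, previous_backup, pool):
    """Classify each file sent by the client.

    file_list:       iterable of (path, md5_hex) from the client
    previous_backup: dict mapping path -> md5_hex of the last backup
    pool:            set of md5_hex digests already in the pool
    Returns a dict mapping path -> action label (labels are made up).
    """
    plan = {}
    for path, digest in file_list:
        if path in previous_backup:
            if previous_backup[path] == digest:
                plan[path] = "unchanged"
            else:
                # same path, different content: normal rsync delta
                plan[path] = "delta-against-previous"
        elif digest in pool:
            # new or renamed file whose content is already pooled
            plan[path] = "pool-match"
        else:
            plan[path] = "full-transfer"
    return plan
```

The "pool-match" branch is the case that 3.x cannot handle: there is
no file at that path to compare against, but the content is already
stored.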

If you are comfortable with a full backup just comparing full-file
MD5 digests (and all file attributes too), then there is a massive
reduction in server load, since the MD5 digest is now stored in the
attribute file (it's the path to the pool file; no hardlinks,
remember).  It's essentially no more effort to compare MD5 digests
than it is to compare the other file attributes.  Basically the client
does most of the work for a full since it needs to read every file
computing the full-file MD5 digests.  But the server has no more
work to do than an incremental if files haven't changed.

If you are more cautious you could increase the "block-digest-check"
probability to, eg, 1%, 10%, or 100%.  The last case would make it
behave like 3.x - every file in a full does block digest checking
(and consequently full file digest checking too).  However, the
client load will be higher since each file will be read twice
in this case.
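
A minimal sketch of how such a probability could be applied per file
is below.  The function name needs_block_check and its parameters
are assumptions made for this example, not an actual BackupPC
setting or API.

```python
import random


def needs_block_check(md5_matches, probability, rng=random):
    """Return True if a file should also get block digest checking.

    md5_matches: whether the client's full-file MD5 equals the stored one
    probability: fraction of matching files to verify (0.0 to 1.0)
    rng:         object with a random() method, injectable for testing
    """
    if not md5_matches:
        return True  # a mismatch always needs the full comparison
    # random() is in [0.0, 1.0), so 1.0 checks every file and 0.0 none
    return rng.random() < probability
```

With probability 0.01, each full would verify roughly one matching
file in a hundred, so every file gets rechecked eventually over many
fulls without paying the double-read cost on each one.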

Craig
