Re: [BackupPC-users] Backing up a BackupPC server

Holger Parplies wrote at about 00:57:28 +0200 on Wednesday, June 3, 2009:
 > Hi,
 > 
 > Les Mikesell wrote on 2009-06-02 17:32:24 -0500 [Re: [BackupPC-users] Backing up a BackupPC server]:
 > > Jeffrey J. Kosowsky wrote:
 > > > [...]
 > > > Once we are talking about redoing things, I would prefer to use a
 > > > full md5sum hash for the name of the pool file. [...]
 > > > With this approach then you would automatically have "a common hashed
 > > > filename that is ['statistically'] unique across all instances for
 > > > every piece of content."
 > > 
 > > Somehow the number of possible different file contents and the number of
 > > possible md5sums don't seem quite statistically equivalent to me.  And
 > > then there's:
 > > 
 > > http://www.mscs.dal.ca/~selinger/md5collision/
 > 
 > first of all, if you are *not* using rsync, you *don't* get a *full* md5sum
 > hash for free or even cheap. You (Jeffrey) know the code well enough to
 > realize that BackupPC goes to great pains to avoid writing to the pool disk
 > unless necessary. If you need to transfer the whole file (of arbitrary size)
 > before you can look up the pool entry, you *have to* write a temporary copy
 > (probably compressed, too, giving up the benefits you gain from only
 > compressing once and decompressing when matching). You have to handle
 > collisions just the same (meaning re-reading your temporary copy and comparing
 > to the pool file). Yuck.
 > 
 > Yes, you can special-case small files that fit into memory, but yuck just the
 > same.
 > 
 > If you use a *partial* md5sum, there's no gain from rsync, and you trivially
 > get collisions just like you do now.
 > 
 > That is not to say, if we end up using a database, that it would not be a good
 > idea to store the full md5sum in the database. In fact, with a database, file
 > names would be somewhat arbitrary, and I'd propose keeping them *short* for
 > the sake of rsync et al. and file lists.
 > 
 > Regards,
 > Holger

I guess my point was as follows:
- If you use rsync, then you get the md5sums for free.
- Even if you don't use rsync, given the speed of current processors,
  calculating the md5sum doesn't take any longer than a full file
  compare. True, a compare can stop as soon as a difference arises,
  but that is not really relevant: if a file is different you will
  have to copy it over anyway, and the md5sum doesn't add significant
  overhead to the copy operation since you have to read in the whole
  file regardless.
- The md5sums for the pool only need to be calculated once and can
  then be appended (or prepended) to the pool file.
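
To make the second point concrete, here is a minimal sketch in Python
(not BackupPC's actual Perl code) of computing the full md5sum in the
same pass that copies the data, so the file is only read once. The
function name copy_with_md5 and the chunk size are my own assumptions
for illustration.

```python
import hashlib
import io


def copy_with_md5(src, dst, chunk_size=1 << 20):
    """Copy src to dst chunk by chunk and return the hex MD5 of the data.

    Hypothetical helper: src and dst are any binary file-like objects.
    The digest is updated on the same chunks that are written, so no
    second read of the file is needed.
    """
    digest = hashlib.md5()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    source = io.BytesIO(b"example file contents\n" * 1000)
    target = io.BytesIO()
    print(copy_with_md5(source, target), len(target.getvalue()))
```

Note that the digest is only known once the last chunk has been
written, so this does not by itself avoid writing a temporary copy
before the pool lookup.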

I'm tired and I haven't looked at the code in a few months, so maybe
I'm forgetting something, but I'm having trouble remembering what the
advantage is of using the partial md5sum hashes on a fast (i.e. modern)
computer where the limitation is disk speed and/or network
bandwidth. It seems that any time you have to read/write the
entire file, calculating the md5sum will only introduce relatively
trivial overhead compared to the disk read/write or network transfer.

I like the idea of using the full md5sum for the following reasons:
1. It allows you to check file (and hence pool) integrity at any point
2. It can be used to "uniquely" (from a statistical perspective) label
   pool files without any real chance of a collision. If you are still
   worried about a collision with 128 bit md5sums, I'm sure simple
   ways can be found to extend it that make the chance of a collision
   even more infinitesimal.
3. If the md5sum is appended/prepended to the pool file then the name
   of the pool file can be found by reading any of its hard links in
   the pc tree
4. Full-file md5sums are consistent with protocol >= 30 rsync and come
   for "free" when using rsync. Since they are there anyway, why use
   an alternative and less precise (and also confusing) partial md5sum
   hash when you can use the full md5sum?
5. Using full md5sums would get rid of the confusion between the
   partial md5sums used for pool hash names, the md4sums used in
   protocol 28 rsync and the regular *nix md5sum function.
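
As a rough illustration of reasons 1 and 2, here is a sketch in Python
of a pool where each file is named by the full md5sum of its contents,
so integrity can be checked at any time by re-hashing. Everything here
is hypothetical: the two-character fan-out directory, the function
names, and the assumption that equal digests mean equal contents (no
collision chains). It also ignores compression, which the real pool
would have to deal with.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pool_path(pool_dir, hex_digest):
    """Hypothetical layout: <pool>/<first two hex chars>/<full digest>."""
    return os.path.join(pool_dir, hex_digest[:2], hex_digest)


def store(pool_dir, data):
    """Store data under its full md5sum; an existing entry is reused.

    Assumption: identical digests are treated as identical contents.
    """
    path = pool_path(pool_dir, hashlib.md5(data).hexdigest())
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(data)
    return path


def find_corrupt(pool_dir):
    """Return the sorted pool files whose contents no longer match their name."""
    corrupt = []
    for root, _dirs, files in os.walk(pool_dir):
        for name in files:
            path = os.path.join(root, name)
            if md5_of_file(path) != name:
                corrupt.append(path)
    return sorted(corrupt)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as pool:
        store(pool, b"hello")
        print(find_corrupt(pool))
```

If the digest were instead appended or prepended to the pool file as
in reason 3, the same check would compare against the stored value
rather than the file name.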
