Subject: Re: [BackupPC-users] Slow backups; Collision issues
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 01 Jul 2009 16:00:49 -0400
James Esslinger wrote at about 11:20:14 -0500 on Wednesday, July 1, 2009:
 > Comments inline.
 > 
 > Jeffrey J. Kosowsky wrote:
 > > 
 > > This makes sense and is relevant to a recent thread we had on pooling
 > > and hashing.
 > > 
 > > I imagine that all your bitmaps in addition to being the same size
 > > also have the same first 1MB (or at least the same 1st and 8th 128KB
 > > blocks). This would lead to them all having the same pool hash which
 > > would then require the file to be compared against all 4527 files in
 > > the chain. If the files are big and substantially the same, then this
 > > comparison *will* take a while though I'm not sure why it would take
 > > 2-3 days.
 > 
 > I'm not really sure either, but it looks as though every bitmap is
 > compared against every other file in this 4527-entry chain.  One
 > iteration through the chain seems to take up to 5 minutes.
 > 
 > In the case of these bitmaps, they are basically plot graphs with a
 > lot of white space and are almost identical, which causes the 1st
 > and 8th 128KB blocks to be the same.
 > 
 > > This could be solved (to varying degrees) by any of the several
 > > suggestions that I have previously made for modifying the pool
 > > hashing, including:
 > > 
 > > 1. Using a full file md5sum hash rather than a partial one
 > > 2. Using the full file md5sum hash as the index in case of a collision
 > > 3. Adding the full file md5sum hash to the file header.
 > > 
 > > I know there are various pros/cons to each of these extensions,
 > > but given that most people probably use rsync/rsyncd, and that
 > > protocol 30 gives you full-file md5sums for free, it seems to make
 > > sense to consider taking advantage of a full-file md5sum hash
 > > either instead of or in addition to the partial-file md5sum hash
 > > that is used now.
 > 
 > Having some type of option to choose the hashing mechanism would be
 > nice.
 > 
 > > How many such bitmap files are there? Unless there are many thousands
 > > of them, I'm not sure how it goes from 4-5 hours to 3+ days.
 > 
 > As I stated above, comparing one file to all the other files with the
 > same hash takes about 5 minutes.  So doing this 4500 more times really
 > extends the backup time.  There doesn't seem to be any type of
 > caching for this operation, so it has to read the files from the
 > filesystem every time, which isn't very efficient.
 > 

Yes. And caching would be hard to implement since there is no
guarantee (in general) that you will soon read another file whose
pool hash collides with the cached one.
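
For illustration, here is a minimal sketch (Python for readability;
BackupPC itself is Perl, and this is not its actual code) of a
partial-file pool digest along the lines described above: hash the
file length plus the 1st and 8th 128KB blocks. Two large plots that
agree in size and in those two blocks get the same digest and land in
the same chain:

    import hashlib, os

    def partial_pool_digest(path):
        # Simplified sketch: the real scheme treats small files
        # differently; this only covers the large-file case.
        block = 128 * 1024
        md5 = hashlib.md5()
        md5.update(str(os.path.getsize(path)).encode())
        with open(path, "rb") as f:
            md5.update(f.read(block))   # 1st 128KB block
            f.seek(7 * block)           # skip to the 8th 128KB block
            md5.update(f.read(block))
        return md5.hexdigest()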

I think the only real solution is to use full-file md5sums. At a
minimum, add them when there are pool collisions.
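
As a rough sketch of what that buys (hypothetical helpers, not
BackupPC code): once each pool file's header carries a full-file
md5sum, walking a collision chain becomes one 16-byte compare per
entry instead of a full read and bitwise compare of each file:

    import hashlib

    def full_file_md5(path, chunk=1 << 20):
        # Computed once, when the file is first written to the pool.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            while data := f.read(chunk):
                md5.update(data)
        return md5.digest()   # 16 raw bytes

    def find_in_chain(candidate_md5, chain):
        # 'chain' is a hypothetical list of (path, stored_md5) pairs
        # read from the pool file headers.
        for path, stored_md5 in chain:
            if stored_md5 == candidate_md5:   # one 16-byte compare
                return path
        return None   # no match: the file starts a new chain entry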

But I still believe that computing full-file md5sums and adding them
to the pool file header will add at most minimal computational
overhead and will have many benefits. Specifically:

1. If you use rsync with protocol 30, then you will get the md5sum
   for free.

2. Otherwise, you only have to compute the md5sum once for each pool
   entry. Calculating the md5sum at the time of initial pool write
   should be fast relative to the bandwidth for transferring the file
   in initially and relative to the disk write time to add the file to
   the pool.

3. Having an md5sum allows one to resolve collisions with a simple
   compare of two 16-byte numbers versus reading in the entire files
   and comparing them bitwise.

4. Having an md5sum allows one to check pool integrity for file
   corruption, as sketched below.
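
As a sketch of point 4 (again hypothetical, and assuming the stored
digest covers the file contents as they would be read back, i.e. the
uncompressed data for a compressed pool), an integrity pass just
recomputes each digest and compares it to the stored one:

    import hashlib

    def verify_pool(entries):
        # 'entries' is a hypothetical iterable of (path, stored_md5)
        # pairs taken from the pool file headers.
        for path, stored_md5 in entries:
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                while data := f.read(1 << 20):
                    md5.update(data)
            if md5.digest() != stored_md5:
                print("pool corruption detected:", path)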
