Re: [BackupPC-users] Copying in a file instead of backing up?
2009-01-14 09:02:26
Johan Ehnberg wrote:
>
>>>> OK. I can see now why this is true. But it seems like one could
>>>> rewrite the backuppc rsync protocol to check the pool for a file with
>>>> same checksum before syncing. This could give some real speedup on
>>>> long files. This would be possible at least for the cpool where the
>>>> rsync checksums (and full file checksums) are stored at the end of
>>>> each file.
>>> Now this would be quite the feature - and it fits perfectly with the idea
>>> of smart pooling that BackupPC has. The effects are rather interesting:
>>>
>>> - Different incremental levels won't be needed to preserve bandwidth
>>> - Full backups will indirectly use earlier incrementals as reference
>>>
>>> Definite wishlist item.
>> But you'll have to read through millions of files and the common case of
>> a growing logfile isn't going to find a match anyway. The only way this
>> could work is if the remote rsync could send a starting hash matching
>> the one used to construct the pool filenames - and then you still have
>> to deal with the odds of collisions.
>>
>
> Sure, you are pointing at something and you are right. What I don't
> see is why we'd have to do an (extra?) read through millions of files.
You are asking to find an unknown file among millions using a checksum
that is stored at the end. How else would you find it? The normal test
for a match uses the hashed filename to quickly eliminate the
possibilities that aren't hash collisions - this only requires reading
a few directory entries, not each file's contents, and is something the
OS can do quickly.
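
To make that concrete, a rough sketch of the lookup (Python, untested,
and the digest here is only loosely modeled on backuppc's scheme - the
real one hashes the length plus parts of the contents and handles more
cases):

import hashlib
import os

def pool_key(path, chunk=131072):
    # Sketch of a pool-name digest: MD5 over the file length plus the
    # first 128KB of contents.  Loosely modeled on what backuppc does;
    # the real scheme differs in detail.
    md5 = hashlib.md5(str(os.path.getsize(path)).encode())
    with open(path, "rb") as f:
        md5.update(f.read(chunk))
    return md5.hexdigest()

def candidates(pool_root, key):
    # The key itself names the directory, so finding possible matches
    # means listing one small directory - no file contents are read.
    # Colliding files sit alongside as key_0, key_1, ...
    d = os.path.join(pool_root, key[0], key[1], key[2])
    if not os.path.isdir(d):
        return []
    return [os.path.join(d, name) for name in sorted(os.listdir(d))
            if name == key or name.startswith(key + "_")]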
> That is
> done with every full anyway,
No, nothing ever searches the contents of the pool. Fulls compare
against the previously known matching files from that client.
> and in the case of an incremental it would
> only be necessary for new/changed files. It would in fact also speed up
> those logs because of rotation: an old log changes name but is still
> found on the server.
On the first rotation that would only be true if the log hadn't grown
since the last backup. You'd need file chunking to take advantage of
partial matches. Beyond that, a rotation scheme that attached a
timestamp to the filename would make more sense.
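
By chunking I mean something like content-defined chunks, where the
cut points depend only on a small window of nearby bytes, so a log
that grows at the end still shares all of its earlier chunks. A toy
version (Python; backuppc does nothing like this today):

def chunk_boundaries(data, window=48, mask=0xFF):
    # Cut wherever a rolling sum over the last `window` bytes hits a
    # fixed bit pattern.  Boundaries depend only on nearby bytes, so
    # appending to a file leaves every earlier boundary unchanged.
    cuts, s = [], 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]
        if (s & mask) == mask:
            cuts.append(i + 1)
    return cuts

# Two files with a common prefix produce identical cut lists over that
# prefix, so only the new tail of a grown log would need transfer.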
> I suspect there is no problem in getting the hash with some tuning to
> Rsync::Perl? It's just a command as long as the protocol allows it.
There are two problems. One is that you have a stock rsync at the
other end, and at least for the protocols that Rsync::Perl understands,
no full hash of the file is sent first. The other is that even if one
were sent, it would have to be computed in exactly the same way that
backuppc computes the pool filenames, or you'd spend hours looking up
each match.
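
To put it another way: a lookup by the pool's own digest is a directory
listing, while a lookup by any other digest means re-hashing the whole
pool. Roughly (Python again, reusing candidates() from the earlier
sketch; hash_file is a hypothetical stand-in for whatever digest the
client sent):

def find_by_pool_key(pool_root, key):
    # Cheap: the digest itself says which directory to look in.
    return candidates(pool_root, key)

def find_by_other_digest(pool_root, digest, hash_file):
    # Expensive: the pool is not indexed by this digest, so the only
    # way to find a match is to re-hash every file in the pool -
    # the "hours per lookup" problem.
    hits = []
    for dirpath, _, names in os.walk(pool_root):
        for name in names:
            p = os.path.join(dirpath, name)
            if hash_file(p) == digest:
                hits.append(p)
    return hits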
> Collisions aren't exactly a performance problem, are they? BackupPC
> handles them nicely from what I've seen.
But it must have access to the contents of the file in question to
handle them. It might be possible to do that with an rsync block
compare across the contents, but you'd have to repeat it for each hash
match to determine which, if any, have the matching content. It might
not be completely impossible to do remotely, but it would take a well
designed client-server protocol to match up unknown files.
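
In outline, the client would have to compute the pool-compatible
digest plus block checksums up front, and the server would verify each
candidate block by block before skipping the transfer. Entirely
hypothetical - no version of rsync or backuppc speaks this:

import hashlib

def block_sums(path, blocksize=2048):
    # Per-block strong checksums, stand-ins for rsync's block digests.
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                return sums
            sums.append(hashlib.md5(block).hexdigest())

def server_match(pool_root, key, client_sums):
    # Check the few pool files sharing the client's pool digest, and
    # block-compare each one to rule out hash collisions.  Only a
    # true match lets the transfer be skipped.
    for path in candidates(pool_root, key):  # from the earlier sketch
        if block_sums(path) == client_sums:
            return path      # link to this pool file, send nothing
    return None              # no match: fall back to a normal transfer

# Client sends (pool_key(file), block_sums(file)); the server answers
# "have it" or "send it".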
--
Les Mikesell
lesmikesell AT gmail DOT com