Re: [BackupPC-users] Copying in a file instead of backing up?
2009-01-14 09:02:26
Johan Ehnberg wrote:
>
>>>> OK. I can see now why this is true. But it seems like one could
>>>> rewrite the backuppc rsync protocol to check the pool for a file with
>>>> same checksum before syncing. This could give some real speedup on
>>>> long files. This would be possible at least for the cpool where the
>>>> rsync checksums (and full file checksums) are stored at the end of
>>>> each file.
>>> Now this would be quite the feature - and it fits perfectly with the idea
>>> of smart pooling that BackupPC has. The effects are rather interesting:
>>>
>>> - Different incremental levels won't be needed to preserve bandwidth
>>> - Full backups will indirectly use earlier incrementals as reference
>>>
>>> Definite wishlist item.
>> But you'll have to read through millions of files and the common case of
>> a growing logfile isn't going to find a match anyway. The only way this
>> could work is if the remote rsync could send a starting hash matching
>> the one used to construct the pool filenames - and then you still have
>> to deal with the odds of collisions.
>>
>
> Sure, you are pointing at something and you are right. What I don't
> see is why we'd have to do an (extra?) read through millions of files.
You are asking to find an unknown file among millions using a checksum
that is stored at the end. How else would you find it? The normal test
for a match uses the hashed filename to quickly eliminate the
possibilities that aren't hash collisions - this only requires reading
a few directory entries, not each file's contents, and is something the
OS can do quickly.
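
To make that concrete, a rough sketch of the lookup (Python, untested,
and the digest here is only loosely modeled on backuppc's scheme - the
real one hashes the length plus parts of the contents and handles more
cases):

import hashlib
import os

def pool_key(path, chunk=131072):
    # Sketch of a pool-name digest: MD5 over the file length plus the
    # first 128KB of contents.  Loosely modeled on what backuppc does;
    # the real scheme differs in detail.
    md5 = hashlib.md5(str(os.path.getsize(path)).encode())
    with open(path, "rb") as f:
        md5.update(f.read(chunk))
    return md5.hexdigest()

def candidates(pool_root, key):
    # The key itself names the directory, so finding possible matches
    # means listing one small directory - no file contents are read.
    # Colliding files sit alongside as key_0, key_1, ...
    d = os.path.join(pool_root, key[0], key[1], key[2])
    if not os.path.isdir(d):
        return []
    return [os.path.join(d, name) for name in sorted(os.listdir(d))
            if name == key or name.startswith(key + "_")]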
> That is
> done with every full anyway,
No, nothing ever searches the contents of the pool. Fulls compare
against the previously known matching files from that client.
> and in the case of an incremental it would
> only be necessary for new/changed files. It would in fact also speed up
> those logs because of rotation: an old log changes name but is still
> found on the server.
On the first rotation that would only be true if the log hadn't grown
since the last backup. You'd need file chunking to take advantage of
partial matches. Beyond that, a rotation scheme that attached a
timestamp to the filename would make more sense.
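
By chunking I mean something like content-defined chunks, where the
cut points depend only on a small window of nearby bytes, so a log
that grows at the end still shares all of its earlier chunks. A toy
version (Python; backuppc does nothing like this today):

def chunk_boundaries(data, window=48, mask=0xFF):
    # Cut wherever a rolling sum over the last `window` bytes hits a
    # fixed bit pattern.  Boundaries depend only on nearby bytes, so
    # appending to a file leaves every earlier boundary unchanged.
    cuts, s = [], 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]
        if (s & mask) == mask:
            cuts.append(i + 1)
    return cuts

# Two files with a common prefix produce identical cut lists over that
# prefix, so only the new tail of a grown log would need transfer.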
> I suspect there is no problem in getting the hash with some tuning to
> Rsync::Perl? It's just a command as long as the protocol allows it.
There are two problems. One is that you have a stock rsync at the
other end, and at least for the protocols that Rsync::Perl understands,
no full hash of the file is sent first. The other is that even if one
were sent, it would have to be computed in exactly the same way that
backuppc computes the pool filenames, or you'd spend hours looking up
each match.
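
To put it another way: a lookup by the pool's own digest is a directory
listing, while a lookup by any other digest means re-hashing the whole
pool. Roughly (Python again, reusing candidates() from the earlier
sketch; hash_file is a hypothetical stand-in for whatever digest the
client sent):

def find_by_pool_key(pool_root, key):
    # Cheap: the digest itself says which directory to look in.
    return candidates(pool_root, key)

def find_by_other_digest(pool_root, digest, hash_file):
    # Expensive: the pool is not indexed by this digest, so the only
    # way to find a match is to re-hash every file in the pool -
    # the "hours per lookup" problem.
    hits = []
    for dirpath, _, names in os.walk(pool_root):
        for name in names:
            p = os.path.join(dirpath, name)
            if hash_file(p) == digest:
                hits.append(p)
    return hits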
> Collisions aren't exactly a performance problem, are they? BackupPC
> handles them nicely from what I've seen.
But it must have access to the contents of the file in question to
handle them. It might be possible to do that with an rsync block
compare across the contents, but you'd have to repeat it for each hash
match to determine which, if any, have the matching content. It might
not be completely impossible to do remotely, but it would take a well
designed client-server protocol to match up unknown files.
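
In outline, the client would have to compute the pool-compatible
digest plus block checksums up front, and the server would verify each
candidate block by block before skipping the transfer. Entirely
hypothetical - no version of rsync or backuppc speaks this:

import hashlib

def block_sums(path, blocksize=2048):
    # Per-block strong checksums, stand-ins for rsync's block digests.
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                return sums
            sums.append(hashlib.md5(block).hexdigest())

def server_match(pool_root, key, client_sums):
    # Check the few pool files sharing the client's pool digest, and
    # block-compare each one to rule out hash collisions.  Only a
    # true match lets the transfer be skipped.
    for path in candidates(pool_root, key):  # from the earlier sketch
        if block_sums(path) == client_sums:
            return path      # link to this pool file, send nothing
    return None              # no match: fall back to a normal transfer

# Client sends (pool_key(file), block_sums(file)); the server answers
# "have it" or "send it".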
--
Les Mikesell
lesmikesell AT gmail DOT com