Re: [BackupPC-users] Copying in a file instead of backing up?
2009-01-14 09:18:11
Les Mikesell wrote at about 07:59:36 -0600 on Wednesday, January 14, 2009:
> Johan Ehnberg wrote:
> >
> >>>> OK. I can see now why this is true. But it seems like one could
> >>>> rewrite the backuppc rsync protocol to check the pool for a file with
> >>>> same checksum before syncing. This could give some real speedup on
> >>>> long files. This would be possible at least for the cpool where the
> >>>> rsync checksums (and full file checksums) are stored at the end of
> >>>> each file.
> >>> Now this would be quite the feature - and it fits perfectly with the idea
> >>> of smart pooling that BackupPC has. The effects are rather interesting:
> >>>
> >>> - Different incremental levels won't be needed to preserve bandwidth
> >>> - Full backups will indirectly use earlier incrementals as reference
> >>>
> >>> Definite wishlist item.
> >> But you'll have to read through millions of files and the common case of
> >> a growing logfile isn't going to find a match anyway. The only way this
> >> could work is if the remote rsync could send a starting hash matching
> >> the one used to construct the pool filenames - and then you still have
> >> to deal with the odds of collisions.
> >>
> >
> > Sure, you're pointing at something real there. What I don't see is
> > why we'd have to do an (extra?) read through millions of files?
>
> You are asking to find an unknown file among millions using a checksum
> that is stored at the end. How else would you find it? The normal test
> for a match uses the hashed filename to quickly eliminate the
> possibilities that aren't hash collisions - this only requires reading a
> few of the directories, not each file's contents, and is something the OS
> can do quickly.
That's why I mentioned in my previous post that a relational database
structure would be very helpful here: the current hard-link-based
storage approach allows only a single way of efficiently retrieving
pool files (other than by their backup path), and that method depends
on a non-standard partial-file md5sum. A relational database would
allow pool files to be found by any number of attributes or
md5sum-type labels.
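To make concrete what I mean by "non-standard partial-file md5sum",
here is a rough Python sketch of the pool digest as I understand
BackupPC::Lib::File2MD5 to compute it. The thresholds and path layout
are from memory, so treat this as illustrative rather than
authoritative:

import hashlib, os

def pool_digest(path):
    # Sketch of BackupPC's partial-file MD5 (File2MD5), from memory:
    # hash the decimal file length, then either the whole file (if
    # <= 256KB) or only its first and last 128KB.
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    md5.update(str(size).encode())
    with open(path, 'rb') as f:
        if size > 262144:
            md5.update(f.read(131072))     # first 128KB
            f.seek(size - 131072)
            md5.update(f.read(131072))     # last 128KB
        else:
            md5.update(f.read())           # whole file
    return md5.hexdigest()

# The pool path is then derived from the digest, roughly
# __TOPDIR__/cpool/a/b/c/abc123..., so the only cheap question the
# pool can answer is "which files have partial-MD5 digest X?".

Nothing a stock rsync sends can be matched against that digest without
either recomputing it on the client or keeping a separate index, which
is exactly where a database would help.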
>
> > That is
> > done with every full anyway,
>
> No, nothing ever searches the contents of the pool. Fulls compare
> against the previously known matching files from that client.
>
> > and in the case of an incremental it would
> > only be necessary for new/changed files. It would in fact also speed up
> > those logs because of rotation: an old log changes name but is still
> > found on the server.
>
> On the first rotation that would only be true if the log hadn't grown
> since the moment of the last backup. You'd need file chunking to take
> advantage of partial matches. After that, a rotation scheme that
> attached a timestamp to the filename would make more sense.
>
> > I suspect there is no problem in getting the hash with some tuning of
> > Rsync::Perl? It's just a command, as long as the protocol allows it.
>
> There are two problems. One is that you have a stock rsync at the other
> end, and at least for the protocols that Rsync::Perl understands there
> is no full hash of the file sent first. The other is that even if there
> were, it would have to be computed exactly the same way that BackupPC
> does for the pool filenames, or you'd spend hours looking up each
> match.
Are you sure that you can't get rsync to calculate the checksums (both
block and full-file) before the file transfer begins? I don't know --
I'm just asking...
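To be clearer about what I'm asking: as I understand the normal
exchange, what exists before any file data moves is a set of per-block
checksums of the *receiver's* old copy of the file -- roughly the shape
of this Python sketch (illustrative only; real rsync uses an
Adler-32-style rolling sum plus MD4/MD5, and the block size is
negotiated):

import hashlib

def receiver_block_sums(path, block_size=2048):
    # Illustrative stand-in for the per-block checksums the rsync
    # receiver computes from its existing copy before the transfer.
    sums = []
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            weak = sum(block) & 0xffffffff   # stand-in for the rolling sum
            strong = hashlib.md5(block).hexdigest()
            sums.append((weak, strong))
    return sums

So my question is really whether the sender can also be pushed to
compute a whole-file checksum up front (the --checksum option does
something along those lines for rsync's own skip logic), and whether
that could be made visible to BackupPC at the right point in the
protocol.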
>
> > But collisions aren't exactly a performance problem, are they? BackupPC
> > handles them nicely from what I've seen.
>
> But it must have access to the contents of the file in question to
> handle them. It might be possible to do that with an rsync block
> compare across the contents, but you'd have to repeat it over each hash
> match to determine which, if any, have the matching content. It might
> not be completely impossible to do remotely, but it would take a
> well-designed client-server protocol to match up unknown files.
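Right -- and just to spell out the local half of that: even on the
server, a pool digest only narrows things down to a chain of
candidates, and the chain still has to be resolved by comparing
contents. A rough sketch of that step (the chain naming is from my
reading of the 3.x pool layout, and I'm ignoring cpool compression for
simplicity):

import filecmp
import os

def find_pool_match(pool_dir, digest, candidate_path):
    # Files sharing a partial-MD5 digest sit side by side in the
    # pool as <digest>, <digest>_0, <digest>_1, ...  Only a full
    # content comparison tells you which, if any, really matches.
    base = os.path.join(pool_dir, digest[0], digest[1], digest[2], digest)
    entry, i = base, 0
    while os.path.exists(entry):
        if filecmp.cmp(entry, candidate_path, shallow=False):
            return entry                    # true content match
        entry = '%s_%d' % (base, i)         # next link in the chain
        i += 1
    return None                             # collisions only, or no pool entry

Doing the equivalent against a file that exists only on the client is
exactly the well-designed client-server protocol problem Les describes.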
>
> --
> Les Mikesell
> lesmikesell AT gmail DOT com