Subject: Re: [BackupPC-users] Copying in a file instead of backing up?
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 14 Jan 2009 09:15:51 -0500
Les Mikesell wrote at about 07:59:36 -0600 on Wednesday, January 14, 2009:
 > Johan Ehnberg wrote:
 > >
 > >>>> OK. I can see now why this is true. But it seems like one could
 > >>>> rewrite the backuppc rsync protocol to check the pool for a file with
 > >>>> same checksum  before syncing. This could give some real speedup on
 > >>>> long files. This would be possible at least for the cpool where the
 > >>>> rsync checksums (and full file checksums) are stored at the end of
 > >>>> each file.
 > >>> Now this would be quite the feature - and it fits perfectly with the idea 
 > >>> of smart pooling that BackupPC has. The effects are rather interesting:
 > >>>
 > >>> - Different incremental levels won't be needed to preserve bandwidth
 > >>> - Full backups will indirectly use earlier incrementals as reference
 > >>>
 > >>> Definite wishlist item.
 > >> But you'll have to read through millions of files, and the common case 
 > >> of a growing logfile isn't going to find a match anyway.  The only way this 
 > >> could work is if the remote rsync could send a starting hash matching 
 > >> the one used to construct the pool filenames - and then you still have 
 > >> to deal with the odds of collisions.
 > >>
 > > 
 > > Sure, you are pointing at something and are right. What I don't see 
 > > is why we'd have to do an (extra?) read through millions of files.
 > 
 > You are asking to find an unknown file among millions using a checksum 
 > that is stored at the end.  How else would you find it?  The normal test 
 > for a match uses the hashed filename to quickly eliminate the 
 > possibilities that aren't hash collisions - this only requires reading a 
 > few of the directories, not each file's contents, and is something the 
 > OS can do quickly.

That's why I mentioned in my previous post that a relational database
structure would be very helpful here: the current hard-link-based
storage approach allows only a single way of efficiently retrieving
pool files (other than by their backup path), and that method depends
on a non-standard partial-file md5sum. A relational database would
allow pool files to be found by any number of attributes or
md5sum-type labels.
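
To make that concrete, here is a minimal Python sketch (not BackupPC
code) of both halves of the point: a partial-file digest used as a pool
key, loosely in the spirit of BackupPC's pool names, and a hypothetical
SQLite index that would let pool files be found by any stored digest.
All table, column, and function names are invented for illustration.

    import hashlib
    import os
    import sqlite3

    def pool_key(path, chunk=131072):
        """Partial-file digest, loosely in the spirit of BackupPC's
        pool names: hash the file length plus a prefix of the data, so
        candidates can be narrowed without reading whole files."""
        md5 = hashlib.md5(str(os.path.getsize(path)).encode())
        with open(path, "rb") as f:
            md5.update(f.read(chunk))
        return md5.hexdigest()

    # Hypothetical relational index: one row per pool file with several
    # independent digests, so lookup no longer depends on the single
    # digest encoded in the hard-link file name.
    db = sqlite3.connect("pool_index.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pool_files (
                      pool_path   TEXT PRIMARY KEY,
                      partial_md5 TEXT,
                      full_md5    TEXT,
                      size        INTEGER)""")
    db.execute("CREATE INDEX IF NOT EXISTS idx_full ON pool_files(full_md5)")

    def index_pool_file(path):
        """Record every digest we know for one pool file."""
        full = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                full.update(block)
        db.execute("INSERT OR REPLACE INTO pool_files VALUES (?, ?, ?, ?)",
                   (path, pool_key(path), full.hexdigest(),
                    os.path.getsize(path)))
        db.commit()

    def find_by_full_md5(digest):
        """The kind of lookup a relational index enables: find pool
        candidates by a digest other than the pool-name one."""
        return [row[0] for row in db.execute(
            "SELECT pool_path FROM pool_files WHERE full_md5 = ?",
            (digest,))]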

 > 
 > > That is
 > > done with every full anyway,
 > 
 > No, nothing ever searches the contents of the pool.  Fulls compare 
 > against the previously known matching files from that client.
 > 
 > > and in the case of an incremental it would 
 > > only be necessary for new/changed files. It would in fact also speed 
 > > up those rotated logs: an old log changes name but its contents are 
 > > still found on the server.
 > 
 > On the first rotation that would only be true if the log hadn't grown 
 > since the moment of the last backup.  You'd need file chunking to take 
 > advantage of partial matches.  After that, a rotation scheme that 
 > attached a timestamp to the filename would make more sense.
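
In case it helps, "file chunking" here would mean hashing fixed-size
pieces of a file and deduplicating piece by piece, so a grown logfile
still shares every chunk of its unchanged prefix. A minimal Python
sketch of the idea, not anything BackupPC actually does:

    import hashlib

    def chunk_digests(path, chunk_size=1 << 20):
        """Digest a file in fixed-size chunks.  Two files with a common
        prefix yield identical leading digests, so only the changed
        tail of an appended-to log would need storing or sending."""
        digests = []
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk_size)
                if not block:
                    break
                digests.append(hashlib.sha1(block).hexdigest())
        return digests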
 > 
 > > I suspect there is no problem in getting the hash with some tuning 
 > > of Rsync::Perl? It's just a command, as long as the protocol allows it.
 > 
 > There are two problems.  One is that you have a stock rsync at the other 
 > end, and at least for the protocols that Rsync::Perl understands, there 
 > is not a full hash of the file sent first.  The other is that even if 
 > there were, it would have to be computed in exactly the same way that 
 > BackupPC does for the pool filenames, or you'll spend hours looking up 
 > each match.

Are you sure that you can't get rsync to calculate the checksums (both
block and full-file) before the file transfer begins? I don't know; I'm
just asking.
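
For what it's worth, stock rsync does compute checksums before file
data flows: with --checksum it hashes whole files up front to decide
what to send, and during a transfer the receiver hands the sender a
weak rolling checksum plus a stronger digest for each block of its
copy. The snag, as Les says, is that none of these matches BackupPC's
partial-file pool digest. A Python sketch of the weak rolling checksum
described in the rsync technical report:

    M = 1 << 16

    def weak_checksum(block):
        """s = a + 2^16 * b over one block of bytes, as in the rsync
        technical report (with the character offset taken as 0)."""
        a = b = 0
        n = len(block)
        for i, x in enumerate(block):
            a = (a + x) % M
            b = (b + (n - i) * x) % M
        return (b << 16) | a

    def roll(a, b, out_byte, in_byte, n):
        """Slide an n-byte window right by one byte in O(1); this is
        what lets the sender test for block matches at every offset."""
        a = (a - out_byte + in_byte) % M
        b = (b - n * out_byte + a) % M
        return a, b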

 > 
 > > But collisions aren't exactly a performance problem, are they? 
 > > BackupPC handles them nicely from what I've seen.
 > 
 > But it must have access to the contents of the file in question to 
 > handle them.  It might be possible to do that with an rsync block 
 > compare across the contents, but you'd have to repeat it over each hash 
 > match to determine which, if any, have the matching content. It might 
 > not be completely impossible to do remotely, but it would take a 
 > well-designed client-server protocol to match up unknown files.
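
Locally, that verification step is a cheap loop. A Python sketch,
assuming a BackupPC-style pool layout where colliding files share a
hash name with _0, _1, ... suffixes:

    import filecmp
    import glob
    import os

    def find_pool_match(candidate_path, pool_dir, pool_hash):
        """Byte-compare a new file against every pool file sharing its
        hash.  Easy with local access to both; the hard part above is
        doing this remotely for a file that exists only on the
        client."""
        for pool_file in glob.glob(os.path.join(pool_dir, pool_hash) + "*"):
            if filecmp.cmp(candidate_path, pool_file, shallow=False):
                return pool_file
        return None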
 > 
 > -- 
 >    Les Mikesell
 >      lesmikesell AT gmail DOT com

------------------------------------------------------------------------------
This SF.net email is sponsored by:
SourceForge Community
SourceForge wants to tell your story.
http://p.sf.net/sfu/sf-spreadtheword
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/