Re: [BackupPC-users] Backing up a BackupPC server
2009-06-03 00:58:59
Holger Parplies wrote at about 00:57:28 +0200 on Wednesday, June 3, 2009:
> Hi,
>
> Les Mikesell wrote on 2009-06-02 17:32:24 -0500 [Re: [BackupPC-users]
> Backing up a BackupPC server]:
> > Jeffrey J. Kosowsky wrote:
> > > [...]
> > > Once we are talking about redoing things, I would prefer to use a
> > > full md5sum hash for the name of the pool file. [...]
> > > With this approach then you would automatically have "a common hashed
> > > filename that is ['statistically'] unique across all instances for
> > > every piece of content."
> >
> > Somehow the number of possible different file contents and the number
> > of possible md5sums don't seem quite statistically equivalent to me. And
> > then there's:
> >
> > http://www.mscs.dal.ca/~selinger/md5collision/
>
> first of all, if you are *not* using rsync, you *don't* get a *full* md5sum
> hash for free or even cheap. You (Jeffrey) know the code well enough to
> realize that BackupPC goes to great pains to avoid writing to the pool disk
> unless necessary. If you need to transfer the whole file (of arbitrary size)
> before you can look up the pool entry, you *have to* write a temporary copy
> (probably compressed, too, giving up the benefits you gain from only
> compressing once and decompressing when matching). You have to handle
> collisions just the same (meaning re-reading your temporary copy and
> comparing to the pool file). Yuck.
>
> Yes, you can special-case small files that fit into memory, but yuck just the
> same.
>
> If you use a *partial* md5sum, there's no gain from rsync, and you trivially
> get collisions just like you do now.
>
> That is not to say, if we end up using a database, that it would not be a
> good idea to store the full md5sum in the database. In fact, with a database,
> file names would be somewhat arbitrary, and I'd propose keeping them *short*
> for the sake of rsync et al. and file lists.
>
> Regards,
> Holger

I guess my point was as follows:
- If you use rsync, then you get the md5sums for free.
- Even if you don't use rsync, given the speed of current processors,
  calculating the md5sum doesn't take any longer than a full file
  compare. A compare can stop as soon as a difference arises, but that
  is not really relevant: if a file is different you have to copy it
  over anyway, and since the copy already reads in the whole file, the
  md5sum adds no significant overhead relative to it.
- The md5sums for the pool only need to be calculated once and then
  appended (or prepended) to the pool file.
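
As a minimal sketch of the second point, assuming Python's standard
hashlib rather than BackupPC's own Perl code: the digest can be updated
from the same chunks the copy already reads, so no second pass over the
file is needed. The function name copy_with_md5 and the chunk size are
hypothetical, chosen only for illustration.

```python
import hashlib


def copy_with_md5(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path and return the hex md5 of the bytes copied.

    Hypothetical helper: each chunk is hashed as it passes through,
    so the file is read exactly once.
    """
    digest = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the chunk we already hold in memory
            dst.write(chunk)      # then write it out
    return digest.hexdigest()
```

Whether the extra hashing is really negligible depends on how the CPU
compares with the disk or network on a given machine; the sketch shows
only that it costs no additional I/O.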

I'm tired and I haven't looked at the code in a few months, so maybe
I'm forgetting something, but I'm having trouble remembering what the
advantage is of using the partial md5sum hashes on a fast (i.e. modern)
computer where the limitation is disk speed and/or network
bandwidth. It seems that any time you have to read/write the
entire file, calculating the md5sum will introduce only trivial
overhead relative to the disk read/write or network transfer.

I like the idea of using the full md5sum for the following reasons:
1. It allows you to check file (and hence pool) integrity at any point.
2. It can be used to "uniquely" (from a statistical perspective) label
   pool files without any real chance of a collision. If you are still
   worried about a collision with 128-bit md5sums, I'm sure simple
   ways can be found to extend it that make the chance of a collision
   even more infinitesimal.
3. If the md5sum is appended/prepended to the pool file, then the name
   of the pool file can be found by reading any of its hard links in
   the pc tree.
4. Full-file md5sums are consistent with protocol >= 30 rsync and come
   for "free" when using rsync. Since they are there anyway, why use
   an alternative and less precise (and also confusing) partial md5sum
   hash when you can use the full md5sum?
5. Using full md5sums would get rid of the confusion between the partial
   md5sums used for pool hash names, the md4sums used in protocol 28
   rsync and the regular *nix md5sum function.
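
Points 1 and 2 can be sketched roughly as follows, again in Python
rather than BackupPC's Perl. Everything here is a hypothetical
illustration: the flat pool directory, the function names, and the
uncompressed storage are assumptions, not BackupPC's actual layout. In
this sketch the digest exists only as the pool file name; the
appended/prepended digest of point 3 is not implemented.

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of the whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(pool_dir, src_path, link_path):
    """Pool src_path under its full md5 and hard-link it at link_path.

    Hypothetical layout: one flat directory, file name == hex digest.
    Identical content from different hosts maps to the same pool file.
    """
    pool_file = os.path.join(pool_dir, file_md5(src_path))
    if not os.path.exists(pool_file):
        os.makedirs(pool_dir, exist_ok=True)
        shutil.copyfile(src_path, pool_file)
    parent = os.path.dirname(link_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    os.link(pool_file, link_path)  # raises if link_path already exists
    return pool_file


def verify(pool_file):
    """Integrity check: the content must still hash to the file's name."""
    return file_md5(pool_file) == os.path.basename(pool_file)
```

Note that this sketch trusts the digest alone and never compares file
contents, so it is only as safe as the "no real chance of a collision"
assumption in point 2.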