Subject: Re: [BackupPC-users] Concrete proposal for feature extension for pooling
From: Craig Barratt <cbarratt AT users.sourceforge DOT net>
To: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
Date: Mon, 1 Mar 2010 23:24:41 -0800

Jeffrey,

> Hopefully, it will also broaden the usability of BackupPC to
> filesystems and OSes that don't support hard links, since it seems
> like you will be eliminating just about all the filesystem-specific
> requirements.

The file system will still need hardlink support, but as I said,
only for certain file operations, not for permanent storage.

> Hopefully you can code this in an abstracted way so that other digests
> could be substituted for MD5 down the road if the need exists, e.g.,
> various shaXXX digests. This would make it more robust/extensible and
> avoid the concerns of people who think the roughly 1 in 10^40 chance
> of collisions is too large with MD5.

The fact that the new digest is full-file MD5 is pretty opaque to
most of the code, so, yes, other digests could be supported.
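
Roughly, the abstraction amounts to something like this (a Python
sketch only, not actual BackupPC code; BackupPC itself is Perl, and
all names here are illustrative):

    import hashlib

    # Map digest names to constructors so MD5 can later be swapped
    # for sha512 (or anything else hashlib knows) in one place.
    DIGESTS = {"md5": hashlib.md5, "sha512": hashlib.sha512}

    def file_digest(path, algo="md5", bufsize=1 << 20):
        # Stream the file so large files aren't read into memory at once.
        h = DIGESTS[algo]()
        with open(path, "rb") as f:
            while True:
                block = f.read(bufsize)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()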

> Good idea - any thought about getting rid entirely of the notion of
> chains if we used an even stronger hash function such as sha512?

Chaining will still be supported, even if the digest is stronger.  Even
with MD5 almost all users will never see a collision, and those that do
will mostly be people who download files from a site like:

    http://www.mathstat.dal.ca/~selinger/md5collision/

just to make a point.
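
For the record, a chain lookup is cheap.  In a Python-ish sketch
(the suffix naming and pool layout here are illustrative, not the
actual on-disk format):

    import filecmp, os

    def pool_lookup(pool_dir, digest, candidate):
        # Collisions stored as <digest>, <digest>_0, <digest>_1, ...;
        # walk the chain until a byte-identical file or a gap is found.
        name, i = digest, 0
        while True:
            path = os.path.join(pool_dir, name)
            if not os.path.exists(path):
                return None      # end of chain: no match in the pool
            if filecmp.cmp(path, candidate, shallow=False):
                return path      # true byte-for-byte match
            name = "%s_%d" % (digest, i)
            i += 1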

> Also, hopefully your handling of extended attributes will be
> abstracted too, so that extended attributes could be extended to
> support a broader notion of filesystem attributes and properties.
> For example, native NTFS has a rich set of ACLs that is broader even
> than what is supported by the current cygwin/rsync ACL
> implementation, which only appears to support the POSIX notion of
> ACLs. Ideally, there would be a set of constructors that would allow
> for the definition and backup of such attributes, along with
> constructors for interfacing with the different transport methods.

I'm not yet sure how to generalize things for WinXX, since we
currently rely on cygwin.  My plan is to support only xattr with
rsync on the server end, and to rely on rsync to pack incoming
ACLs into xattr via the --fake-super option (it packs the ACLs
into two xattr values, rsync.aacl and rsync.dacl).  However, the
one time I tested this, the ACL wasn't restored correctly, and I
haven't looked at it since.
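
For anyone who wants to experiment, the rsync options involved all
exist today.  Run on the backup server (the receiving side, which is
where --fake-super takes effect), something along these lines should
exercise the packing (paths are illustrative):

    rsync -aAX --fake-super root@client:/home/ /backups/client/home/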

> Ahhhh - so are you saying that the partial file md5sums (plus chain
> number) will still be used for numbering pool entries rather than the
> full md5sum (or alternative hash)?

No, the full MD5 hash will be used for naming pool files.  It will be
a new parallel pool.
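
So the digest itself becomes the pool file name.  A sketch of the
naming (the directory fan-out shown is illustrative, not a spec):

    import os

    def pool_path(pool_top, digest_hex):
        # Fan out by the leading hex digits of the digest so no
        # single directory grows huge; the exact split is made up.
        return os.path.join(pool_top, digest_hex[0:2], digest_hex[2:4],
                            digest_hex)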

> If you are worried about preserving the old pool, you could of course
> create a parallel new pool, using a new hash-function naming scheme
> based on the full file, that is hard-linked to the existing pool. This
> wouldn't require any additional storage and would just add a single
> hard link. Then the old pool could be expired as the old backups
> expire (or are converted).

Right, that's the way it is done.  If a 3.x pool file exists, it is
checked only if there are no candidate MD5 files in the new pool. If
the 3.x pool file matches it is moved (renamed) to the new pool. All
the old 3.x hardlinks are still there, and will eventually go to 1 as
the 3.x backups are expired.  If the file is expired from the 4.x pool
first, it will be removed from the new pool, and then it won't be in
either pool.  But the 3.x links are still there, and the 3.x backups
are still OK.
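
In pseudocode the lookup order is roughly this, reusing the
pool_lookup/pool_path sketches above (pool_lookup_v3, the 3.x
partial-MD5 lookup, is a hypothetical stand-in):

    import os

    def find_or_migrate(new_pool, old_pool, digest, candidate):
        match = pool_lookup(new_pool, digest, candidate)
        if match:
            return match
        old = pool_lookup_v3(old_pool, candidate)  # hypothetical 3.x lookup
        if old:
            dest = pool_path(new_pool, digest)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            # rename keeps the same inode, so existing 3.x hardlinks
            # still point at the data; only the pool name moves.
            os.rename(old, dest)
            return dest
        return None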

> I think it is a *good* idea since the whole notion of difference
> between incremental and full has become a bit vague when using
> rsync. Hopefully, you would still have the ability to verify checksums
> (occasionally) to refresh integrity.

Right.  Rsync "incrementals forever" will be supported nicely with the
new set up for people that just want speed.  In fact, there are several
options between "incremental" and "full":

  - just check existence and attributes (like current incr)

  - check existence, attributes, and full file MD5.  (I would subvert
    or augment rsync's --checksum to mean that.)  This requires a full
    file read on the client, but no more work on the server since the
    MD5 digest is stored like any other attribute.

  - check existence, attributes, optionally full file MD5, and do a
    full block compare on some files (i.e., a random subset)

  - check existence, attributes, optionally full file MD5, and do a
    full block compare on all files (like the current full).
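
A toy knob for choosing among those levels might look like this (not
an actual config variable; the sampling rate is made up):

    import random

    def needs_block_compare(level, sample_rate=0.05):
        # level 0: attributes only; level 1: attributes + full-file MD5;
        # level 2: level 1 plus a block compare on a random subset;
        # level 3: level 1 plus a block compare on every file.
        if level <= 1:
            return False
        if level == 2:
            return random.random() < sample_rate
        return True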

> Also, it might be helpful to be able to "manually" create an
> "intermediate" full tree either as a way of adding redundancy or if
> the delta-recreation starts taking too long. Perhaps this could either
> be specified as a parameter (e.g., start new tree every X backups) or
> as something that could be triggered manually.

That's a good idea but difficult.  I need to think about it.  I still
need to write the delete code, which will be able to delete any backup
(except the most recent, or any "filled" backup, to use your
terminology, unless the prior one is also filled).  If such a backup is
the oldest, it can simply be removed (taking care of reference
counting, of course).  A more recent backup will be merged with the
next older one (so the two deltas become a single cumulative delta).
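
Treating each reverse delta as a map from path to attributes, and
assuming deltas are applied walking from the newest backup backwards,
the merge is conceptually just this (a toy sketch; real code has to
handle directories, delete markers, and reference counts):

    def merge_deltas(older_delta, deleted_delta):
        # Entries from the older backup's delta win, since they are
        # applied later in the newest-to-oldest walk; entries only in
        # the deleted backup's delta are inherited as-is.
        merged = dict(deleted_delta)
        merged.update(older_delta)
        return merged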

> That being said, as you code, it might be helpful to write the attrib
> and tree access code in an abstracted way that would make it easier to
> move to an independent database in case such an approach becomes
> advantageous in the future. In fact, if I am understanding your new
> approach correctly, it seems that conceptually it is not radically
> different from a pure database approach, since it is really just a
> (flat) database broken up into multiple pieces distributed across the
> pc tree.

Right.  I'm taking advantage of the fact that I can update the
"database" in a batch manner, since I don't need real-time updates.
But I need to make sure "atomic" operations are effectively atomic,
although an unexpected shutdown or system crash can always cause count
errors.
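
The usual trick for batch-but-atomic updates is write-then-rename,
e.g. for a per-directory refcount file (the JSON format here is
illustrative, not the real on-disk format):

    import json, os

    def write_refcounts(path, counts):
        # Write a temp file in the same directory, then rename it over
        # the old one: a crash leaves either the old or the new file
        # intact, never a torn one (rename is atomic on POSIX).
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(counts, f)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)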

> Will there also be a routine to migrate a 3.x pool directly and
> immediately to the 4.x format? Since some of us keep old backups
> essentially forever, a graceful migration would never fully get there
> -- so it would be nice to have an offline manual way of converting old
> backups to the new format and then "throw away" the old 3.x pc tree
> once one is comfortable that everything is migrated safely.

I'm not planning on doing that.  Re-generating reverse delta backups
from 3.x would be quite difficult to test and be sure it worked,
and the running time would probably be very long.  It should be
harmless to have old 3.x backups lying around, except it will mean
the hardlink count will never go to 1.

For replication purposes you can just copy the new pool (ignoring
residual 3.x hardlinks), then copy any of the 4.x backup trees you
want (provided you include a contiguous set ending at the most recent
or a filled backup).  Those will be quite compact, since each is just
a directory tree and an attrib file per directory, typically 20 bytes
or so (magic number + digest).
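
Concretely, replication could be as simple as a couple of rsync
invocations (paths illustrative):

    rsync -a /srv/backuppc/pool/ replica:/srv/backuppc/pool/
    rsync -a /srv/backuppc/pc/somehost/ replica:/srv/backuppc/pc/somehost/

Note that -H shouldn't be needed: since the new pool no longer relies
on hardlinks for permanent storage, a plain recursive copy suffices.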

Craig
