Re: [BackupPC-users] Concrete proposal for feature extension for pooling

From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: Craig Barratt <cbarratt AT users.sourceforge DOT net>
Date: Mon, 01 Mar 2010 19:10:15 -0500
All-in-all sounds AWESOME - can't wait to see it! 
My inline comments below...
Craig Barratt wrote at about 15:01:04 -0800 on Monday, March 1, 2010:
 > Jeffrey,
 > 
 > Thanks for the suggestions.  I've since decided to eliminate
 > hardlinks altogether in 4.x.  This is an aggressive design step,
 > but if successful it will resolve many of the annoying issues
 > with the current architecture.  (To be clear, hardlinks are
 > still needed in certain cases, since they provide an atomic
 > way of implementing certain file system operations.  But they
 > will no longer be used for permanent storage.)

I'm sure that wasn't an easy decision but given that probably more
than half the recent posts here involve issues related to hard links,
I think it is the right long-term decision.

Hopefully, it will also broaden the usability of BackupPC to
filesystems and OSes that don't support hard links, since it seems
you will be eliminating just about all of the filesystem-specific
requirements.

 > The plan is to move to full-file MD5 digests, as we previously
 > discussed.  The attrib file will include extended attributes and the
 > file digest (extended if necessary with the chain number). So the pc
 > backup tree will just contain attrib files.  Each file entry in the
 > attrib file is its digest (plus chain number) which points to the real
 > file in the pool.

Couple of thoughts/suggestions/questions.

Hopefully you can code this in an abstracted way so that other digests
could be substituted for MD5 down the road if the need arises, e.g.,
the various SHA-family digests. This would make it more
robust/extensible and avoid the concerns of people who think the
roughly 1-in-10^38 chance of a collision is too large with MD5.
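
For instance, a minimal sketch of the kind of abstraction I have in
mind (Python just for illustration; the config key and names are
invented):

    import hashlib

    # Hypothetical: map a config value (say, $Conf{PoolDigest}) to a
    # constructor, so the rest of the code never hard-codes MD5.
    DIGESTS = {
        'md5':    hashlib.md5,
        'sha256': hashlib.sha256,
        'sha512': hashlib.sha512,
    }

    def pool_digest(path, algo='md5', bufsize=1 << 20):
        """Full-file digest used to name/locate the pool entry."""
        h = DIGESTS[algo]()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(bufsize), b''):
                h.update(block)
        return h.hexdigest()

Swapping digests then becomes a one-line config change rather than a
code change.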

Also, hopefully your handling of extended attributes will be
abstracted too, so that the mechanism could be extended to support a
broader set of filesystem attributes and properties. For example,
native NTFS has a rich set of ACLs, broader even than what the
current cygwin/rsync ACL implementation supports, which appears
limited to the POSIX notion of ACLs. Ideally, there would be a set of
constructors that would allow such attributes to be defined and
backed up, along with constructors for interfacing with the different
transport methods.
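
Conceptually I'm picturing something like this (purely a sketch;
every name here is invented):

    from abc import ABC, abstractmethod

    class AttrCodec(ABC):
        """Hypothetical codec for one family of attributes (POSIX
        ACLs, NTFS ACLs, ...): serializes them into the attrib file
        on backup and re-applies them on restore."""

        @abstractmethod
        def read(self, path):
            """Return a serializable blob of attributes for path."""

        @abstractmethod
        def apply(self, path, blob):
            """Re-apply a previously stored blob to path."""

    # A new attribute family (e.g. native NTFS ACLs fetched over a
    # different transport) would just register another codec rather
    # than requiring changes to the core backup code.
    CODECS = {}

    def register_codec(name, codec):
        CODECS[name] = codec

Each transfer method could then simply declare which attribute
families it is able to supply.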

 >  Chain renumbering has been eliminated, and I am
 > planning to eliminate BackupPC_link (by solving the race conditions
 > to allow pool adds to occur during backups).

Good idea - any thought about getting rid entirely of the notion of
chains if we used an even stronger hash function such as sha512?

I mean the chance of a collision is then so vanishingly small that
you are much more likely to have other types of failure. And of course
given current processors, the cost of computing hashes is a lot less
than that of disk reads/writes and network bandwidth.
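
Back-of-the-envelope, just to make the point (birthday bound: with n
files and a b-bit digest, the chance of any collision at all is
roughly n^2 / 2^(b+1)):

    def collision_odds(n_files, bits):
        # Birthday approximation: n^2 / 2^(bits + 1)
        return n_files ** 2 / 2 ** (bits + 1)

    print(collision_odds(10 ** 12, 512))  # ~3.7e-131 with sha512
    print(collision_odds(10 ** 12, 128))  # ~1.5e-15 even with md5

Even a trillion pool files under sha512 gives odds you could never
observe in practice; the disk itself is astronomically more likely to
silently corrupt the data first.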

 > That means the pool file format does not need to be changed.  Whether
 > you access the files via the pool or the pc tree you know the digest
 > (either from the file name in the pool or the attrib file in the
 > pc tree).

Ahhhh - so are you saying that the partial file md5sums (plus chain
number) will still be used for numbering pool entries rather than the
full md5sum (or alternative hash)?

If you are worried about preserving the old pool, you could of course
create a parallel new pool, named under a new full-file hash scheme,
whose entries are hard-linked to the existing pool files. This
wouldn't require any additional storage and would just add a single
hard link per file. Then the old pool could be expired as the old
backups expire (or are converted).
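
Roughly like this, I mean (a sketch only -- invented paths, and it
glosses over the compressed pool, where you would have to uncompress
before hashing):

    import hashlib, os

    def link_into_new_pool(old_path, new_pool='/var/lib/backuppc/pool4'):
        # Name the file by its full-file digest and hard-link it into
        # the new pool: no data is copied, just one extra link per file.
        h = hashlib.md5()
        with open(old_path, 'rb') as f:
            for block in iter(lambda: f.read(1 << 20), b''):
                h.update(block)
        digest = h.hexdigest()
        # Fan out on the leading hex chars, like the existing pool layout.
        new_dir = os.path.join(new_pool, digest[0], digest[1])
        os.makedirs(new_dir, exist_ok=True)
        new_path = os.path.join(new_dir, digest)
        if not os.path.exists(new_path):
            os.link(old_path, new_path)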

 > Backups will be stored as reverse deltas, so only the most recent is
 > complete, and all the prior backups are just the deltas to re-create
 > the prior backups.  The prior backups will no longer need to have
 > complete directory trees - they will only be deep enough to represent
 > the necessary changes from the next more recent backup.  That means
 > the storage will be decoupled from whether the backup itself is full
 > or incremental.  And all new backups will be relative to the most
 > recent (ie: IncrLevels will disappear).  There are several advantages
 > here, mainly around efficiency since the most recent backup is the
 > one that is used most often (for new backups or restores).  Plus
 > the most recent backup will be modified in place, rather than being
 > rewritten every time with hardlinks.  That should improve performance
 > too.

I think it is a *good* idea, since the whole notion of the difference
between incremental and full has become a bit vague when using rsync.
Hopefully you would still have the ability to verify checksums
(occasionally) to refresh integrity.

Also, it might be helpful to be able to "manually" create an
"intermediate" full tree either as a way of adding redundancy or if
the delta-recreation starts taking too long. Perhaps this could either
be specified as a parameter (e.g., start a new tree every X backups) or
as something that could be triggered manually.
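
To make the reverse-delta cost concrete: browsing backup N means
starting from the newest tree and folding in each older delta until
you reach N -- something like this (all structures invented):

    def reconstruct(backups, target):
        """backups is ordered newest first; index 0 holds the only
        complete tree, and each older entry holds just the changes
        (changed entries plus deletions) relative to the next newer
        view."""
        view = dict(backups[0].tree)
        for b in backups[1:]:
            if b.num < target:       # already reached the target view
                break
            view.update(b.changed)
            for name in b.deleted:   # present newer, absent older
                view.pop(name, None)
        return view

The loop makes the cost visible: reaching a very old backup applies
every intervening delta, which is exactly why an occasional
"intermediate" full tree might be worth the space.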

 > That leaves the database question.  Rather than use an external
 > database my plan is to keep track of the reference count changes
 > and update the reference counts only daily, since that's how
 > often the information is needed (for cleaning).  There are some
 > open design issues around integrity and race conditions, and I
 > will need an fsck-type utility to handle the case when there is
 > a non-clean shutdown.

Given the past flame-wars, I think your approach is a good
compromise. The elimination of hard links helps separate the data from
the database (which in your case is really just the filesystem tree of
attrib files).
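
For the daily reference counts, I imagine something in the spirit of
an append-only journal that gets folded in nightly (the layout here
is entirely invented):

    import collections, json

    def log_refcount_change(journal, digest, change):
        # During a backup: append +1/-1 records.  Append-only writes
        # are cheap and easy to replay after a non-clean shutdown.
        journal.write(json.dumps({'d': digest, 'c': change}) + '\n')

    def apply_journal(counts, journal_path):
        # Nightly: fold the journal into the real counts, then the
        # journal can be truncated.  An fsck would instead rebuild
        # counts from scratch by walking every attrib file.
        delta = collections.Counter()
        with open(journal_path) as f:
            for line in f:
                rec = json.loads(line)
                delta[rec['d']] += rec['c']
        for digest, c in delta.items():
            counts[digest] = counts.get(digest, 0) + c
        return counts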

That being said, as you code, it might be helpful to write the attrib
and tree access code in an abstracted way that would make it easier
to move to an independent database later, should such an approach
ever become advantageous. In fact, if I am understanding your new
approach correctly, it is conceptually not that radically different
from a pure database approach: it is really just a (flat) database
broken up into multiple pieces distributed across the pc tree.
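
Even an interface as thin as this would keep the door open
(hypothetical, of course):

    from abc import ABC, abstractmethod

    class AttribStore(ABC):
        """One backend walks attrib files in the pc tree; a future one
        could answer the same calls from SQLite or Postgres without
        the rest of BackupPC noticing."""

        @abstractmethod
        def get(self, host, backup_num, path):
            """Return the attrib entry (digest, attributes) for path."""

        @abstractmethod
        def put(self, host, backup_num, path, entry):
            """Store or replace the attrib entry for path."""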

 > Despite the significant changes in storage, I'm trying to make 4.x
 > generally backward compatible.  A 3.x pool will gracefully migrate to
 > an MD5 4.x pool (ie: pool files will be migrated when used in new
 > backups) and old 3.x backups will be browsable/restorable.  However,
 > one likely design decision is that it will be required that the first
 > 4.x backups will have to be brand new fulls.

Sounds good.

Will there also be a routine to migrate a 3.x pool directly and
immediately to the 4.x format? Since some of us keep old backups
essentially forever, a graceful migration would never fully get there
-- so it would be nice to have an offline, manual way of converting
old backups to the new format and then "throwing away" the old 3.x pc
tree once one is comfortable that everything has migrated safely.
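
In rough outline (a sketch; the per-backup converter is passed in,
since that is the part that doesn't exist yet):

    import os

    def migrate_pc_tree(pc_root, convert_backup):
        # Hypothetical offline pass: convert every host's 3.x backups
        # oldest to newest, so the newest ends up as the one complete
        # 4.x tree with reverse deltas behind it.
        for host in sorted(os.listdir(pc_root)):
            host_dir = os.path.join(pc_root, host)
            nums = [d for d in os.listdir(host_dir) if d.isdigit()]
            for bnum in sorted(nums, key=int):
                convert_backup(host, int(bnum))
        # after verifying, the 3.x pc trees can be removed in one go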

In summary, this all sounds AWESOME -- can't wait to see it!
