Subject: Re: [BackupPC-users] Concrete proposal for feature extension for pooling
From: Craig Barratt <cbarratt AT users.sourceforge DOT net>
To: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
Date: Mon, 1 Mar 2010 15:01:04 -0800
Jeffrey,

Thanks for the suggestions.  I've since decided to eliminate
hardlinks altogether in 4.x.  This is an aggressive design step,
but if successful it will resolve many of the annoying issues
with the current architecture.  (To be clear, hardlinks are
still needed in certain cases, since they provide an atomic
way of implementing certain file system operations.  But they
will no longer be used for permanent storage.)

The plan is to move to full-file MD5 digests, as we previously
discussed.  The attrib file will include extended attributes and the
file digest (extended if necessary with the chain number). So the pc
backup tree will just contain attrib files.  Each file entry in the
attrib file is its digest (plus chain number) which points to the real
file in the pool.  Chain renumbering has been eliminated, and I am
planning to eliminate BackupPC_link (by solving the race conditions
to allow pool adds to occur during backups).
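
A rough sketch of that lookup, in Perl (the fan-out layout and the helper
name here are only illustrative, not the actual 4.x code):

    use strict;
    use warnings;

    # Illustrative only: map a full-file MD5 digest (hex), plus an
    # optional chain number, to a pool file path.  This assumes a
    # 3.x-style fan-out on the leading hex digits; the real 4.x layout
    # may well differ.
    sub digest2PoolPath
    {
        my($poolDir, $digest, $chain) = @_;
        my($d1, $d2, $d3) = ($digest =~ /^(.)(.)(.)/);
        my $name = $chain ? "${digest}_$chain" : $digest;
        return "$poolDir/$d1/$d2/$d3/$name";
    }

    print digest2PoolPath("/var/lib/backuppc/cpool",
                          "d41d8cd98f00b204e9800998ecf8427e", 0), "\n";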

That means the pool file format does not need to be changed.  Whether
you access the files via the pool or the pc tree you know the digest
(either from the file name in the pool or the attrib file in the
pc tree).

Backups will be stored as reverse deltas, so only the most recent is
complete, and each prior backup is just the set of deltas needed to
re-create it.  The prior backups will no longer need to have
complete directory trees - they will only be deep enough to represent
the necessary changes from the next more recent backup.  That means
the storage will be decoupled from whether the backup itself is full
or incremental.  And all new backups will be relative to the most
recent (ie: IncrLevels will disappear).  There are several advantages
here, mainly around efficiency since the most recent backup is the
one that is used most often (for new backups or restores).  Plus
the most recent backup will be modified in place, rather than being
rewritten every time with hardlinks.  That should improve performance
too.
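
A toy model of the reverse-delta idea, in Perl (the real on-disk delta
format is still an open design question; hashes of path => digest stand
in for backups here):

    use strict;
    use warnings;

    # Illustrative only: a backup "view" is a hash of path => digest.
    # The newest backup is complete; each reverse delta records just the
    # entries that differ in the next older backup (undef means the file
    # did not exist there).  Applying the deltas newest-to-oldest
    # re-creates the older view.
    sub viewOfBackup
    {
        my($newest, @deltas) = @_;      # deltas ordered newest-to-oldest
        my %view = %$newest;
        for my $delta ( @deltas ) {
            while ( my($path, $digest) = each %$delta ) {
                if ( defined($digest) ) { $view{$path} = $digest }
                else                    { delete $view{$path} }
            }
        }
        return \%view;
    }

    my %latest = ("/etc/passwd" => "digestA", "/etc/hosts" => "digestB");
    my %delta1 = ("/etc/hosts" => "digestB-old", "/tmp/new" => undef);
    my $older  = viewOfBackup(\%latest, \%delta1);
    print "$_ => $older->{$_}\n" for sort keys %$older;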

That leaves the database question.  Rather than use an external
database my plan is to keep track of the reference count changes
and update the reference counts only daily, since that's how
often the information is needed (for cleaning).  There are some
open design issues around integrity and race conditions, and I
will need an fsck-type utility to handle the case when there is
a non-clean shutdown.
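
A rough sketch of the deferred reference counting (the file name and log
format below are made up for the example; the real bookkeeping still has
the integrity and race issues mentioned above to solve):

    use strict;
    use warnings;

    # Illustrative only: append +1/-1 reference-count changes to a log
    # as backups add or expire references, then fold them into the
    # master counts once a day.
    sub logRefDelta
    {
        my($logFile, $digest, $delta) = @_;
        open(my $fh, ">>", $logFile) or die "can't append to $logFile: $!";
        print $fh "$digest $delta\n";
        close($fh);
    }

    sub applyRefDeltas
    {
        my($logFile, $counts) = @_;
        open(my $fh, "<", $logFile) or return [];
        while ( <$fh> ) {
            my($digest, $delta) = split;
            $counts->{$digest} += $delta;
        }
        close($fh);
        # Digests whose count dropped to zero are candidates for cleaning.
        return [ grep { $counts->{$_} <= 0 } keys %$counts ];
    }

    my %counts = ("abc123" => 1);
    logRefDelta("refcnt.log", "abc123", -1);
    my $zero = applyRefDeltas("refcnt.log", \%counts);
    print "to clean: @$zero\n";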

Despite the significant changes in storage, I'm trying to make 4.x
generally backward compatible.  A 3.x pool will gracefully migrate to
an MD5 4.x pool (ie: pool files will be migrated when used in new
backups) and old 3.x backups will be browsable/restorable.  However,
one likely design decision is that the first 4.x backups will be
required to be brand-new fulls.

Craig

Jeffrey writes:

> In the past, we have had multiple discussions about adding full file
> checksums (e.g., md5, SHA-1) and/or path names to pool files to allow for
> integrity checking and reverse file look-up from the pc directory.
> 
> On the other hand, I know some people are not interested in that
> feature or overhead.
> 
> So, I would like to suggest the following compromise solution for
> discussion and improvement:
> 
> 1. Add three new "first byte" character types corresponding 1-1 to the
>    existing three (0x78, 0xd6, 0xd7), though only two may be needed
>    since it seems like 0xd6 is obsolete(?)
> 
> 2. For pool files with the new first byte characters, extend the
>    envelope footer at the end of the file to include space for the
>    checksum (128 bits if md5sum) and for the pool file name (32 hex
>    chars plus, say, another 32 bits to encode the chain number - 4
>    billion chain collisions should leave enough room - famous last
>    words). Total would be 288 bits if this scheme is used (see the
>    layout sketch after this list).
> 
> 3. Modify the handful of routines in FileZIO.pm (and
>    maybe also RsyncDigest.pm) that raw read/write pool files to
>    recognize the new first byte character flags.
> 
> 4. Create access routines that can read/write the new footer
>    information.
> 
> 5. Modify BackupPC_nightly to change the pool path in the footer
>    whenever there is chain renumbering of a file with the new first
>    byte types (should not be intensive since chain renumbering is
>    relatively rare).
> 
> 6. Write the footer information as new pool files are created by
>    modifying the relevant routines (again only a couple), and/or
>    create a separate routine that can recurse through the pool
>    directories and add the new footer information in a batch way.
> 
> 7. Create Config variables to allow the user to turn on/off writing
>    and tracking the new footer information. Checksums and pool paths
>    could be turned on/off separately for those worried about the
>    overhead of the checksum (adding the pool path has trivial
>    overhead). (Note a zero checksum or a zero pool path would signal
>    that info is not available.)
> 
> 8. More generally, but not necessary, it may be good to design the
>    footers corresponding to these new first bytes to be extensible in the
>    future to add other information if ever desired (e.g., other
>    checksums, file-level encryption keys etc.) This would require
>    some forethought and would add a little overhead in the storage and
>    access routines.
> 
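
To make the 288-bit footer in item 2 concrete, here is one possible
packing in Perl; the field order and the helper names are illustrative
assumptions, not part of the proposal itself:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Illustrative only: one possible layout for the proposed footer:
    # 128-bit full-file MD5 + 128-bit pool-file digest + 32-bit chain
    # number.  The field order is an assumption made for this sketch.
    sub packFooter
    {
        my($fileMD5, $poolDigestHex, $chain) = @_;
        return pack("a16 a16 N", $fileMD5,
                    pack("H32", $poolDigestHex), $chain);
    }

    sub unpackFooter
    {
        my($footer) = @_;
        my($fileMD5, $poolDigest, $chain) = unpack("a16 a16 N", $footer);
        return (unpack("H32", $fileMD5), unpack("H32", $poolDigest), $chain);
    }

    my $footer = packFooter(md5("file contents"),
                            "d41d8cd98f00b204e9800998ecf8427e", 3);
    printf("%d bytes: %s / %s / chain %d\n",
           length($footer), unpackFooter($footer));

Packed this way the footer is exactly 36 bytes (288 bits), matching the
total given in item 2.
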
> I believe that this proposal has several advantages:
> A. Users not interested in this functionality wouldn't be
>    affected. They wouldn't turn on the functionality so none of their
>    pool files would have the new first byte flags. In particular,
>    there would be *no* change to their pool and no added backup
>    overhead (even the tests for the new first byte would come after
>    the existing ones).
> 
> B. Changes are pretty small, limited in extent, and easy to code. I am
>    happy to help but would prefer to leave #3 to someone who knows the
>    code best to make sure all routines are patched. Also, I don't want
>    to start patching basic routines unless there is consensus that
>    this can be merged into the tree since I don't want to create a
>    fork. Also, discussion would be helpful to make sure we have a
>    robust and potentially extensible design.
> 
> C. Presence of path names greatly facilitates pool backup. Backups
>    would now happen as follows:
>    - Prevent BackupPC_nightly from running...
>    - Rsync the pool (without hard links)
>    - For the pc directory, just rsync the directory structure (or
>      otherwise copy it) and copy over files with only 1 link (almost
>      exclusively zero-length files anyway)
>    - Run a simple perl routine that recurses through the pc directory.
>      For each non-directory file with >1 link (this is *very* fast
>      using perl find), use the file itself to read its pool path name
>      from the footer and print out a two-column link list of the file's
>      pc path name and its pool path (this is a very simple routine to
>      code; see the sketch after this list)
>    - On the new backup directory, run a simple shell or perl script
>      that reads the link list and creates the links
>    The total process would be about as fast as just doing an rsync
>    without hard links on $Topdir and there would be no scaling issues
>    due to hard links.
> 
> D. Presence of checksums allows for file integrity checking either as
>    needed or on a regular basis. Of course, I know that the rsync(d)
>    method includes md4sums but that is limited to rsync(d). Also, the
>    checksums are only inserted on the second backup. Finally, newer
>    rsync versions use md5sums, so rsync's md4sums will (hopefully) soon
>    be obsolete.
> 
> E. Pool entries with or without the added footer could co-exist in a
>    single pool - just that the information wouldn't be available for
>    use if the new first bytes aren't present. The look-up routines
>    would just return an error code signalling that it's not available. Also,
>    existing pool entries could be converted at any time to the new
>    format without affecting pool integrity or touching the pc
>    hierarchy.
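
To illustrate the relinking pass in advantage C, a minimal sketch of the
pc-tree walk; readPoolPathFromFooter() is a hypothetical stand-in for the
footer access routine from item 4, stubbed out here:

    use strict;
    use warnings;
    use File::Find;

    # Hypothetical stand-in for the footer access routine from item 4:
    # it would open the file, seek to the footer, and return the pool
    # path recorded there.  Stubbed out for this sketch.
    sub readPoolPathFromFooter
    {
        my($file) = @_;
        return undef;
    }

    # Walk the pc tree; for every regular file with more than one link,
    # print "pc-path <tab> pool-path" so a later pass on the backup copy
    # can re-create the hard links.
    my $pcDir = shift(@ARGV) or die "usage: $0 /path/to/pc\n";
    find(sub {
        return unless -f $_;
        my $nlinks = (lstat($_))[3];
        return unless $nlinks > 1;
        my $poolPath = readPoolPathFromFooter($_) or return;
        print "$File::Find::name\t$poolPath\n";
    }, $pcDir);

A second pass on the copied pc directory would read those pairs and call
link() for each.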
