[BackupPC-users] Concrete proposal for feature extension for pooling

In the past, we have had multiple discussions about adding full file
checksums (e.g., md5, SHA-1) and/or path names to pool files to allow for
integrity checking and reverse file look-up from the pc directory.

On the other hand, I know some people are not interested in that
feature or overhead.

So, I would like to suggest the following compromise solution for
discussion and improvement:

1. Add three new "first byte" character types corresponding 1-1 to the
   existing 3 ones (0x78, 0xd6, 0xd7). Though you may only need 2
   since it seems like 0xd6 is obsolete(?)

2. For pool files with the new first byte characters, extend the
   envelope footer at the end of the file to include space for the
   checksum (128 bits if md5sum) and for the pool file name (32 hex
   chars plus say another 32 bits to code the chain number - 4 billion
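   chain collisions should leave enough room - famous last
   words). Total would be 288 bits (36 bytes) if this schema is used
   and the 32 hex chars are stored as 128 binary bits: 128 for the
   checksum + 128 for the pool file name + 32 for the chain number.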

3. Modify the handful of routines in FileZIO.pm (and
   maybe also RsyncDigest.pm) that raw read/write pool files to
   recognize the new first byte character flags.

4. Create access routines that can read/write the new footer
   information.

5. Modify BackupPC_nightly to change the pool path in the footer
   whenever there is chain renumbering of a file with the new first
   byte types (should not be intensive since chain renumbering is
   relatively rare).

6. Either write the trailer information as new pool files are created
   by modifying the relevant routines (again only a couple) and/or
   create a separate routine that can recurse through the pool
   directories and create the new footer information in a batch way.

7. Create Config variables to allow the user to turn on/off writing
   and tracking the new footer information. Checksums and pool paths
   could be turned on/off separately for those worried about the
   overhead of the checksum (adding the pool path has trivial
   overhead). (Note that a zero checksum or a zero pool path would
   signal that the info is not available.)

8. More generally, but not necessary, it may be good to design the
   footers corresponding to these new first bytes to be extensible in the
   future to add other information if ever desired (e.g., other
   checksums, file-level encryption keys etc.) This would require
   some forethought and would add a little overhead in the storage and
   access routines.
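
To make items 1, 2 and 4 a bit more concrete, here is a rough sketch
of what the footer layout and access routines might look like. It is
written in Python purely for illustration (the real routines would be
Perl, in or near FileZIO.pm). Everything named here is an assumption
on my part: the new first byte values are arbitrary placeholders, the
field order is just one possibility, and the names FOOTER, add_footer
and read_footer are hypothetical.

```python
import struct

# Hypothetical mapping from the existing first bytes to new "has footer"
# first bytes. The values on the right are placeholders only.
NEW_FIRST_BYTE = {0x78: 0xB8, 0xD6: 0xE6, 0xD7: 0xE7}
OLD_FIRST_BYTE = {new: old for old, new in NEW_FIRST_BYTE.items()}

# Assumed layout: 16-byte MD5, 16-byte pool file name (32 hex chars stored
# as binary), 32-bit big-endian chain number. 36 bytes = 288 bits.
FOOTER = struct.Struct(">16s16sI")


def add_footer(pool_data, md5, pool_hex, chain=0):
    """Return pool_data with the new first byte and the footer appended.

    Pass 16 zero bytes as md5, or 32 zeros as pool_hex, to mark that
    field as "not available".
    """
    new_flag = NEW_FIRST_BYTE[pool_data[0]]
    footer = FOOTER.pack(md5, bytes.fromhex(pool_hex), chain)
    return bytes([new_flag]) + pool_data[1:] + footer


def read_footer(pool_data):
    """Return (md5, pool_hex, chain), or None for old-style pool files.

    A field that is all zeros comes back as None (info not available).
    """
    if len(pool_data) < 1 + FOOTER.size or pool_data[0] not in OLD_FIRST_BYTE:
        return None
    md5, pool, chain = FOOTER.unpack(pool_data[-FOOTER.size:])
    return (md5 if any(md5) else None,
            pool.hex() if any(pool) else None,
            chain)
```

Old-style files are never touched by this: read_footer just returns
None for them, which is the "not available" answer the look-up
routines would give.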

I believe that this proposal has several advantages:
A. Users not interested in this functionality wouldn't be
   affected. They wouldn't turn on the functionality so none of their
   pool files would have the new first byte flags. In particular,
   there would be *no* change to their pool and no added backup
   overhead (even the tests for the new first byte would come after
   the existing ones).

B. Changes are pretty small, limited in extent, and easy to code. I am
   happy to help but would prefer to leave #3 to someone who knows the
   code best to make sure all routines are patched. Also, I don't want
   to start patching basic routines unless there is consensus that
   this can be merged into the tree since I don't want to create a
   fork. Also, discussion would be helpful to make sure we have a
   robust and potentially extensible design.

C. Presence of path names greatly facilitates pool backup. Backups
   would now happen as follows.
   - Prevent BackupPC_nightly from running...
   - Rsync pool (without hard links)
   - For the pc directory, just rsync the directory structure (or
     otherwise copy) and copy over files with only 1 link (almost
     exclusively zero length files anyway).
   - Run a simple perl routine that recurses through the pc directory.
     For each non-directory file with >1 link (this is *very* fast
     using perl find), use the file itself to read its pool path name
     from the footer and print out a two column link list of the file's pc
     path name and its pool path (this is a very simple routine to code)
   - On the new backup directory, run a simple shell or perl script
     that reads the link list and creates the links
   The total process would be about as fast as just doing an rsync
   without hard links on $Topdir and there would be no scaling issues
   due to hard links.

D. Presence of checksums allows for file integrity checking either as
   needed or on a regular basis. Of course, I know that the rsync(d)
   method includes md4sums but that is limited to rsync(d). Also, the
   checksums are only inserted on the second backup. Finally, newer
   rsync versions use md5sums so rsync's md4sums will (hopefully) soon
   be obsolete.

E. Pool entries with or without the added footer could co-exist in a
   single pool - just that the information wouldn't be available for
   use if the new first bytes aren't present. The look-up routines
   would just return an error code signalling not available. Also,
   existing pool entries could be converted at any time to the new
   format without affecting pool integrity or touching the pc
   hierarchy.
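
As a rough illustration of the link-list steps in C, here is a sketch
of the two routines, in Python rather than perl just to show the
shape. It assumes a footer reader already exists: read_pool_path is a
hypothetical callable that takes a file path and returns the pool path
stored in its footer (or None if not available). The function names
and the tab-separated list format are also my own assumptions, and
the sketch assumes path names contain no tabs or newlines.

```python
import os


def build_link_list(pc_dir, read_pool_path):
    """Yield (pc_path, pool_path) for each file under pc_dir with >1 link.

    pc_path is relative to pc_dir. read_pool_path is the hypothetical
    footer reader described above.
    """
    for root, _dirs, files in os.walk(pc_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.lstat(path).st_nlink > 1:  # hard-linked into the pool
                pool_path = read_pool_path(path)
                if pool_path is not None:
                    yield os.path.relpath(path, pc_dir), pool_path


def write_link_list(pairs, out):
    """Write the two-column list, one tab-separated pair per line."""
    for pc_path, pool_path in pairs:
        out.write(f"{pc_path}\t{pool_path}\n")


def recreate_links(list_file, new_pc_dir, new_pool_dir):
    """On the copy, hard-link each pc path to its (already copied) pool file."""
    for line in list_file:
        pc_path, pool_path = line.rstrip("\n").split("\t")
        target = os.path.join(new_pc_dir, pc_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(os.path.join(new_pool_dir, pool_path), target)
```

The first two routines would run on the source machine after the pool
rsync; the third runs on the copy once the list has been transferred.
Files whose footer has no pool path are skipped here and would need to
be copied the ordinary way.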
