Subject: Re: [BackupPC-users] How does BackupPC_tarPCCopy getting around hard link issue?
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 21 Jan 2011 16:31:31 -0500
Jeffrey J. Kosowsky wrote at about 15:44:45 -0500 on Friday, January 21, 2011:
 > 
 > Jeffrey J. Kosowsky wrote at about 12:07:36 -0500 on Friday, January 21, 2011:
 >  > AHHHH OK - so no magic.
 >  > I just coded up a new way that should in general be significantly
 >  > faster.
 >  > 
 >  > Basically, I create a new inode-centered pool that I call 'ipool' that
 >  > is a decimal-based tree (rather than the hexadecimal-based pool/cpool
 >  > trees). You can set how many levels you want.  Then I recurse through
 >  > the pool/cpool and for every entry, I store a corresponding file in
 >  > the ipool based on the pool/cpool *inode* number. The file's contents
 >  > are set to the *name* of the pool/cpool file (actually the path
 >  > relative to TopDir). Note that the ipool is indexed by the least
 >  > significant digits of the inode number to ensure more uniform
 >  > distribution across the tree.
 >  > 
 >  > Then you can recurse through the pc tree and quickly look up each
 >  > inode to find its pool/cpool location via my ipool construct.
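
To make the ipool idea concrete, here is a minimal, untested Perl
sketch. The TopDir path, the ipool location, the 2-level depth, and
the helper names (ipool_file, ipool_lookup) are illustrative
assumptions, not the actual implementation:

use File::Find;
use File::Path qw(mkpath);

my $TopDir   = "/var/lib/backuppc";   # assumed TopDir, adjust to taste
my $IpoolDir = "$TopDir/ipool";       # assumed ipool location
my $Levels   = 2;                     # number of levels in the ipool tree

# ipool entry for an inode: least significant digits first, so the
# entries spread more uniformly across the tree
sub ipool_file {
    my ($inode) = @_;
    my @digits = reverse split //, sprintf("%0${Levels}d", $inode % 10**$Levels);
    return join("/", $IpoolDir, @digits, $inode);
}

# Build pass: for every cpool file, store its TopDir-relative path in a
# small file named after its inode number
find(sub {
        return unless -f $_;
        my $inode = (lstat($_))[1];
        my $entry = ipool_file($inode);
        (my $dir  = $entry) =~ s{/[^/]+$}{};
        mkpath($dir) unless -d $dir;
        (my $rel  = $File::Find::name) =~ s{^\Q$TopDir\E/}{};
        open(my $fh, '>', $entry) or die "$entry: $!";
        print $fh $rel;
        close($fh);
    }, "$TopDir/cpool");

# Lookup: map an inode seen in the pc tree back to its cpool path
sub ipool_lookup {
    my ($inode) = @_;
    open(my $fh, '<', ipool_file($inode)) or return undef;
    chomp(my $rel = <$fh>);
    close($fh);
    return $rel;
}
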
 >  > 
 >  > I haven't benchmarked, but I have to believe that this will in general
 >  > be significantly faster than (re)computing the partial file md5sum for
 >  > each file in the pc tree (though caching does help of course). Also my
 >  > method requires constant memory so it scales nicely.
 >  > 
 >  > Finally, I'm not sure if you implement it in BackupPC_tarPCCopy, but
 >  > if for some reason a pc tree entry (other than backupInfo) does not
 >  > have its inode in the ipool then I flag it and optionally correct it
 >  > by linking the file back into the pool/cpool. By the way, this alone
 >  > could be used as a much faster approach to solving Robin's question
 >  > earlier where she needed to check and fix a large pc tree where a
 >  > number of files had nlinks >1 but *none* of them were in the
 >  > pool/cpool.
 >  > 
 > 
 > My plan is to "borrow" some of the code from BackupPC_tarPCCopy but
 > only use it to create the directory tree, the zero length files and
 > the hard links.
 > 
 > 1. All files below the backup number level that are either hard linked
 >    to the pool, are zero length, or are directories are fed to the
 >    tar subroutines in BackupPC_tarPCCopy to create a single tar file
 >    of hard links, zero length files, and directory entries. (This may
 >    also allow me to streamline some of the code in the subroutines,
 >    since, for example, the hard link targets are never long links and
 >    the size never exceeds the old tar format's size limits)
 > 
 > 2. If there is a non-zero length regular file in the pc tree below the
 >    share level that is not linked into the pool but is either called
 >    'attrib' or has an f-mangled name, then it is a valid BackupPC file
 >    and it should be linked back into the pool. The program makes that
 >    fix by default, but one can choose not to make such fixes. If the
 >    fix is made, then the file is backed up as in #1 above. If not,
 >    then it generates an exception for optional backup as in #4 below.
 > 
 > 3. The /pc/<host>/<num>/backupInfo file and any files at the backup
 >    number level or higher should not be hard linked to the pool and
 >    can be backed up with regular binary tar. For these, I generate a
 >    list that I then pipe to binary tar (which is faster than perl tar).
 >    In general, though, there are (relatively) not too many of these
 >    files, so I could just use perl tar without any real slowdown.
 > 
 > 4. Any other file goes onto a third 'error' list, since such files
 >    really either shouldn't be there at all or should have been fixed
 >    as in #2 above. This error list should be reviewed and can then
 >    optionally (and manually) be piped to tar if you decide to back
 >    those files up. (A rough sketch of this classification follows
 >    this list.)
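
A rough Perl sketch of that classification (ipool_lookup is the helper
sketched earlier; above_share_level is a hypothetical stand-in for
whatever path test the real code would use; the case names are made up):

# Classify one entry found under pc/<host>/<num>/ into the four cases above
sub classify {
    my ($path) = @_;
    my ($inode, $size) = (lstat($path))[1, 7];

    return 'TAR_LINKS' if -d _ || $size == 0;             # case 1: directory or zero length
    return 'TAR_LINKS' if defined ipool_lookup($inode);   # case 1: already linked to the pool
    return 'TOP_LEVEL' if above_share_level($path);       # case 3: backupInfo, logs, etc.
    return 'FIX_LINK'  if $path =~ m{/(attrib|f[^/]*)$};  # case 2: valid BackupPC file missing its pool link
    return 'ERROR';                                       # case 4: review manually
}
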
 > 
 > So basically you end up with 3 outputs:
 > A. Tar file of the hard links, directory entries, and zero length
 >    files in the pc tree (the tar file is generated internally based on
 >    Craig's routines)
 > B. Standard tar file of valid top level non-hard linked BackupPC log
 >    and info files
 > C. Error list of files not backed up by A&B above that you can then
 >    choose to feed to tar if you still want to back them up. If you
 >    allow the program to fix missing hard links, then this will *only*
 >    consist of non-BackupPC generated files so there is a good chance
 >    you don't even want to back these up.
 > 
 > Overall, I would think one would get a significant speed-up over
 > BackupPC_tarPCCopy for the following reasons:
 > 1. Pool locations are looked up by inode rather than recomputed via
 >    partial-file md5sums, and no cache is needed, which for large
 >    backups could otherwise slow your system down if memory is short.
 >    I believe this is pretty significant.
 > 2. Files that *should* be hard linked to the pool are corrected and
 >    linked back in, which both fixes the error and speeds up the backup
 >    since you now just need to back up the link and not the data
 > 3. Non-zero length data files are backed up using binary tar which is
 >    supposed to be quite a bit faster than perl-based tar
 > 4. The perl tar code can be simplified/streamlined since we know we
 >    have just one of 3 cases (Directory, Hard Link, Zero length file)
 >    and we never have large file sizes or large link name targets to
 >    deal with (though the file name itself may be long)
 > 
 > Any thoughts?
 >      
Alternatively, I am thinking of avoiding tar altogether.

First, rather than generating a tar file for the directories,
zero-length files, and hard links, it would be simpler and maybe faster
(on both the encoding and decoding ends) to generate a simple list of
the form:

1. If hard link:
<path to cpool hard link> <file name>

2. If zero length file:
Z <uid> <gid> <mode> <mtime> <ctime???> <file name>

3. If directory:
D <uid> <gid> <mode> <mtime> <ctime???> <directory name>

Clearly this would be simpler to generate, has less header overhead
than tar files, has no name-length restrictions, is easier for humans
to read & parse, and can be extended easily to add other attributes.
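
For concreteness, entries might look something like this (host, paths,
uid/gid, times, and pool hash are all made up for illustration, and the
pool hash is truncated):

cpool/3/a/f/3af07e2b... pc/myhost/123/f%2fhome/fjeff/fnotes.txt
Z 500 500 0644 1295643275 1295643275 pc/myhost/123/f%2fhome/fjeff/fempty.log
D 500 500 0755 1295643275 1295643275 pc/myhost/123/f%2fhome/fjeff
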

Unpacking the output on the receiving end would just require a simple
<INFILE> loop with the 3 above cases to:
1. link <path to cpool hard link> <file name>
2. sysopen/chown/chmod/etc.
3. mkdir/chown/chmod/etc.
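
A minimal, untested sketch of that receiving-end loop, assuming the
list format above, that the current directory is the destination
TopDir, and that directories appear in the list before their contents
($listFile and INFILE are placeholder names):

use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

my $listFile = shift @ARGV;               # the list produced on the sending side
open(INFILE, '<', $listFile) or die "$listFile: $!";
while (my $line = <INFILE>) {
    chomp $line;
    if ($line =~ /^Z /) {                 # zero-length file
        # the split limit keeps any spaces in the final (name) field
        my (undef, $uid, $gid, $mode, $mtime, $ctime, $name) = split(/ /, $line, 7);
        sysopen(my $fh, $name, O_WRONLY | O_CREAT | O_EXCL) or die "$name: $!";
        close($fh);
        chown($uid, $gid, $name);
        chmod(oct($mode), $name);
        utime($mtime, $mtime, $name);     # ctime can't be set directly, so it is ignored
    } elsif ($line =~ /^D /) {            # directory
        my (undef, $uid, $gid, $mode, $mtime, $ctime, $name) = split(/ /, $line, 7);
        mkdir($name) or die "$name: $!";
        chown($uid, $gid, $name);
        chmod(oct($mode), $name);
        # directory mtimes would need a second pass, since creating
        # entries inside a directory updates its mtime again
    } else {                              # hard link back into the cpool
        my ($target, $name) = split(/ /, $line, 2);
        link($target, $name) or die "link $target -> $name: $!";
    }
}
close(INFILE);
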

Second, for both the top level log and backupInfo files, rather than
creating a tar file, it might be better just to create a list that
could then be fed to rsync, tar, cpio, cp, etc. Similarly for the list
of "error" files (i.e., non-zero, non-directory files below the share
level that are not linked to the pool). In particular, you can feed the
list to rsync for a potentially significant speed improvement if you
are doing an "incremental" type backup of your BackupPC archive.
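
For example (list file name, TopDir, and destination are made up for
illustration), a list of TopDir-relative paths can be consumed directly
by the standard tools:

rsync -a --files-from=toplevel.list /var/lib/backuppc/ remotehost:/mnt/archive/backuppc/
tar -cf toplevel.tar -C /var/lib/backuppc -T toplevel.list
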

(Note also that adding the rsync md4sums to the ipool tree, which I
hinted at in my first post, would allow one to do incrementals by
comparing the stored checksum for each inode to see whether the pool
file has changed, but that is for a later date.)
   
Any thoughts?
