Subject: Re: [BackupPC-users] How does BackupPC_tarPCCopy getting around hard link issue?
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Fri, 21 Jan 2011 15:44:45 -0500
Jeffrey J. Kosowsky wrote at about 12:07:36 -0500 on Friday, January 21, 2011:
 > AHHHH OK - so no magic.
 > I just coded up a new way that should in general be significantly
 > faster.
 > 
 > Basically, I create a new inode-centered pool that I call 'ipool' that
 > is a decimal-based tree (rather than the hexadecimal-based pool/cpool
 > trees). You can set how many levels you want.  Then I recurse through
 > the pool/cpool and for every entry, I store a corresponding file in
 > the ipool based on the pool/cpool *inode* number. The file's contents
 > are set to the *name* of the pool/cpool file (actually the path
 > relative to TopDir). Note that the ipool is indexed by the least
 > significant digits of the inode number to ensure more uniform
 > distribution across the tree.
 > 
 > Then you can recurse through the pc tree and quickly look up each
 > inode to find its pool/cpool location via my ipool construct.
 > 
 > I haven't benchmarked, but I have to believe that this will in general
 > be significantly faster than (re)computing the partial file md5sum for
 > each file in the pc tree (though caching does help of course). Also my
 > method requires constant memory so it scales nicely.
 > 
 > Finally, I'm not sure if you implement it in BackupPC_tarPCCopy, but
 > if for some reason a pc tree entry (other than backupInfo) does not
 > have its inode in the ipool then I flag it and optionally correct it
 > by linking the file back into the pool/cpool. By the way, this alone
 > could be used as a much faster approach to solving Robin's question
 > earlier where she needed to check and fix a large pc tree where a
 > number of files had nlinks >1 but *none* of them were in the
 > pool/cpool.
 > 
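
To make the quoted ipool idea concrete, here is a rough, untested
sketch of what I mean -- this is not my actual code: the helper names
(ipool_path, pool_path_for), the ipool location, the 3-level default,
and the $TopDir value are all just for illustration, and error
handling is minimal.

    #!/usr/bin/perl
    # Illustrative sketch only -- not the actual ipool code.
    use strict;
    use warnings;
    use File::Find;
    use File::Path qw(make_path);

    my $TopDir   = "/var/lib/backuppc";  # assumption: adjust for your install
    my $IpoolDir = "$TopDir/ipool";      # hypothetical ipool location
    my $Levels   = 3;                    # decimal levels in the ipool tree

    # Map an inode number to its ipool path, keyed on the *least*
    # significant decimal digits so entries spread evenly across the tree.
    sub ipool_path {
        my ($inode) = @_;
        my @digits =
            reverse split //, sprintf("%0${Levels}d", $inode % 10**$Levels);
        return join("/", $IpoolDir, @digits, $inode);
    }

    # Build: walk pool/ and cpool/; for every pool file, store its path
    # relative to $TopDir in the ipool entry named after its inode.
    for my $pool (qw(pool cpool)) {
        next unless -d "$TopDir/$pool";
        find(sub {
            return unless -f $_;
            my $inode = (lstat($_))[1];
            my $entry = ipool_path($inode);
            my ($dir) = $entry =~ m{^(.*)/};
            make_path($dir) unless -d $dir;
            my $rel = substr($File::Find::name, length($TopDir) + 1);
            open my $fh, '>', $entry or die "can't write $entry: $!";
            print $fh "$rel\n";
            close $fh;
        }, "$TopDir/$pool");
    }

    # Lookup: while walking the pc tree, map a file's inode straight to
    # its pool/cpool path -- no partial-file md5sum, no in-memory cache.
    sub pool_path_for {
        my ($file) = @_;
        my $inode = (lstat($file))[1];
        my $entry = ipool_path($inode);
        return undef unless -f $entry;   # inode not in the pool at all
        open my $fh, '<', $entry or die "can't read $entry: $!";
        chomp(my $rel = <$fh>);
        close $fh;
        return $rel;                     # path relative to $TopDir
    }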

My plan is to "borrow" some of the code from BackupPC_tarPCCopy but
only use it to create the directory tree, the zero-length files, and
the hard links.

1. All files below the backup number level that are either hard-linked
   to the pool, zero length, or directories are fed to the tar
   subroutines in BackupPC_tarPCCopy to create a single tar file of
   hard links, zero-length files, and directory entries. (This may
   also let me streamline some of the code in those subroutines,
   since, for example, the hard-link targets are never long links and
   the sizes never exceed the limits of the old tar header format.)

2. If there is a non-zero-length regular file in the pc tree below the
   share level that is not linked into the pool but is either called
   'attrib' or has an f-mangled name, then it is a valid BackupPC file
   and should be linked back into the pool. The program makes that
   fix by default, but you can optionally choose not to make such
   fixes. If the fix is made, those files are backed up as in #1
   above; if not, they generate exceptions for optional backup as in
   #4 below.

3. The /pc/<host>/<num>/backupInfo file and any files at the backup
   number level or higher should not be hard linked to the pool and
   can be backed up with regular binary tar. For these, I generate a
   list that I then pipe to binary tar, which is faster than Perl tar
   (see the example after the output list below). In general, though,
   there are relatively few of these files, so I could just use Perl
   tar without any real slowdown.

4. Any other file goes onto a third 'error' list, since it really
   either shouldn't be there at all or should be fixed as in #1
   above. This error list should be reviewed and can then optionally
   (and manually) be piped to tar if you decide to back those files
   up. A rough sketch of this whole classification follows.
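
As a rough sketch of the classification in steps 1-4 (again
illustrative, not the actual code -- classify_entry, $FixLinks,
relink_into_pool and the $in_backup flag are made up for this example,
and pool_path_for is the ipool lookup sketched earlier):

    # Hypothetical classifier for one pc-tree entry; not the real code.
    # $in_backup is true if the entry lives inside a pc/<host>/<num>/
    # tree -- the caller (e.g. a File::Find walk of $TopDir/pc) is
    # expected to work that out.  relink_into_pool() is a made-up
    # fix-up helper that re-creates the missing pool hard link.
    my $FixLinks = 1;            # optionally relink valid files into the pool
    my (@stream_A, @list_B, @list_C);

    sub classify_entry {
        my ($path, $basename, $in_backup) = @_;
        my @st = lstat($path);
        return unless @st;                    # entry vanished underneath us
        if (-d _ || (-f _ && $st[7] == 0)) {
            push @stream_A, $path;            # dir or zero-length file (#1)
        }
        elsif ($basename eq 'backupInfo' || !$in_backup) {
            push @list_B, $path;              # backupInfo, LOG, etc. (#3)
        }
        elsif (defined pool_path_for($path)) {
            push @stream_A, $path;            # already a pool hard link (#1)
        }
        elsif ($basename eq 'attrib' || $basename =~ /^f/) {
            # 'attrib' or f-mangled name: valid BackupPC file that has
            # lost its pool link (#2)
            if ($FixLinks && relink_into_pool($path)) {
                push @stream_A, $path;        # fixed, so back it up as in #1
            } else {
                push @list_C, $path;          # flag as an exception
            }
        }
        else {
            push @list_C, $path;              # error list (#4)
        }
    }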

So basically you end up with 3 outputs:
A. Tar file of the hard links, directory entries, and zero-length
   files in the pc tree (the tar file is generated internally based on
   Craig's routines)
B. Standard tar file of the valid top-level, non-hard-linked BackupPC
   log and info files
C. Error list of files not backed up by A & B above that you can then
   choose to feed to tar if you still want to back them up. If you
   allow the program to fix missing hard links, then this will *only*
   consist of non-BackupPC-generated files, so there is a good chance
   you don't even want to back these up.
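
For output B (and, if you decide to keep them, the files on list C),
one way to do the "pipe to binary tar" step is to write the list out
one path per line and hand it to GNU tar's -T/--files-from option.
For example (file names here are made up):

    # Sketch: hand the binary-tar list to GNU tar via -T/--files-from.
    # (Use --null and a NUL-separated list instead if any path could
    # contain a newline.)
    open my $fh, '>', "list_B.txt" or die "can't write list_B.txt: $!";
    print $fh "$_\n" for @list_B;
    close $fh;
    system("tar", "-cf", "pc-logs.tar", "-T", "list_B.txt") == 0
        or die "tar failed: $?";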

Overall, I would think one would get a significant speed-up over
BackupPC_tarPCCopy for the following reasons:
1. Pool locations are looked up by inode rather than recalculated
   manually via partial-file md5sums, plus there is no need for a
   cache, which for large backups could slow down your system if there
   is not enough memory. I believe this is pretty significant.
2. Files that *should* be hard linked to the pool are detected and
   relinked into the pool, which both fixes the error and speeds up
   backups, since you now just need to back up the link and not the
   data.
3. Non-zero-length data files are backed up using binary tar, which is
   supposed to be quite a bit faster than Perl-based tar.
4. The Perl tar code can be simplified/streamlined, since we know we
   have just one of three cases (directory, hard link, zero-length
   file) and never have large file sizes or long link-name targets to
   deal with (though the file name itself may be long). A minimal
   header sketch along these lines follows below.
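
To illustrate #4: since every entry in that tar stream is a directory,
a hard link, or a zero-length file, a plain ustar header is enough for
the size and linkname fields (no extensions needed there), though long
*file* names would still need the usual long-name handling. A minimal,
untested sketch of such a header writer (my own illustration, not
Craig's routines):

    # Write one 512-byte ustar header; $type is '5' (directory), '1'
    # (hard link) or '0' (zero-length regular file), so the size field
    # is always zero and the link target always fits in 100 bytes.
    # Long file names (>100 chars) are not handled in this sketch.
    sub tar_header {
        my ($name, $type, $linkname, $mode, $mtime) = @_;
        my $hdr = pack(
            "a100 a8 a8 a8 a12 a12 a8 a1 a100 a6 a2 a32 a32 a8 a8 a155 a12",
            $name,                        # directory names end with '/'
            sprintf("%07o", $mode),
            sprintf("%07o", 0),           # uid
            sprintf("%07o", 0),           # gid
            sprintf("%011o", 0),          # size: always 0 for these cases
            sprintf("%011o", $mtime),
            ' ' x 8,                      # checksum placeholder
            $type,
            $linkname // '',              # pool path for hard links
            "ustar", "00",                # magic + version
            'root', 'root',               # uname / gname
            sprintf("%07o", 0),           # devmajor
            sprintf("%07o", 0),           # devminor
            '', '');                      # prefix + padding
        my $sum = 0;
        $sum += $_ for unpack("C*", $hdr);
        substr($hdr, 148, 8) = sprintf("%06o\0 ", $sum);  # real checksum
        return $hdr;
    }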

Any thoughts?
        
