Jeffrey J. Kosowsky wrote at about 12:07:36 -0500 on Friday, January 21, 2011:
> AHHHH OK - so no magic.
> I just coded up a new way that should in general be significantly
> faster.
>
> Basically, I create a new inode-centered pool that I call 'ipool' that
> is a decimal-based tree (rather than the hexadecimal-based pool/cpool
> trees). You can set how many levels you want. Then I recurse through
> the pool/cpool and for every entry, I store a corresponding file in
> the ipool based on the pool/cpool *inode* number. The file's contents
> are set to the *name* of the pool/cpool file (actually the path
> relative to TopDir). Note that the ipool is indexed by the least
> significant digits of the inode number to ensure more uniform
> distribution across the tree.
>
> Then you can recurse through the pc tree and quickly look up each
> inode to find its pool/cpool location via my ipool construct.
>
> I haven't benchmarked, but I have to believe that this will in general
> be significantly faster than (re)computing the partial file md5sum for
> each file in the pc tree (though caching does help of course). Also my
> method requires constant memory so it scales nicely.
>
> Finally, I'm not sure if you implement it in BackupPC_tarPCCopy, but
> if for some reason a pc tree entry (other than backupInfo) does not
> have its inode in the ipool then I flag it and optionally correct it
> by linking the file back into the pool/cpool. By the way, this alone
> could be used as a much faster approach to solving Robin's question
> earlier where she needed to check and fix a large pc tree where a
> number of files had nlinks >1 but *none* of them were in the
> pool/cpool.
>
My plan is to "borrow" some of the code from BackupPC_tarPCCopy but
only use it to create the directory tree, the zero length files and
the hard links.
1. All files below the backup number level that are either hard linked
to the pool, are zero length, or are directories are fed to the
tar subroutines in BackupPC_tarPCCopy to create a single tar file
   of hard links, zero length files, and directory entries. (This may
   also let me streamline some of the code in those subroutines,
   since, for example, the hard-link targets are never long links and
   the sizes never exceed the old tar header size limit.)
2. If there is a non-zero-length regular file in the pc tree below the
   share level that is not linked into the pool but either is called
   'attrib' or has an f-mangled name, then it is a valid BackupPC file
   and should be linked back into the pool. The program makes that
   fix by default, but one can opt out of such fixes. If the fix is
   made, the files are backed up as in #1 above; if not, they
   generate exceptions for optional backup as in #4 below.
3. The /pc/<host>/<num>/backupInfo file and any files at the backup
number or higher should not be hard linked to the pool and can be
backed up with regular binary tar. For these, I generate a list
that I then pipe to binary tar (which is faster than perl tar).
   In general, though, there are (relatively) not too many of these
   files, so I could just use perl tar without any real slowdown.
4. Any other file goes onto a third 'error' list, since such files
   either shouldn't be there at all or should have been fixed as in
   #2 above. This error list should be reviewed and can then
   optionally (and manually) be piped to tar if you decide to back
   those files up.
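The four-way split above could be sketched roughly like this. It is a simplification (it glosses over the share-level vs. backup-number-level distinction), `is_pool_linked` stands in for the ipool lookup, and all names are illustrative rather than the real program's:

```python
import os
import re

# BackupPC mangles stored file names with a leading "f"
FMANGLE = re.compile(r"^f")

def classify(path, below_backup_level, is_pool_linked):
    st = os.lstat(path)
    name = os.path.basename(path)
    if not below_backup_level:
        return "binary-tar"     # output B: backupInfo, logs, etc. (#3)
    if os.path.isdir(path) or st.st_size == 0 or is_pool_linked(path):
        return "perl-tar"       # output A: dirs, zero-length, links (#1)
    if name == "attrib" or FMANGLE.match(name):
        return "fix-and-link"   # relink into the pool, then output A (#2)
    return "error-list"         # output C: review, back up manually (#4)
```

Each pc-tree entry then lands in exactly one of the three outputs (or gets fixed and folded into output A).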
So basically you end up with 3 outputs:
A. Tar file of the hard links, directory entries, and zero length
files in the pc tree (the tar file is generated internally based on
Craig's routines)
B. Standard tar file of the valid top-level, non-hard-linked BackupPC
   log and info files
C. Error list of files not backed up by A&B above that you can then
choose to feed to tar if you still want to back them up. If you
allow the program to fix missing hard links, then this will *only*
consist of non-BackupPC generated files so there is a good chance
you don't even want to back these up.
Overall, I would think one would get a significant speed-up over
BackupPC_tarPCCopy for the following reasons:
1. Inodes are looked up rather than matched by (re)computing md5sums,
   and there is no need for a cache, which for large backups could
   slow down your system if memory is short. I believe this is
   pretty significant.
2. Files that *should* be hard linked to the pool are corrected and
hard linked to the pool which both fixes the error and speeds up
backups since you now just need to backup the link and not the data
3. Non-zero-length data files are backed up using binary tar, which is
   supposed to be quite a bit faster than perl-based tar
4. The perl tar code can be simplified/streamlined since we know we
have just one of 3 cases (Directory, Hard Link, Zero length file)
and we never have large file sizes or large link name targets to
deal with (though the file name itself may be long)
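For reason 3, one hypothetical way to feed a generated file list to binary (GNU) tar from a driver script uses tar's `--null --files-from -` options to read NUL-separated names from stdin; the function name and calling convention here are illustrative:

```python
import subprocess

def binary_tar(paths, out_tar):
    # GNU tar reads NUL-separated file names from stdin via
    # --null --files-from - ; NUL separation keeps odd file
    # names (spaces, newlines) safe.
    subprocess.run(
        ["tar", "--create", "--file", out_tar,
         "--null", "--files-from", "-"],
        input=b"\0".join(p.encode() for p in paths),
        check=True,
    )
```

The same piping works for the error list (output C) if you decide those files are worth keeping.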
Any thoughts?
_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/