Veritas-bu

[Veritas-bu] Splitting large jobs

2005-03-12 00:11:53
Subject: [Veritas-bu] Splitting large jobs
From: michael AT mlbarrow DOT com (Michael L. Barrow)
Date: Fri, 11 Mar 2005 21:11:53 -0800
This is a multi-part message in MIME format.
--------------090701060905080600060304
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hal Skelly wrote:
> I started work on this script a while ago and haven't released it into
> the wild yet  (I believe this was published in Sysadmin magazine a while
> ago).  Thus I'm not going to guarantee it.  BUT, if you are familiar

Hey -- I know your name. I got your script from Curtis Preston -- he's 
got it up on the StorageMountain.com website at 
<http://www.storagemountain.com/free-backup-software5.html>.

I tried using it but found that it was a bit slow and had problems 
dealing with filenames that contain commas (since it uses the comma 
character as a delimiter).

I did some fixing to the script and gave it to Curtis to put up, but it 
looks like he didn't do that.

See attached README and improved version of the script with a couple of 
added features.

I use it on a daily basis for a 1.5TB Windows fileserver. Let me know if 
you have any problems with the new version and I'll be glad to do my 
best to fix them.

Enjoy!

-- 
Michael L. Barrow
<michael AT mlbarrow DOT com>

--------------090701060905080600060304
Content-Type: text/plain;
 name="README"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="README"

nbusplit.pl
Michael L. Barrow (michael AT mlbarrow DOT com)
2003-11-08

This Perl script splits a directory or filesystem into sane-sized pieces
for backing up with Veritas NetBackup. It's based on chopit.pl, which is
available on storagemountain.com, W. Curtis Preston's storage information
website.

The original program allowed the user to specify the total number of streams
and it would split the filesystem into that number of more or less equal
sized pieces. For our needs, we wanted to be able to specify the size of a
stream and have the program create however many streams of that size it
needed, so I went about modifying chopit.pl.
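
As a rough illustration of the size-based mode described above, here is a
minimal first-fit sketch. It is not the code from nbusplit.pl: the names
size_to_bytes and first_fit and the sample directory sizes are hypothetical,
and unlike the real script it does not descend into a directory that is too
big for one stream.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Convert a size such as "4g" into bytes (k, m, g are powers of 1024).
# Returns undef (in scalar context) if the text is not a valid size.
sub size_to_bytes {
    my ($text) = @_;
    my $lower = lc $text;
    my ($num, $unit) = $lower =~ /^(\d+)([kmg]?)$/ or return;
    my %mult = ('' => 1, k => 1024, m => 1024**2, g => 1024**3);
    return $num * $mult{$unit};
}

# Put each [name, size] item into the first stream that still has room;
# open a new stream when none does (an oversized item gets its own stream).
sub first_fit {
    my ($limit, @items) = @_;
    my @streams;
    ITEM: for my $item (@items) {
        my ($name, $size) = @$item;
        for my $s (@streams) {
            if ($s->{size} + $size <= $limit) {
                push @{ $s->{names} }, $name;
                $s->{size} += $size;
                next ITEM;
            }
        }
        push @streams, { size => $size, names => [$name] };
    }
    return @streams;
}

# Hypothetical directory sizes, for illustration only.
my $limit   = size_to_bytes('4g');
my @streams = first_fit($limit,
    ['c:/users',   3_000_000_000],
    ['c:/windows', 2_500_000_000],
    ['c:/temp',    1_000_000_000],
);

print scalar(@streams), " stream(s)\n";
for my $s (@streams) {
    print join(', ', @{ $s->{names} }), " ($s->{size} bytes)\n";
}
```

The number of streams falls out of the data: nothing asks for a count, the
limit alone decides when a new stream has to be opened.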

Other modifications that I made include:
        - Fixing the buildstreams() function, making it up to 113 times
          faster than the original chopit.pl script
        - Allowing the user to give several pathnames all at once to
          include in a single includes file

This script has become invaluable in backing up large filesystems and
directories. I hope it's useful for others.

Here's a sample invocation that shows me asking the script to traverse
the C:\ directory to build streams of up to 4GB in size:

C:\>nbusplit.pl -f c:\temp\inc.txt -s 4g c:\
Splitting filesystem into Netbackup streams
Filesystem: c:\
Determining directory sizes..../
Total size to divide up is 4443014315

Building streams...|
Completed in 39 second(s)

* end *
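
The output file named with -f contains a comment header, one NEW_STREAM line
per stream followed by that stream's pathnames, and a closing #EOF# line. A
consumer could check for that marker before trusting the file. The following
is only a sketch under those assumptions: parse_include is a hypothetical
helper and the sample text is made up, not real output.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Parse include-file text into streams (array refs of pathnames) and
# report whether the closing "#EOF#" marker was seen.
sub parse_include {
    my ($text) = @_;
    my (@streams, $complete);
    for my $line (split /\n/, $text) {
        if ($line eq '#EOF#') { $complete = 1; next; }
        next if $line =~ /^\s*$/ || $line =~ /^#/;   # blank lines and comments
        if ($line eq 'NEW_STREAM') { push @streams, []; next; }
        push @streams, [] unless @streams;   # tolerate paths before a marker
        push @{ $streams[-1] }, $line;
    }
    return ($complete ? 1 : 0, @streams);
}

# Made-up sample in the layout described above.
my $sample = join "\n",
    '# nbusplit.pl split c:/ into 2 streams of 4294967296 bytes',
    'NEW_STREAM', '', 'c:\\temp', 'c:\\users',
    'NEW_STREAM', 'c:\\windows',
    '', '#EOF#', '';

my ($complete, @streams) = parse_include($sample);
print "complete: $complete, streams: ", scalar(@streams), "\n";
```

A file that was cut off part-way through a run has no #EOF# line, so
$complete comes back as 0 and the caller can refuse to use it.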

--------------090701060905080600060304
Content-Type: text/plain;
 name="nbusplit.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="nbusplit.pl"

#!/usr/bin/env perl

# nbusplit.pl
# by Michael Barrow (michael AT mlbarrow DOT com)
# nbusplit.pl version 1.0
# This is a modified version of chopit.pl, by Harold F. Skelly, Jr.
# Copyright 2001, Harold F. Skelly jr.

use strict;
use Getopt::Std;
use File::Spec;

# Enable auto-flush
$|=1;

my ($streamtotal, $chunksize, %dirs, $streamct, $starttime, @STREAMS, $debug, 
$verbose, $path, $filemode);

my ($totaltime, $duration) = (0, 0);

our ($opt_c, $opt_s, $opt_v, $opt_f, $opt_a, $opt_D);

# Figure out what the path separator is on this platform
my ($path_sep);
$path_sep = File::Spec->catfile('FOO', 'FOO');
$path_sep =~ s|FOO||g;

if (!getopts('c:s:f:vaD')) {
        print "Invalid option specified.\n";
        usage();
}

if ((!$opt_c && !$opt_s) || !$opt_f) {
        print "You must specify chunksize or streamcount and output file.\n";
        usage();
}

if ($opt_c && $opt_s) {
        print "You can't specify streamtotal *and* chunksize.\n";
        usage();
}

if ($opt_c) {
        $streamtotal = $opt_c;
        if ($streamtotal < 2) {
                print "You must specify at least 2 streams.\n";
                usage();
        }
}

if ($opt_s) {
        $chunksize = size2int($opt_s);
        if (!defined($chunksize)) {
                print "Chunksize must be a number, or a number followed by k, m, or g\n";
                usage();
        }
}

if ($opt_D) {
        $debug = 1;
} else {
        $debug = 0;
}

if ($opt_v) {
        $verbose = 1;
} else {
        $verbose = 0;
}

$filemode = ">";
if ($opt_a) {
        $filemode = ">>";
}

if ($#ARGV < 0) {
        print "You must specify one or more pathnames to be split.\n";
        usage();
}

$totaltime = 0;

open(RSLTS, $filemode, $opt_f) || die "Unable to open for writing: $opt_f: $!\n";

# Make it a binary file so that we don't get Windows line endings (this is a
# NOOP on Unix)
binmode(RSLTS);

print "Path sep is ${path_sep}\n" if ($debug); 

foreach (@ARGV) {
        $path = $_;

        $starttime = time();
        %dirs = ();
        @STREAMS = ();
        $streamct = 1;
        print "Splitting filesystem into Netbackup streams\n";
        print "Filesystem: $path\n";

        # convert any \ to /, as required
        if ($path_sep eq "\\") {
                $path =~ s/\\/\//g;
        }
        printf("Determining directory sizes....");
        summarize($path);

        print "\nTotal size to divide up is $dirs{$path}\n\n";

        # Dynamically set the chunksize if the user requested a total number
        # of streams
        if ($streamtotal) {
                $chunksize = $dirs{$path} / $streamtotal;
        }

        printf("Building streams...");
        buildstreams($path);

        printstreams();

        $duration = time() - $starttime;
        $totaltime += $duration;
        
        printf("\nCompleted in %d second(s)\n", $duration);
}

# Tack on the EOF marker that can be used by scripts to test if this file is
# complete
printf(RSLTS "\n#EOF#\n");
close(RSLTS);

if ($#ARGV > 0) {
        printf("\nEntire execution took %d second(s)\n", $totaltime);
}

exit;


sub summarize {
        # Subroutine to collect the sizes of the directories under a certain
        # path
        # Utilizes global variables: %dirs

        # arguments:
        my $dir = shift;   # directory to process

        # variables
        my @entries; # list of all files in $dir
        my $file;    # loop index for @entries
        my $re;      # regexp to exclude certain files
        my $dir_c;   # directory name used to build path components

        $dirs{$dir} = -s "$dir";
        
        if (opendir(D, $dir)) {
                # Collect a list of all of the files in the directory,
                # excluding '.', '..', and '.snapshot'
                $re = '(^\.$)|(^\.\.$)|(^\.snapshot$)|(^~snapshot)';
                @entries = grep(! /$re/, readdir(D));
                closedir(D);

                $dir_c = $dir;
                # If the specified directory ends in a slash, get rid of the
                # terminating slash, because later code will cause double
                # slashes in a row
                $dir_c =~ s/\/$//;

                # Now check each of the files in the directory
                foreach $file (@entries) {
                        next if (-l "${dir_c}/${file}"); # ignore symlinks

                        if (-d "${dir_c}/${file}") {
                                summarize("${dir_c}/${file}");
                                $dirs{$dir}+= $dirs{"${dir_c}/${file}"};
                        } else {
                                $dirs{$dir}+= -s "${dir_c}/${file}"
                        }
                        spinner();
                }
        }
        print "$dir ($dir_c) [$dirs{$dir}]\n" if ($debug);
}


sub printstreams {
        # print out streams and chunk sizes
        my ($k, $grandsum, $sz);

        print RSLTS "# nbusplit.pl split $path into $streamct streams of $chunksize bytes\n";

        if ($verbose) {print "Created $streamct streams of $chunksize bytes\n";}
        for ($k=0; $k<$streamct; $k++)   {
                $sz = $STREAMS[$k]{size};
                $grandsum+=$sz;
                $STREAMS[$k]{list} =~ s/\0/\n/g;
                $STREAMS[$k]{list} =~ s|/|${path_sep}|g;
                
                print RSLTS "NEW_STREAM\n";
                print "NEW_STREAM\n" if ($verbose);
                # Use print, not printf, so that a % in a filename is not
                # treated as a format directive
                print RSLTS "$STREAMS[$k]{list}\n";
                print "$STREAMS[$k]{list}\n" if ($verbose);
                print "\tSIZE=$sz\n" if ($verbose);
        }

        print "\nThe Grand total of all streams is $grandsum bytes\n" if ($verbose);
}



sub buildstreams {
        #  take a directory and return set of streams composed of 
        #  various subdirs. and files 

        #  look at each directory from the top down.  If the directory size 
        #  is LESS than chunksize, then include this (and thus all subdirs 
        #  and files) in the backup stream.  If it is too large though, 
        #  then loop through all of the subdirs. down one level.

        my $indir = $_[0];
        my ($i, $elem, $sz);
        my @allelems;
        my $streamlist;
        my $indir_c;
        
        print "buildstreams: streamcount=$streamct; looking at $indir\n" if ($debug);
        spinner();

        # check size of current dirname to see if it will fit in any existing 
        # stream and add it if so.
        for ($i=0; $i<$streamct; $i++) {
                if (!defined($STREAMS[$i]{size})) {
                        $STREAMS[$i]{size} = 0;
                        $STREAMS[$i]{list} = '';
                }
                if ( ($STREAMS[$i]{size} + $dirs{$indir}) <= $chunksize ) {
                        $STREAMS[$i]{list} .= "\0" . $indir;
                        $STREAMS[$i]{size} += $dirs{$indir};
                        return;
                }
        }
        
        # We didn't find an existing stream large enough so either create
        # a new stream (if it will fit in one) or descend to new subdirs.
        if ( $dirs{$indir} <= $chunksize ) {
                $STREAMS[$streamct]{list} = $indir;
                $STREAMS[$streamct]{size} = $dirs{$indir};
                $streamct++;
                return;
        } else {
            #go down one level using opendir and readdir till done
            opendir THISDIR, $indir or die "couldn't open $indir to recurse\n";
            # get rid of . and .. and make all full path names
            $indir_c = $indir;
            $indir_c =~ s/\/$//;
            @allelems = map("$indir_c/$_", grep(!/^\.\.?$/, readdir(THISDIR)));
            closedir THISDIR;

            # loop over the entries twice: first recurse into the subdirs,
            # then place the files
            foreach $elem (@allelems) {
                next if (-f $elem);
                # else we recurse on each subdir
                if  ( -d $elem ) {buildstreams($elem);}
            }
            ELEM:
            foreach $elem (@allelems) {
                next if (-d $elem);    #we've already streamified dirs, right?
                spinner();
                $sz = -s $elem;
                print "stuff: $elem ($sz)\n" if ($debug);
                if  ( -f $elem  || -l $elem) {
                        # add to a stream if it will fit, else build a new
                        # stream
                        for ($i=0; $i < $streamct; $i++) {
                            if (($STREAMS[$i]{size} + $sz) <= $chunksize) {
                                    print "addstream $i: $elem ($sz) $STREAMS[$i]{size}\n" if ($debug);
                                    $STREAMS[$i]{list} .= "\0" . $elem;
                                    $STREAMS[$i]{size} += $sz;
                                    next ELEM;    # we've placed it in a stream
                            }
                        }
                        if ( $sz > $chunksize) {
                                print "WARNING: Single file $elem exceeds $chunksize bytes\n" if ($verbose);
                        }
                        $STREAMS[$streamct]{list} = $elem;
                        $STREAMS[$streamct]{size} = $sz;
                        print "newstream $streamct: $elem ($sz) $STREAMS[$streamct]{size}\n" if ($debug);
                        $streamct++;
                        next ELEM;
                }
            }
        }
}

sub usage {
        # Prints the usage message and exits the script
        print "\n\nUsage:\n  nbusplit.pl [-v] [-a] [-D] -f <outfile> -c <stream count>|-s <stream size> <directory>...\n";
        print "Where: <stream count> is the number of streams to divide the filesystem into,\n";
        print "<stream size> is the maximum size of a stream, <directory> is the pathname\n";
        print "that must be split, and <outfile> is the filename to store the include list\n";
        print "Option flags: -v for verbose output, -D for debug output, -a for appending\n";
        print "to outfile instead of overwriting it.\n";
        exit(1);
}

sub size2int {
    # converts supplied argument to an integer, performing multiplication
    # as necessary if caller included k, m, g in the argument.

    # arguments:
    my $x = shift;

    my $multiplier = 1;

    $x = lc($x);
    if ($x !~ /^[0-9]+[kmg]{0,1}$/) {
        return(undef);
    }
    if ($x =~ /k/) {
        $multiplier = 1024;
    }
    elsif ($x =~ /m/) {
        $multiplier = 1024 ** 2;
    }
    elsif ($x =~ /g/) {
        $multiplier = 1024 ** 3;
    }
    return (int($x) * $multiplier);
}

sub spinner {
        return if ($verbose || $debug);
        $| = 1;
        my $spins = "|/-\\";
        our ($y);
        if (!defined($y)) {
                $y = 0;
        } else {
                $y++;
                $y = 0 if ($y > 3);
        }
        print substr($spins, $y, 1), "\b"
}

--------------090701060905080600060304--
