BackupPC-users

Re: [BackupPC-users] Extracting Checksums from Backuppc Quickly?

2012-03-01 11:32:46
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Thu, 01 Mar 2012 11:30:37 -0500
Kyle Anderson wrote at about 11:10:26 -0500 on Thursday, March 1, 2012:
 > I have BackupPC_digestVerify.pl, but I don't understand if this can do
 > what I'm asking.
 > 
 > This tool looks like it adds and verifies the sums like you say, but can
 > it actually tell me what the sum is from a known filename?
Well, there is no single sum. The tool verifies both the md4 block checksums
and the uncompressed full-file md4 checksum. I'm not sure why you would want
to print them out for hundreds of thousands of files or more, since they
are pretty meaningless on their own and their format is specific to rsync.
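
To make the "no single sum" point concrete, here is a minimal sketch of how the data that jdigestAdd (in the script below) appends to a file could be split back into its parts. It relies only on the layout visible in that routine: one marker byte, 20 bytes of block digest per block, the file digest, and a 16-byte pack("VVVV", ...) metadata trailer ending in the magic number. The name parse_digest_blob and the sample bytes are hypothetical, and the 32-byte file digest in the demo is just a placeholder length.

```perl
use strict;
use warnings;

use constant DIGEST_MAGIC    => 0x5fe3c289;  # magic number packed by jdigestAdd
use constant BLOCK_ENTRY_LEN => 20;          # per length($blockDigest) / 20
use constant META_LEN        => 16;          # four 32-bit little-endian values

# Hypothetical helper: split the appended blob into metadata,
# per-block digests and the full-file digest.
sub parse_digest_blob {
    my ($blob) = @_;
    return undef if length($blob) < 1 + META_LEN;

    my ($blockSize, $seed, $nBlocks, $magic) =
        unpack("VVVV", substr($blob, -META_LEN()));
    return undef if $magic != DIGEST_MAGIC;

    # Drop the leading marker byte and the trailing metadata.
    my $body     = substr($blob, 1, length($blob) - 1 - META_LEN);
    my $blockLen = $nBlocks * BLOCK_ENTRY_LEN;
    return undef if $blockLen > length($body);

    return {
        blockSize    => $blockSize,
        seed         => $seed,
        nBlocks      => $nBlocks,
        blockDigests => [ unpack("(a20)*", substr($body, 0, $blockLen)) ],
        fileDigest   => unpack("H*", substr($body, $blockLen)),
    };
}

# Synthetic blob with placeholder digest bytes (not real checksums).
my $blob = chr(0xb3) . ("B" x 40) . ("F" x 32)
         . pack("VVVV", 2048, 32761, 2, DIGEST_MAGIC);
my $info = parse_digest_blob($blob);
printf "blockSize=%d seed=%d blocks=%d fileDigest=%s\n",
    @{$info}{qw(blockSize seed nBlocks)}, $info->{fileDigest};
```

Even with the blob decoded, the values depend on the block size and the checksum seed, which is why they are of little use outside rsync.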

 >  Also it looks like the filename it wants might be the cpool
 > filename, is that right?  What kind of filenames is it expecting?

Not sure what version you have, but it's pretty clear from the usage
message that depending on the flag, you can verify a cpool directory
tree, an individual cpool file or a pc directory/file.
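
Whichever flag is used, the non-quiet output is the same three columns described in the usage message (inode number, return code, file name). As a rough sketch, a wrapper could decode those lines as follows. The function name parse_result_line and the sample line are hypothetical; the code meanings are copied from the usage text.

```perl
use strict;
use warnings;

# Return-code meanings as listed in the script's usage message.
my %CODE_MEANING = (
    0 => 'digest added',
    1 => 'digest ok',
    2 => 'digest invalid',
    3 => 'no digest',
);

# Hypothetical helper: parse one "inode code name" output line.
# Returns a hash ref, or nothing if the line does not match.
sub parse_result_line {
    my ($line) = @_;
    chomp $line;
    my ($inode, $code, $name) = $line =~ /^(\d+) (-?\d+) (.*)$/
        or return;
    my $meaning = $CODE_MEANING{$code}
        // ($code < 0 ? 'other error (see source)' : 'unknown');
    return { inode => $inode, code => $code, name => $name,
             meaning => $meaning };
}

# Made-up sample line, for illustration only.
my $r = parse_result_line("123456 2 somehost/0/f%2fetc/fpasswd\n");
printf "%s: %s (inode %d)\n", $r->{name}, $r->{meaning}, $r->{inode};
```

The file name is taken as everything after the second space, so names containing spaces survive intact.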

Here is the latest version:


#!/usr/bin/perl
#========================================================================
#
# BackupPC_digestVerify.pl
#                       
#
# DESCRIPTION

#   Check contents of cpool and/or pc tree entries (or the entire
#   tree) against the stored rsync block and file checksum digests,
#   including the 2048-byte block checksums (Adler32 + md4) and the
#   full file md4sum.

#   Optionally *fix* invalid digests (using the -f flag).
#   Optionally *add* digests to compressed files that don't have a digest.

#
# AUTHOR
#   Jeff Kosowsky (plus modified version of Craig Barratt's digestAdd code)
#
# COPYRIGHT
#   Copyright (C) 2010, 2011  Jeff Kosowsky
#   Copyright (C) 2001-2009  Craig Barratt (digestAdd code)
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#
#========================================================================
#
# Version 0.3, released February 2011
#
#========================================================================

use strict;
use warnings;
use Getopt::Std;

use lib "/usr/share/BackupPC/lib";
use BackupPC::Xfer::RsyncDigest;
use BackupPC::Lib;
use File::Find;

use constant RSYNC_CSUMSEED_CACHE  => 32761;

my $default_blksize = 2048;
my $dotfreq=1000;
my %opts;
if ( !getopts("cCpavft:dVQ", \%opts) || @ARGV !=1
         || (defined($opts{v}) + defined($opts{f}) > 1)
         || (defined($opts{c}) + defined($opts{C}) + defined($opts{p}) > 1)
         || (defined($opts{Q}) + defined($opts{V}) > 1)) {
    print STDERR <<EOF;
usage: $0 [-c|-C|-p] [-v|-f] [-a] [-V|-Q] [-d] [-t TopDir] [File or Directory]
  Verify Rsync digest in compressed files containing digests.
  Ignores directories and files without digests (first byte not 0xd7)
  unless -a flag set.
  Only prints if digest inconsistent with file content unless verbose flag.
  Note: zero length files are skipped and not counted.

  Options:
    -c   Consider path relative to cpool directory
    -C   Entry is a single cpool file name (no path)
    -p   Consider path relative to pc directory
    -v   Verify rsync digests
    -f   Verify & fix rsync digests if invalid/wrong
    -a   Add rsync digests if missing
    -t   TopDir
    -d   Print a '.' to STDERR for every $dotfreq digest checks
    -V   Verbose - print result of each check
         (default just prints result on errors/fixes/adds)
    -Q   Don\'t print results even with errors/fixes/adds

In non-quiet mode, the output consists of 3 columns.
  1. inode number
  2. return code:
       0 = digest added
       1 = digest ok
       2 = digest invalid
       3 = no digest
       <0 other error (see source)
  3. file name

EOF
exit(1);
}

#NOTE: BackupPC::Xfer::RsyncDigest->digestAdd opens files O_RDWR so
#we should run as user backuppc!
die("BackupPC::Lib->new failed\n") if ( !(my $bpc = BackupPC::Lib->new) );
#die("BackupPC::Lib->new failed\n") if ( !(my $bpc = BackupPC::Lib->new("", "", "", 1)) ); #No user check

my $Topdir = $opts{t} ? $opts{t} : $bpc->TopDir();
$Topdir = $Topdir . '/';
$Topdir =~ s|//*|/|g;

my $root = '';
my $path;
if ($opts{C}) {
        $path = $bpc->MD52Path($ARGV[0], 1, "$Topdir/cpool");
        $path =~ m|(.*/)|;
        $root = $1; 
}
else {
        $root = $Topdir . "pc/" if $opts{p};
        $root = $Topdir . "cpool/" if $opts{c};
        $root =~ s|//*|/|g;
        $path = $root . $ARGV[0];
}

my $add = $opts{a};
my $verify = $opts{v};
my $fix = $opts{f};
my $verbose = $opts{V};
my $quiet = $opts{Q};
my $progress= $opts{d};

die "$0: Cannot read $path\n" unless (-r $path);

my $Log = sub {}; #No-op logger used by jdigestAdd

my ($totfiles, $totdigfiles, $totnodigfiles) = (0, 0, 0);
my ($totbadfiles, $totfixedfiles, $totaddedfiles) = (0, 0, 0);
find(\&verify_digest, $path); 

print STDERR "\n" if $progress;
$totaddedfiles = "NA" unless $add;
$totbadfiles = "NA" unless $verify || $fix;
$totfixedfiles = "NA" unless $fix; 
printf STDERR "Totfiles:   %s\tTotNOdigests:  %s\tTotADDEDdigests: %s\n",
        $totfiles, $totnodigfiles, $totaddedfiles;
printf STDERR "Totdigests: %s\tTotBADdigests: %s\tTotFIXEDdigests: %s\n",
        $totdigfiles, $totbadfiles, $totfixedfiles;
exit;

###############################################################################
sub verify_digest {
        return -200 unless (-f);
        return -201 unless -s > 0;
        my @fstat = stat(_);
        $totfiles++;

        if ($progress && !($totfiles%$dotfreq)) {
                print STDERR "."; 
                ++$|; # flush print buffer
        }

        my $action;
        #Check whether checksum is cached (i.e. first byte is 0xd7)
        if(BackupPC::Xfer::RsyncDigest->fileDigestIsCached($_)) {
                $totdigfiles++; #Digest exists
                if($fix) { #Verify & fix
                        $action = 1;
                } elsif($verify) { #Verify only
                        $action = 2;
                } else {
                        return 4; #Don't verify or fix
                }
        } else { #Missing digest
                $totnodigfiles++;
                if($add) {
                        $action = 0; #Add missing digest
                } else { #Skip over missing digest
                        $File::Find::name =~ m|\Q$root\E(.*)|; #Quote $root so path chars aren't treated as regex
                        printf("%d %d %s\n", (stat(_))[1], 3, $1) if $verbose;
                        return -202;
                }
        }


#Note: setting blockSize=-1 in my modified version of digestAdd (based
#on the original function from RsyncDigest.pm) means that the routine
#automatically calculates the blockSize based on the decompressed
#fileSize.
#Also leave out the final protocol_version input, since leaving it
#undefined makes the routine determine it automatically.
        my $ret = jdigestAdd(undef, $_, -1, RSYNC_CSUMSEED_CACHE,  $action);

        $totbadfiles++ unless $ret == 1 || $ret == 0;
        $totfixedfiles++ if $ret == 2 && $action == 1;
        $totaddedfiles++ if $ret == 0 && $action == 0;

        if ($verbose || ($ret!=1 && !$quiet)) {
                $File::Find::name =~ m|\Q$root\E(.*)|; #Quote $root so path chars aren't treated as regex
                printf "%d %d %s\n", (stat(_))[1], $ret, $1;
        }
        return $ret;
}

# Return codes:
# -100: Wrong RSYNC_CSUMSEED_CACHE or zero file size
# -101: Bad/missing RsyncLib
# -102: ZIO can't open file
# -103: sysopen can't open file
# -104: sysread can't read file
# -105: Bad first byte (not 0x78, 0xd6 or 0xd7)
# -106: Can't seek to end of data portion of file (i.e. beginning of digests)
# -107: First byte not 0xd7
# -108: Error on reading digest
# -109: Can't seek when trying to position to rewrite digest data (shouldn't happen if only verifying)
# -110: Can't write digest data (shouldn't happen if only verifying)
# -111: Can't seek looking for extraneous data after digest (shouldn't happen if only verifying)
# -112: Can't truncate extraneous data after digest (shouldn't happen if only verifying)
# -113: If can't sysseek back to file beginning (shouldn't happen if only verifying)
# -114: If can't write out first byte (0xd7) (shouldn't happen if only verifying)
# 1: Digest verified
# 2: Digest wrong

#-200: Not a file
#-201: Zero length file
#-202: No cached checksum

######################################################################
#The following is required to use my modded version of the digestAdd subroutine

use Fcntl;
use vars qw( $RsyncLibOK );
BEGIN {
    eval "use File::RsyncP;";
    if ( $@ ) {
        #
        # File::RsyncP doesn't exist.  Define some dummy constant
        # subs so that the code below doesn't barf.
        #
        $RsyncLibOK = 0;
    } else {
        $RsyncLibOK = 1;
    }
};

use constant DEF_BLK_SIZE  => 2048;
use constant MAX_BLK_SIZE => 16384;
use constant FILE_MIN => (DEF_BLK_SIZE * 10000);
use constant FILE_MAX => (MAX_BLK_SIZE * 10000);

#JJK: Revised version of digestAdd
#JJK: if $blocksize=-1, then dynamically set blocksize to the correct size
#JJK: Specifically:
#JJK: $blockSize = DEF_BLK_SIZE (2048) if $fileSize <= FILE_MIN (2048*10000);
#JJK: $blockSize = MAX_BLK_SIZE (16384) if $fileSize >= FILE_MAX (16384*10000);
#JJK: otherwise $blocksize = int($filesize/10000);
#JJK:           $blockSize += 4 if (($blockSize + 4) % 64) == 0 ;

# Compute and add rsync block and file digests to the given file.
#
# Empty files don't get cached checksums.
#
# If verify is set then existing cached checksums are checked.
# If verify == 2 then only a verify is done; no fixes are applied.
# 
# Returns 0 on success.  Returns 1 on good verify and 2 on bad verify.
# Returns a variety of negative values on error.
#
sub jdigestAdd
{
    my($class, $file, $blockSize, $checksumSeed, $verify,
                $protocol_version) = @_;
    my $retValue = 0;

    #
    # Don't cache checksums if the checksumSeed is not RSYNC_CSUMSEED_CACHE
    # or if the file is empty.
    #
    return -100 if ( $checksumSeed != RSYNC_CSUMSEED_CACHE || !-s $file );


        my $dynamic = 0;
    if ( $blockSize == 0 ) {
        &$Log("digestAdd: bad blockSize ($file, $blockSize, $checksumSeed)");
        $blockSize = 2048;
    }
        elsif( $blockSize == -1) { #JJK added
                $blockSize = DEF_BLK_SIZE;
                $dynamic =1;
        }
    return -101 if ( !$RsyncLibOK );
    return -102 if ( !defined(my $fh = BackupPC::FileZIO->open($file, 0, 1)) );

    my($data, $fileDigest);
start: 
        my $blockDigest = '';
        my $nBlks = int(65536 * 16 / $blockSize) + 1;

    my $digest = File::RsyncP::Digest->new;
    $digest->protocol($protocol_version)
                        if ( defined($protocol_version) );
    $digest->add(pack("V", $checksumSeed)) if ( $checksumSeed );

   my $fileSize = 0;
    while ( 1 ) {
        $fh->read(\$data, $nBlks * $blockSize);
        $fileSize += length($data);
        last if ( $data eq "" );
                if($dynamic && $fileSize > FILE_MIN) { #JJK: Figure out file size & start over
                        while(1) {
                                $fh->read(\$data, $nBlks * $blockSize);
                                $fileSize += length($data);
                                last if ( $data eq ""  || $fileSize >= FILE_MAX);
                        }
                        $blockSize = int($fileSize/10000);
                        $blockSize = MAX_BLK_SIZE if $blockSize > MAX_BLK_SIZE;
                        $blockSize += 4 if (($blockSize + 4) % 64) == 0 ;
                        $dynamic = 0;
                        $fh->rewind;
                        goto start;
                }
                $blockDigest .= $digest->blockDigest($data, $blockSize, 16,
                                                     $checksumSeed);
                $digest->add($data);
    }
    $fileDigest = $digest->digest2;
    my $eofPosn = sysseek($fh->{fh}, 0, 1);
    $fh->close;
    my $rsyncData = $blockDigest . $fileDigest;
    my $metaData  = pack("VVVV", $blockSize,
                                 $checksumSeed,
                                 length($blockDigest) / 20,
                                 0x5fe3c289,                # magic number
                        );
    my $data2 = chr(0xb3) . $rsyncData . $metaData;
#    printf("appending %d+%d bytes to %s at offset %d\n",
#                                            length($rsyncData),
#                                            length($metaData),
#                                            $file,
#                                            $eofPosn);
    sysopen(my $fh2, $file, O_RDWR) || return -103;
    binmode($fh2);
    return -104 if ( sysread($fh2, $data, 1) != 1 );
    if ( $data ne chr(0x78) && $data ne chr(0xd6) && $data ne chr(0xd7) ) {
        &$Log(sprintf("digestAdd: $file has unexpected first char 0x%x",
                             ord($data)));
        return -105;
    }
    return -106 if ( sysseek($fh2, $eofPosn, 0) != $eofPosn );
    if ( $verify ) {
        my $data3;

        #
        # Verify the cached checksums
        #
        return -107 if ( $data ne chr(0xd7) );
        return -108 if ( sysread($fh2, $data3, length($data2) + 1) < 0 );
        if ( $data2 eq $data3 ) {
            return 1;
        }
        #
        # Checksums don't agree - fall through so we rewrite the data
        #
        &$Log(sprintf("digestAdd: %s verify failed; redoing checksums; len = %d,%d; eofPosn = %d, fileSize = %d",
                $file, length($data2), length($data3), $eofPosn, $fileSize));
        #&$Log(sprintf("dataNew  = %s", unpack("H*", $data2)));
        #&$Log(sprintf("dataFile = %s", unpack("H*", $data3)));
        return -109 if ( sysseek($fh2, $eofPosn, 0) != $eofPosn );
        $retValue = 2;
        return $retValue if ( $verify == 2 );
    }
    return -110 if ( syswrite($fh2, $data2) != length($data2) );
    if ( $verify ) {
        #
        # Make sure there is no extraneous data on the end of
        # the file.  Seek to the end and truncate if it doesn't
        # match our expected length.
        #
        return -111 if ( !defined(sysseek($fh2, 0, 2)) );
        if ( sysseek($fh2, 0, 1) != $eofPosn + length($data2) ) {
            if ( !truncate($fh2, $eofPosn + length($data2)) ) {
                &$Log(sprintf("digestAdd: $file truncate from %d to %d failed",
                                sysseek($fh2, 0, 1), $eofPosn + length($data2)));
                return -112;
            } else {
                &$Log(sprintf("digestAdd: %s truncated from %d to %d",
                                $file,
                                sysseek($fh2, 0, 1), $eofPosn + length($data2)));
            }
        }
    }
    return -113 if ( !defined(sysseek($fh2, 0, 0)) );
    return -114 if ( syswrite($fh2, chr(0xd7)) != 1 );
    close($fh2);
    return $retValue;
}
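
The dynamic block-size rule used by jdigestAdd when blockSize is -1 can be isolated as a small pure function. This is only a sketch of the rule as the code above applies it (default size up to FILE_MIN, otherwise fileSize/10000 capped at MAX_BLK_SIZE, then the +4 adjustment). The name dynamic_block_size is hypothetical, and it takes the uncompressed size as an argument instead of reading the file.

```perl
use strict;
use warnings;

use constant DEF_BLK_SIZE => 2048;
use constant MAX_BLK_SIZE => 16384;
use constant FILE_MIN     => (DEF_BLK_SIZE * 10000);

# Hypothetical helper: block size for a given uncompressed file size,
# mirroring the steps jdigestAdd takes in dynamic mode.
sub dynamic_block_size {
    my ($fileSize) = @_;

    # Small files keep the default block size.
    return DEF_BLK_SIZE if $fileSize <= FILE_MIN;

    # Larger files aim for roughly 10000 blocks, capped at the maximum.
    my $blockSize = int($fileSize / 10000);
    $blockSize = MAX_BLK_SIZE if $blockSize > MAX_BLK_SIZE;

    # Same adjustment as in jdigestAdd: skip sizes where
    # ($blockSize + 4) is a multiple of 64.
    $blockSize += 4 if (($blockSize + 4) % 64) == 0;
    return $blockSize;
}

printf "%12d bytes -> block size %d\n", $_, dynamic_block_size($_)
    for (1_000, 30_000_000, 21_080_000, 1_000_000_000);
```

Keeping the rule in one place makes it easy to check which block size a digest was built with when a verify fails unexpectedly.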
