Subject: Re: [BackupPC-users] Wrong blocksize (!=2048) in rsync checksums for some files SOLVED BUT STILL A QUESTION...
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Mon, 21 Feb 2011 01:26:46 -0500
Jeffrey J. Kosowsky wrote at about 21:32:46 -0500 on Saturday, February 19, 2011:
 > Jeffrey J. Kosowsky wrote at about 01:27:15 -0500 on Friday, February 18, 2011:
 >  > Jeffrey J. Kosowsky wrote at about 16:23:49 -0500 on Thursday, February 17, 2011:
 >  >  > Jeffrey J. Kosowsky wrote at about 15:41:26 -0500 on Thursday, February 17, 2011:
 >  >  >  > I have been running my BackupPC_digestVerify.pl program to check the
 >  >  >  > rsync digests in my pool.
 >  >  >  > 
 >  >  >  > Looking through the 1/x/x/ tree, I found 3 new bad digests out of
 >  >  >  > about 36000 when using the default blocksize of 2048.
 >  >  >  > 
 >  >  >  > It turns out that those 3 digests have a blocksize !=2048 -- and
 >  >  >  > indeed the digests do verify if you use that blocksize.
 >  >  >  > These files have block size 2327 and 9906 (twice).
 >  >  >  > Note the file sizes are 99MB, 11MB, and 16MB.
 >  >  >  > 
 >  >  >  > This seems *weird* and *wrong* since I thought the blocksize was fixed
 >  >  >  > to 2048 according to the (default) parameters passed to rsync in the
 >  >  >  > config.pl file. Specifically,
 >  >  >  >       '--block-size=2048',
 >  >  >  > 
 >  >  >  > Any idea why rsync may be ignoring this and using a larger blocksize
 >  >  >  > for these files?
 >  >  > 
 >  >  > OK this is weird... the block size used is the *uncompressed* file
 >  >  > size divided by 10,000 (rounded to integer). 
 >  >  > 
 >  >  > This too is weird since the normal rsync algorithm uses the rounded
 >  >  > sqrt of the (uncompressed) file length for the blocksize (as long as
 >  >  > it is >700 and < MAX_BLOCK_SIZE which I think may be 16,384).
 >  >  > 
 >  >  > So what is going on here and why is rsync neither using the
 >  >  > --block-size=2048 value nor the heuristic sqrt(filesize) number?
 >  >  > 
 >  > 
 >  > OK - I see some code in RsyncDigest.pm that seems to set the
 >  > blocksize to:
 >  >             defaultBlksize   if filesize/10000 < defaultBlkSize
 >  >             filesize/10000
 >  >             16384            if filesize/10000 > 16384
 >  > where it seems that defaultBlkSize = 700
 >  > 
 >  > Not sure why filesize/10000 is chosen though rather than
 >  > sqrt(filesize) as per the regular rsync algorithm heuristic.
 >  > 
 >  > Also, I'm confused about how this reconciles with the rsync parameter
 >  > that would seemingly force the block size to 2048. And indeed nearly
 >  > all the cpool files do have a blocksize of 2048.
 >  > 
 >  > Now since the appended rsync digest doesn't record the blocksize (only
 >  > the number of blocks), how does BackupPC on the next round know
 >  > whether the blocksize is 2048 or the one set by the above
 >  > heuristic. And if BackupPC does not know which then it would seem that
 >  > the rsync checksum is not going to be helpful.
 >  > 
 >  > In particular, if rsync is given the rsync arg of --block-size=2048,
 >  > then won't cpool files with blocksize != 2048 cause rsync to waste
 >  > time trying to align blocks based on incompatible block sizes?
 >  > 
 >  > So, either I am missing something here (very likely) or something is
 >  > broken...
 >  > 
 >  > And again, this blocksize != 2048 seems to only affect a *small*
 >  > fraction of all the files with an rsync digest (maybe about 1-2 per
 >  > 1000 files with digests)
 >  > 
 > 
 > I just checked through my whole cpool and found 782 files with
 > blocksize != 2048 out of 569816 files with digests (which is about
 > one in every 700 files).
 > 

OK... I think the code that sets it to something other than 2048 is
actually in File::RsyncP (in fileCsumSend). Here is the code snippet...
                # The local file is a regular file, so generate and
                # send the checksums.
                #

                #
                # Compute adaptive block size, from $rs->{blockSize}
                # to 16384 based on file size.
                #
                if ( $blkSize <= 0 ) {
                    $blkSize = int($attr->{size} / 10000);
                    $blkSize = $rs->{blockSize}
                                    if ( $blkSize < $rs->{blockSize} );
                    $blkSize = 16384 if ( $blkSize > 16384 );
                }

So, it seems that 2048 is actually only the *minimum* blocksize: if a
file has (uncompressed) length > 2048 * 10,000 = 20.48 MB, then the
blocksize is determined by min(int(filesize/10000), 16384).
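For concreteness, that selection rule can be reproduced standalone (just a sketch; `adaptive_blksize` is my own name, not a File::RsyncP routine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reproduce the File::RsyncP block-size selection from the snippet above:
# floor(size/10000), clamped to the range [2048, 16384].
sub adaptive_blksize {
    my ($size) = @_;
    my $blk = int($size / 10000);
    $blk = 2048  if $blk < 2048;    # minimum: the --block-size value
    $blk = 16384 if $blk > 16384;   # hard cap
    return $blk;
}

# Files up to 20.48 MB get 2048; above 163.84 MB the cap kicks in.
printf "%12d -> %5d\n", $_, adaptive_blksize($_)
    for (10_000_000, 20_480_000, 50_000_000, 99_060_000, 200_000_000);
```

Note that an uncompressed size of about 99.06 MB yields 9906, which is consistent with one of the odd block sizes observed earlier.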

This leaves me still with some questions/comments:
1. This usage of the rsync --block-size argument is CONFUSING, since
   according to the rsync manpages, setting --block-size should *fix*
   the block-size rather than set a minimum that is then adjusted by a
   heuristic.

2. Furthermore, the "adaptive block size" used in File::RsyncP of
   dividing the file size by 10,000 is likely to be less ideal than the
   sqrt(filesize) heuristic used in the standard rsync.

3. The calculation of blockSize seems to differ between fileCsumSend in
   File::RsyncP and DigestAdd in RsyncDigest.pm.
   Specifically, DigestAdd includes the extra correction line:
        $blkSize += 4 if ( (($blkSize + 4) % 64) == 0 );
   But somehow, it still works out...

4. Finally, it seems to me that rsync will not be efficient when
   comparing blocks of files that have changed in size. Since the block
   size is derived from the original file's size, a new file of a
   different size will end up with a different block size, so the saved
   block checksums can't be reused and the file will have to be
   decompressed to compute new block checksums. Though this may not be
   avoidable...
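On point 3, the extra correction line in DigestAdd only fires when $blkSize is congruent to 60 mod 64, which the fixed default of 2048 never is; a quick standalone check (just a sketch) makes that concrete:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# DigestAdd's correction bumps $blkSize by 4 whenever ($blkSize + 4) is a
# multiple of 64, i.e. whenever $blkSize % 64 == 60.  List the affected
# values in the adaptive range 2048..16384:
my @bumped = grep { (($_ + 4) % 64) == 0 } (2048 .. 16384);

printf "default 2048 bumped? %s\n", ((2048 + 4) % 64) == 0 ? "yes" : "no";
printf "%d adaptive sizes affected; first few: %s\n",
    scalar(@bumped), join(", ", @bumped[0 .. 3]);
```

Since 2048 itself never needs the bump, files that take the default block size produce identical digests from both code paths, which is presumably why "it still works out" in practice.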

In any case, knowing that the blocksize is actually *always*
dynamically calculated required me to rewrite my
BackupPC_digestVerify.pl routine.

To do this, I modified Craig Barratt's original digestAdd routine (from
RsyncDigest.pm) so that it now calculates the appropriate $blockSize if
the input parameter $blockSize is set to -1. (Note that if $blockSize
is set to 0, then a fixed default block size of 2048 is used as
before.)

To be efficient and avoid always decompressing the cpooled file twice
(or holding it all in memory), my modification starts out assuming the
block size is 2048 (which is true for all files < 20.48 MB). If the
file extends beyond that, it just reads the rest of the file to find
the file size, calculates the block size from that, and restarts the
digest calculation. There might be ways to be slightly more efficient,
but I wanted to stay as true to the original digestAdd code as
possible (plus, at least on my system, the vast majority of files are
< 20 MB).

Anyway here is the code that now finally seems to properly verify
and/or fix and/or add rsync checksum digests to any compressed file.

-------------------------------------------------------------------------------

#!/usr/bin/perl
#========================================================================
#
# BackupPC_digestVerify.pl
#                       
#
# DESCRIPTION

#   Check contents of cpool and/or pc tree entries (or the entire
#   tree) against the stored rsync block and file checksum digests,
#   including the 2048-byte block checksums (Adler32 + md4) and the
#   full file md4sum.

#   Optionally *fix* invalid digests (using the -f flag).
#   Optionally *add* digests to compressed files that don't have a digest.

#
# AUTHOR
#   Jeff Kosowsky (plus modified version of Craig Barratt's digestAdd code)
#
# COPYRIGHT
#   Copyright (C) 2010, 2011  Jeff Kosowsky
#   Copyright (C) 2001-2009  Craig Barratt (digestAdd code)
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#
#========================================================================
#
# Version 0.3, released February 2011
#
#========================================================================

use strict;
use warnings;
use Getopt::Std;

use lib "/usr/share/BackupPC/lib";
use BackupPC::Xfer::RsyncDigest;
use BackupPC::Lib;
use File::Find;

use constant RSYNC_CSUMSEED_CACHE  => 32761;

my $default_blksize = 2048;
my $dotfreq=1000;
my %opts;
if ( !getopts("cCpavft:dVQ", \%opts) || @ARGV != 1
         || (defined($opts{v}) + defined($opts{f}) > 1)
         || (defined($opts{c}) + defined($opts{C}) + defined($opts{p}) > 1)
         || (defined($opts{Q}) + defined($opts{V}) > 1)) {
    print STDERR <<EOF;
usage: $0 [-c|-C|-p] [-v|-f] [-a] [-V|-Q] [-d] [-t TopDir] [File or Directory]
  Verify Rsync digest in compressed files containing digests.
  Ignores directories and files without digests (firstbyte = 0xd7)
  Only prints if digest inconsistent with file content unless verbose flag
  Note: zero length files are skipped and not counted

  Options:
    -c   Consider path relative to cpool directory
    -C   Entry is a single cpool file name (no path)
    -p   Consider path relative to pc directory
    -v   Verify rsync digests
    -f   Verify & fix rsync digests if invalid/wrong
    -a   Add rsync digests if missing
    -t   TopDir
    -d   Print a '.' to STDERR for every $dotfreq digest checks
    -V   Verbose - print result of each check
         (default just prints result on errors/fixes/adds)
    -Q   Don\'t print results even with errors/fixes/adds


In non-quiet mode, the output consists of 3 columns.
  1. inode number
  2. return code:
       0 = digest added
       1 = digest ok
       2 = digest invalid
       3 = no digest
       <0 other error (see source)
  3. file name

EOF
exit(1);
}

#NOTE: BackupPC::Xfer::RsyncDigest->digestAdd opens files O_RDWR so
#we should run as user backuppc!
die("BackupPC::Lib->new failed\n") if ( !(my $bpc = BackupPC::Lib->new) );
#die("BackupPC::Lib->new failed\n") if ( !(my $bpc = BackupPC::Lib->new("", "", "", 1)) ); #No user check

my $Topdir = $opts{t} ? $opts{t} : $bpc->TopDir();
$Topdir = $Topdir . '/';
$Topdir =~ s|//*|/|g;

my $root = '';
my $path;
if ($opts{C}) {
        $path = $bpc->MD52Path($ARGV[0], 1, "${Topdir}cpool");
        $path =~ m|(.*/)|;
        $root = $1; 
}
else {
        $root = $Topdir . "pc/" if $opts{p};
        $root = $Topdir . "cpool/" if $opts{c};
        $root =~ s|//*|/|g;
        $path = $root . $ARGV[0];
}

my $add = $opts{a};
my $verify = $opts{v};
my $fix = $opts{f};
my $verbose = $opts{V};
my $quiet = $opts{Q};
my $progress= $opts{d};

die "$0: Cannot read $path\n" unless (-r $path);

my $Log = sub {}; # no-op logger (jdigestAdd calls &$Log)

my ($totfiles, $totdigfiles, $totnodigfiles) = (0, 0, 0);
my ($totbadfiles, $totfixedfiles, $totaddedfiles) = (0, 0, 0);
find(\&verify_digest, $path); 

print STDERR "\n" if $progress;
$totaddedfiles = "NA" unless $add;
$totbadfiles = "NA" unless $verify || $fix;
$totfixedfiles = "NA" unless $fix; 
printf STDERR "Totfiles:   %s\tTotNOdigests:  %s\tTotADDEDdigests: %s\n",
        $totfiles, $totnodigfiles, $totaddedfiles;
printf STDERR "Totdigests: %s\tTotBADdigests: %s\tTotFIXEDdigests: %s\n",
        $totdigfiles, $totbadfiles, $totfixedfiles;
exit;

#########################################################################################################################
sub verify_digest {
        return -200 unless (-f);
        return -201 unless -s > 0;
        my @fstat = stat(_);
        $totfiles++;

        if ($progress && !($totfiles%$dotfreq)) {
                print STDERR "."; 
                ++$|; # flush print buffer
        }

        my $action;
        #Check whether checksum is cached (i.e. first byte not 0xd7)
        if(BackupPC::Xfer::RsyncDigest->fileDigestIsCached($_)) {
                $totdigfiles++; #Digest exists
                if($fix) { #Verify & fix
                        $action = 1;
                } elsif($verify) { #Verify only
                        $action = 2;
                } else {
                        return 4; #Don't verify or fix
                }
        } else { #Missing digest
                $totnodigfiles++;
                if($add) {
                        $action = 0; #Add missing digest
                } else { #Skip over missing digest
                        $File::Find::name =~ m|$root(.*)|;
                        printf("%d %d %s\n", (stat(_))[1], 3, $1) if $verbose;
                        return -202;
                }
        }


#Note: setting blockSize=-1 in my modified version of digestAdd (based
#on the original function from RsyncDigest.pm) means that the routine
#automatically calculates the blockSize based on the decompressed
#fileSize.
#Also leave out the final protocol_version input; by leaving it
#undefined we let the routine determine it automatically.
        my $ret = jdigestAdd(undef, $_, -1, RSYNC_CSUMSEED_CACHE,  $action);

        $totbadfiles++ unless $ret == 1 || $ret == 0;
        $totfixedfiles++ if $ret == 2 && $action == 1;
        $totaddedfiles++ if $ret == 0 && $action == 0;

        if ($verbose || ($ret!=1 && !$quiet)) {
                $File::Find::name =~ m|$root(.*)|;
                printf "%d %d %s\n", (stat(_))[1], $ret, $1;
        }
        return $ret;
}

# Return codes:
# -100: Wrong RSYNC_CSUMSEED_CACHE or zero file size
# -101: Bad/missing RsyncLib
# -102: ZIO can't open file
# -103: sysopen can't open file
# -104: sysread can't read file
# -105: Bad first byte (not 0x78, 0xd6 or 0xd7)
# -106: Can't seek to end of data portion of file (i.e. beginning of digests)
# -107: First byte not 0xd7
# -108: Error on reading digest
# -109: Can't seek when trying to position to rewrite digest data (shouldn't happen if only verifying)
# -110: Can't write digest data (shouldn't happen if only verifying)
# -111: Can't seek looking for extraneous data after digest (shouldn't happen if only verifying)
# -112: Can't truncate extraneous data after digest (shouldn't happen if only verifying)
# -113: Can't sysseek back to file beginning (shouldn't happen if only verifying)
# -114: Can't write out first byte (0xd7) (shouldn't happen if only verifying)
# 1: Digest verified
# 2: Digest wrong
#
# -200: Not a file
# -201: Zero length file
# -202: No cached checksum

######################################################################
#The following is required to use my modified version of the digestAdd
#subroutine

use Fcntl;
use vars qw( $RsyncLibOK );
BEGIN {
    eval "use File::RsyncP;";
    if ( $@ ) {
        #
        # File::RsyncP doesn't exist.  Define some dummy constant
        # subs so that the code below doesn't barf.
        #
        $RsyncLibOK = 0;
    } else {
        $RsyncLibOK = 1;
    }
};

use constant DEF_BLK_SIZE  => 2048;
use constant MAX_BLK_SIZE => 16384;
use constant FILE_MIN => (DEF_BLK_SIZE * 10000);
use constant FILE_MAX => (MAX_BLK_SIZE * 10000);

#JJK: Revised version of digestAdd
#JJK: If $blockSize == -1, then dynamically set the blocksize to the correct size.
#JJK: Specifically:
#JJK:   $blockSize = DEF_BLK_SIZE (2048)  if $fileSize <= FILE_MIN (2048*10000);
#JJK:   $blockSize = MAX_BLK_SIZE (16384) if $fileSize >= FILE_MAX (16384*10000);
#JJK:   otherwise $blockSize = int($fileSize/10000);
#JJK:             $blockSize += 4 if (($blockSize + 4) % 64) == 0;

# Compute and add rsync block and file digests to the given file.
#
# Empty files don't get cached checksums.
#
# If verify is set then existing cached checksums are checked.
# If verify == 2 then only a verify is done; no fixes are applied.
# 
# Returns 0 on success.  Returns 1 on good verify and 2 on bad verify.
# Returns a variety of negative values on error.
#
sub jdigestAdd
{
    my($class, $file, $blockSize, $checksumSeed, $verify,
                $protocol_version) = @_;
    my $retValue = 0;

    #
    # Don't cache checksums if the checksumSeed is not RSYNC_CSUMSEED_CACHE
    # or if the file is empty.
    #
    return -100 if ( $checksumSeed != RSYNC_CSUMSEED_CACHE || !-s $file );


    my $dynamic = 0;
    if ( $blockSize == 0 ) {
        &$Log("digestAdd: bad blockSize ($file, $blockSize, $checksumSeed)");
        $blockSize = 2048;
    }
    elsif ( $blockSize == -1 ) { #JJK added
        $blockSize = DEF_BLK_SIZE;
        $dynamic = 1;
    }
    return -101 if ( !$RsyncLibOK );
    return -102 if ( !defined(my $fh = BackupPC::FileZIO->open($file, 0, 1)) );

    my($data, $fileDigest);
start:
    my $blockDigest = '';
    my $nBlks = int(65536 * 16 / $blockSize) + 1;

    my $digest = File::RsyncP::Digest->new;
    $digest->protocol($protocol_version)
                        if ( defined($protocol_version) );
    $digest->add(pack("V", $checksumSeed)) if ( $checksumSeed );

    my $fileSize = 0;
    while ( 1 ) {
        $fh->read(\$data, $nBlks * $blockSize);
        $fileSize += length($data);
        last if ( $data eq "" );
        if ( $dynamic && $fileSize > FILE_MIN ) { #JJK: Figure out file size & start over
            while ( 1 ) {
                $fh->read(\$data, $nBlks * $blockSize);
                $fileSize += length($data);
                last if ( $data eq "" || $fileSize >= FILE_MAX );
            }
            $blockSize = int($fileSize / 10000);
            $blockSize = MAX_BLK_SIZE if $blockSize > MAX_BLK_SIZE;
            $blockSize += 4 if ( (($blockSize + 4) % 64) == 0 );
            $dynamic = 0;
            $fh->rewind;
            goto start;
        }
        $blockDigest .= $digest->blockDigest($data, $blockSize, 16, $checksumSeed);
        $digest->add($data);
    }
    $fileDigest = $digest->digest2;
    my $eofPosn = sysseek($fh->{fh}, 0, 1);
    $fh->close;
    my $rsyncData = $blockDigest . $fileDigest;
    my $metaData  = pack("VVVV", $blockSize,
                                 $checksumSeed,
                                 length($blockDigest) / 20,
                                 0x5fe3c289,                # magic number
                        );
    my $data2 = chr(0xb3) . $rsyncData . $metaData;
#    printf("appending %d+%d bytes to %s at offset %d\n",
#                                            length($rsyncData),
#                                            length($metaData),
#                                            $file,
#                                            $eofPosn);
    sysopen(my $fh2, $file, O_RDWR) || return -103;
    binmode($fh2);
    return -104 if ( sysread($fh2, $data, 1) != 1 );
    if ( $data ne chr(0x78) && $data ne chr(0xd6) && $data ne chr(0xd7) ) {
        &$Log(sprintf("digestAdd: $file has unexpected first char 0x%x",
                             ord($data)));
        return -105;
    }
    return -106 if ( sysseek($fh2, $eofPosn, 0) != $eofPosn );
    if ( $verify ) {
        my $data3;

        #
        # Verify the cached checksums
        #
        return -107 if ( $data ne chr(0xd7) );
        return -108 if ( sysread($fh2, $data3, length($data2) + 1) < 0 );
        if ( $data2 eq $data3 ) {
            return 1;
        }
        #
        # Checksums don't agree - fall through so we rewrite the data
        #
        &$Log(sprintf("digestAdd: %s verify failed; redoing checksums; len = %d,%d; eofPosn = %d, fileSize = %d",
                $file, length($data2), length($data3), $eofPosn, $fileSize));
        #&$Log(sprintf("dataNew  = %s", unpack("H*", $data2)));
        #&$Log(sprintf("dataFile = %s", unpack("H*", $data3)));
        return -109 if ( sysseek($fh2, $eofPosn, 0) != $eofPosn );
        $retValue = 2;
        return $retValue if ( $verify == 2 );
    }
    return -110 if ( syswrite($fh2, $data2) != length($data2) );
    if ( $verify ) {
        #
        # Make sure there is no extraneous data on the end of
        # the file.  Seek to the end and truncate if it doesn't
        # match our expected length.
        #
        return -111 if ( !defined(sysseek($fh2, 0, 2)) );
        if ( sysseek($fh2, 0, 1) != $eofPosn + length($data2) ) {
            if ( !truncate($fh2, $eofPosn + length($data2)) ) {
                &$Log(sprintf("digestAdd: $file truncate from %d to %d failed",
                                sysseek($fh2, 0, 1), $eofPosn + length($data2)));
                return -112;
            } else {
                &$Log(sprintf("digestAdd: %s truncated from %d to %d",
                                $file,
                                sysseek($fh2, 0, 1), $eofPosn + length($data2)));
            }
        }
    }
    return -113 if ( !defined(sysseek($fh2, 0, 0)) );
    return -114 if ( syswrite($fh2, chr(0xd7)) != 1 );
    close($fh2);
    return $retValue;
}
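Incidentally, the 16-byte metaData trailer that digestAdd appends does pack the blockSize (the first "V" of the VVVV quad), so the block size actually used can be recovered from a cpool file without decompressing anything. A minimal sketch (read_digest_meta is my own name, not a BackupPC routine; it assumes the file already carries a digest trailer):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Read the 16-byte metaData trailer packed as VVVV by digestAdd above:
# (blockSize, checksumSeed, nBlocks, magic 0x5fe3c289).
sub read_digest_meta {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "open $file: $!";
    binmode($fh);
    seek($fh, -16, SEEK_END) or die "seek: $!";
    read($fh, my $meta, 16) == 16 or die "short read";
    close($fh);
    my ($blkSize, $seed, $nBlocks, $magic) = unpack("VVVV", $meta);
    die "no digest trailer (bad magic)\n" unless $magic == 0x5fe3c289;
    return ($blkSize, $seed, $nBlocks);
}
```

So BackupPC can in principle learn the per-file block size from the trailer on the next round, even when it differs from the --block-size=2048 argument.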
