Subject: Re: [BackupPC-users] Could Digest::MD5 be broken on ARM-based - SOLUTION TO FIX broken pools computers?
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Tue, 25 Jan 2011 22:40:29 -0500
Jeffrey J. Kosowsky wrote at about 20:07:37 -0500 on Sunday, January 23, 2011:
 > Jeffrey J. Kosowsky wrote at about 19:19:54 -0500 on Sunday, January 23, 2011:
 >  > I was testing some of my md5sum routines and I kept getting weird
 >  > results on ARM-based computers.
 >  > 
 >  > Specifically, the pool file md5sum numbers were different depending on
 >  > whether I computed them under Fedora 12 on an x86 machine vs under
 >  > Debian Lenny on an ARM-based computer.
 >  > 
 >  > This obviously creates issues if you want to move your backup drive
 >  > between different CPUs.
 >  > 
 >  > I narrowed it down to Digest::MD5, by doing the following 1-liner:
 >  > perl -e 'use Digest::MD5 qw(md5_hex); $file="testfile"; $size=(stat($file))[7]; $body=`cat $file`; print md5_hex($size,$body) . "\n";'
 >  > 
 >  > This should be the same as:
 >  > perl -e '$file="testfile"; $size=(stat($file))[7]; $body=`cat $file`; print $size, $body;' | md5sum
 >  > 
 >  > For maybe 1% of the files in my pool, the ARM machine gave the wrong
 >  > answer when using Digest::MD5.
 >  > 
 >  > So, something must be wacko in the perl implementation of Digest::MD5
 >  > on ARM machines!
 >  > 
 > 
 > Well, what do you know, Perl 5.10.0 (at least in Debian, but I think
 > upstream too) is broken on ARM processors.
 > 
 > Something about 32-bit alignment.
 > You need to upgrade to 5.10.1 -- and now I wasted a day on this...
 > And now I need to write code to fix my pool - YUCK!
 > 
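The quantity the quoted one-liners hash can be sketched outside of Perl as well. Here is a minimal Python illustration (the function name is mine, not from BackupPC) of what md5_hex($size, $body) computes, i.e. the MD5 of the file size as a decimal string followed by the file body:

```python
import hashlib

def size_plus_body_md5(path):
    # MD5 of the file's size (decimal string) concatenated with its body --
    # the same input the quoted md5_hex($size, $body) one-liner hashes.
    with open(path, "rb") as f:
        body = f.read()
    return hashlib.md5(str(len(body)).encode() + body).hexdigest()
```

A correct Digest::MD5 build gives the same value as piping "$size$body" through md5sum; the broken ARM build in Perl 5.10.0 disagrees for some inputs.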

Well, I went through my pool carefully and it seems like the error
affects close to HALF of my pool files. This is a real mess and PITA.
BUT, I wrote a perl routine that goes through the pool and/or cpool
and corrects all the entries. Specifically, it
1. Goes through the pool and calculates the actual MD5sum path for the
   file (using my zFile2MD5 routine if it is in the cpool which avoids
   decompressing the entire file).

2. If the calculated partial file MD5sum differs from the current
   filename, then the routine finds the first empty spot in the chain
   of the corrected MD5sum. If there is already a chain there (of at
   least one file), the routine compares files (again using my faster
   zcompare routine if compressed) to see if there already is a
   match. If there is a match, then it is flagged for later correction
   by a program like my BackupPC_fixLinks.pl program. While strictly
   speaking there is no danger in having more than one copy of the
   same file in a chain (and it is necessary when nlinks > MAXLINKS),
   it is not efficient, so it is detected and flagged. Note, though, that
   in general you shouldn't have many such collisions, since if the
   MD5sum was broken once it was probably broken the whole time
   (unless you switched back and forth between broken and non-broken
   Perl versions).

3. The program then renames (i.e. moves) the file and intelligently
   fills in any holes in the old chain in a way that minimizes chain
   renumbering and that preserves the relative ordering of chain
   numbering.
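The hole-filling in step 3 can be sketched as follows. This is just an illustration in Python, not the actual implementation (which lives in renumber_pool_chain() in my jLib.pm); suffix -1 stands for the bare, unsuffixed pool name, 0 for "_0", and so on:

```python
def renumber_chain(suffixes):
    # suffixes: the suffix numbers that survive in a chain after some
    # entries were moved away (-1 = bare name, 0 = '_0', 1 = '_1', ...).
    # Returns (old, new) rename pairs that close the holes while keeping
    # the surviving entries in their original relative order.
    renames = []
    for new, old in enumerate(sorted(suffixes), start=-1):
        if old != new:
            renames.append((old, new))
    return renames

# If '_0' and '_2' were moved out, the survivors -1, 1, 3 slide down:
# renumber_chain([-1, 1, 3]) -> [(1, 0), (3, 1)]
```

Renaming from the lowest surviving suffix upward means each target slot is already vacant when its rename happens, which is why the script sorts each chain before processing it.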

Note the routine can in general be used to check and fix the
integrity of the pool/cpool, so it may be more generally useful.
The program uses routines from my jLib.pm module and requires the
latest version that I have not yet posted (but will email if anybody
needs it).
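For orientation, the pool layout the script walks maps each digest to a path whose first three hex digits become directory levels, with numeric suffixes for chain collisions. A rough Python sketch of the mapping (the real code uses BackupPC::Lib's MD52Path(); the function and argument names here are illustrative):

```python
def md5_to_pool_path(topdir, pool, digest, suffix=-1):
    # First three hex digits of the digest form the directory levels;
    # chain collisions get suffixes _0, _1, ... (suffix -1 = bare name).
    path = "%s/%s/%s/%s/%s/%s" % (topdir, pool,
                                  digest[0], digest[1], digest[2], digest)
    return path if suffix < 0 else "%s_%d" % (path, suffix)
```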

Here though is the perl code for the routine itself:
---------------------------------------------------------------------------

#!/usr/bin/perl
#============================================================= -*-perl-*-
#
# BackupPC_fixPoolMdsums: Rename/move pool files if md5sum path name is invalid
#
# DESCRIPTION
#   See 'usage' for more detailed description of what it does
#   
# AUTHOR
#   Jeff Kosowsky
#
# COPYRIGHT
#   Copyright (C) 2011  Jeff Kosowsky
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#
#========================================================================
#
# Version 0.1, released January 2011
#
#========================================================================

use strict;
use warnings;

use lib "/usr/share/BackupPC/lib";
use BackupPC::Lib;
use BackupPC::jLib 0.4.0;  # Requires version >= 0.4.0
use File::Glob ':glob';
use Getopt::Long qw(:config no_ignore_case bundling);

my $bpc = BackupPC::Lib->new or die("BackupPC::Lib->new failed\n");
%Conf   = $bpc->Conf(); #Global variable defined in jLib.pm (do not use 'my')

my $TopDir = $Conf{TopDir};
$TopDir =~ s|/+|/|;
$TopDir =~ s|/*$|/|; #End with just one slash
my $compress = $Conf{CompressLevel};
my $pool = $compress > 0 ? "cpool" : "pool";
my $compare = $compress > 0 ? \&zcompare2 : \&jcompare;
my $file2md5 = $compress > 0 ? \&zFile2MD5 : \&File2MD5;
my $md5 = Digest::MD5->new;
my $MAXLINKS = $bpc->{Conf}{HardLinkMax};

#Option variables:
my $nodups;
my $outfile;
my $verbose=0;
#$dryrun=1;  #Global variable defined in jLib.pm (do not use 'my')
$dryrun=0;  #Global variable defined in jLib.pm (do not use 'my')


usage() unless( 
        GetOptions( 
                "dryrun|d!"        => \$dryrun,
                "nodups|n"         => \$nodups,    #Treat dups as errors
                "outfile|o=s"      => \$outfile,
                "verbose|v+"       => \$verbose,   #Verbosity (repeats allowed)
        ));

my ($OUT);
usage() unless defined $outfile; #--outfile is required
die "ERROR: '$outfile' already exists!\n" if -e $outfile;
open($OUT, '>', "$outfile") or
        die "ERROR: Can't open '$outfile' for writing!($!)\n";

chdir $TopDir;

my @partialbackups = glob("pc/*/NewFileList");
die("Error: Pool conflicts will occur if NewFileList present:\n          " .
        join("\n          ", @partialbackups) . "\n") if @partialbackups;

system("$bpc->{InstallDir}/bin/BackupPC_serverMesg status jobs >/dev/null 2>&1");
die "Dangerous to run when BackupPC is running!!!\n"
        unless ($? >>8) == 1;

my $total = 0;
my $errors = 0;
my $fixed = 0;
my $chaindups = 0;
my $norename = 0;
my $norenumber = 0;

scan_pool($pool);

printf("Total=%d Errors=%d [Fixed=%d, NotFixed=%d]%s\n",
           $total, $errors, $fixed, ($errors-$fixed), $dryrun ?" DRY-RUN" : "");
printf("Chaindups=%d NoRename=%d NoRenumber=%d\n",
           $chaindups, $norename, $norenumber);
exit;

#######################################################################
#Run through the pool looking for misnamed md5sum paths
sub scan_pool
{
        my ($fpool) = @_;
        my ($dh, @fstat);

        return unless glob("$fpool/[0-9a-f]"); #No entries in pool
        my @hexlist = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                       'a', 'b', 'c', 'd', 'e', 'f');
        my ($idir,$jdir,$kdir);
        foreach my $i (@hexlist) {
                print STDERR "\n**$fpool/$i: " if $verbose >=2;
                $idir = $fpool . "/" . $i . "/";
                foreach my $j (@hexlist) {
                        print STDERR "$j " if $verbose >=2;
                        $jdir = $idir . $j . "/";
                        foreach my $k (@hexlist) {
                                $kdir = $jdir . $k . "/";
                                unless(opendir($dh, $kdir)) {
                                        warn "Can't open pool directory: $kdir\n" if $verbose>=4;
                                        next;
                                }
                                #Sort directory entries so that chains are ordered lowest to
                                #highest - This preserves sequential order between source and
                                #target chains PLUS ensures that we fill holes correctly and
                                #most efficiently
                                my @entries = sort {poolname2number($a) cmp poolname2number($b)}
                                                    (readdir($dh));
                                close($dh);
                                warn "POOLDIR: $kdir (" . ($#entries-1) . " files)\n"
                                        if $verbose >=3;

                                my $chaindeletes = 0;
                                my $chainstart;
                                my $lastdigest='';
                                foreach (@entries) {
                                        next if /^\.\.?$/; # skip dot files (. and ..)

                                        my $origfile = ${kdir} . $_;
                                        unless(m|^([0-9a-f]+)(_[0-9]*)?|) {
                                                warn "ERROR: '$origfile' is not a valid pool entry\n";
                                                next;
                                        }
                                        $total++;
                                        my $origdigest = $1;
                                        my $newdigest = $file2md5->($bpc, $md5, $origfile, -1, $compress);
                                        if($newdigest eq "-1") {
                                                warn "ERROR: Can't calculate md5sum name for: $origfile\n";
                                                next;
                                        }
                                        if($newdigest ne $origdigest) {
                                                $errors++;
                                                if($origdigest ne $lastdigest) { #New chain
                                                        #So go back and renumber last chain to remove holes
                                                        renumber_pool_chain($chainstart, $chaindeletes)
                                                                if $chaindeletes > 0;
                                                        $lastdigest=$origdigest; #Reset to new chain base
                                                        $chaindeletes = 0;
                                                        $chainstart = $origfile; #lowest element of chain
                                                        #since we are sorting directory in chain order
                                                }
                                                if(fix_entry($origfile, $newdigest)==1) {$chaindeletes++}
                                        }
                                }
                                #Check in case chain still going when 'foreach' ran out of entries
                                renumber_pool_chain($chainstart, $chaindeletes)
                                        if $chaindeletes > 0;
                        }
                }
        }
}

#Rename/move pool chain entry $source to first open position
#in $digest chain if permitted. Renumber source chain as
#needed after the move
sub fix_entry
{
        my ($source, $digest) = @_;

        my $i=-1;
        my @dups = ();
        my $poolpath = my $poolbase = $bpc->MD52Path($digest,$compress);
        while( -f $poolpath ) { # Iterate through pool chain with same md5sum
                if((stat(_))[3] < $MAXLINKS &&
                   ! $compare->($source,$poolpath)) { #Matches existing pool entry
                        push(@dups,$i);
                }
                $poolpath = $poolbase . "_" . ++$i;
        }
        my $dups = @dups ? ' CHAINDUPS(' . join(',', @dups) . ')': '';
        $poolpath =~ m|^$TopDir/?(.*)|;
        my $target = $1;
#       print "$source $target [$errors/$total]$dups\n";

        if(@dups) {
                warn "WARN: $dups: $source->$target\n" if $verbose >=1;
                $chaindups++;
                if($nodups) { #Don't fix dups - no changes to pool
                        print $OUT "$source $target $dups\n";
                        return;
                }
        }

        if(-e $target || !jrename($source,$target)) { #Not renamed
                warn "ERROR: Can't rename: $source->$target\n" if $verbose >=1;
                print $OUT "$source $target NO_RENAME$dups\n";
                $norename++;
                return -1;
        }

#       unless(delete_pool_file($source)==1) { #Renamed but source chain not renumbered
#               warn "ERROR: Can't renumber after rename: $source --> $target\n"
#                       if $verbose >=1;
#               print $OUT "$source $target NO_RENUMBER$dups\n";
#               $norenumber++;
#               return -2;
#               }
        #Fixed without errors
        print $OUT "$source $target FIXED$dups\n";
        $fixed++;
        return 1;
}


sub usage
{
    print STDERR <<EOF;

usage: $0 [options] --outfile|-o <outfile>  
  Options:
   --dryrun|-d         Dry-run 
                       Negate with: --nodryrun
   --nodups|-n         Don\'t rename/remove if file with same contents found in
                       target chain (see below for details)
   --verbose|-v        Verbose (repeat for more verbosity)

  DESCRIPTION:
    Find and fix md5sum pool name errors in pool and cpool

  DETAILS:
    Recurses through pool and cpool trees to test if the md5sum name of each
    pool file is correct relative to the file data. If not, the program attempts
    to rename (i.e. move) it to its proper md5sum name.

    If there already are pool files with the new name, then move it to
    the end of the target chain. After removing, renumber the source
    chain (if needed) to fill in holes left by the move. Note that the relative
    ordering of each chain is preserved.

    If the contents of the file match the contents of any of the files in the
    target chain, note the duplicate suffix numbers.

    If the --nodups|-n flag is set then don\'t rename the pool file in this case
    and just note where it would have gone if there were no chain dups.

    Note: it is not generally an error to have two pool entries in the same
    chain with the same data (in fact, it occurs intentionally when you exceed
    MAXLINKS); it just may waste some space. My routine BackupPC_fixLinks.pl
    can correct such duplicates later if that is an issue.

    In any case, if all your misnumbering was consistent you won\'t have this
    situation anyway.

    <outfile> records all the changes made plus appends a status code:

    FIXED = pool file moved/renamed and original chain renumbered if needed.
    CHAINDUPS(n1,n2,...) = Signals duplicates in the target chain and lists
                their suffixes (-1 = no suffix). Whether or not the file was
                actually moved in this case (and hence whether the md5sum was
                fixed) depends on the value of the --nodups flag.
    NO_RENAME = Signals error in renaming/moving the pool file. The md5sum name
                was thus not corrected.
    NO_RENUMBER = The pool file was renamed/moved *but* error in renumbering the
                  source chain to fill in the hole left by the move.
EOF
exit(1)
}

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
