Subject: Re: [BackupPC-users] Advice on creating duplicate backup server
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Mon, 08 Dec 2008 09:37:16 -0500
Stuart Luscombe wrote at about 10:02:04 +0000 on Monday, December 8, 2008:
 > Hi there,
 > 
 >  
 > 
 > I've been struggling with this for a little while now so I thought it about
 > time I got some help!
 > 
 >  
 > 
 > We currently have a server running BackupPC v3.1.0 which has a pool of
 > around 3TB and we've got to a stage where a tape backup of the pool is
 > taking 1-2 weeks, which isn't effective at all.  The decision was made to
 > buy a server that is an exact duplicate of our current one and have it
 > hosted in another building, as a 2 week old backup isn't ideal in the event
 > of a disaster.
 > 
 >  
 > 
 > I've got the OS (CentOS) installed on the new server and have installed
 > BackupPC v3.1.0, but I'm having problems working out how to sync the pool
 > with the main backup server. I managed to rsync the cpool folder without any
 > real bother, but the pool folder is the problem: if I try an rsync it
 > eventually dies with an 'out of memory' error (the server has 8GB), and a cp
 > -a didn't seem to work either, as the server filled up, presumably because it's
 > not copying the hard links correctly?
 > 
 >  
 > 
 > So my query here really is: am I going the right way about this? If not,
 > what's the best method to take so that, say, once a day the duplicate server
 > gets updated?
 > 
 >  
 > 
 > Many Thanks

It just hit me that, given the known architecture of the pool and cpool
directories, it should be possible to come up with a scheme that
works better than either rsync (which can choke on too many hard
links) or 'dd' (which has no notion of incremental copying and requires
you to resize the filesystem, etc.).

My thought is as follows:
1. First, recurse through the pc directory to create a list of
   files/paths and the corresponding pool links.
   Note that finding the pool links can be done in one of several
   ways:
   - Method 1: Create a sorted list of pool files (which should be
     significantly shorter than the list of all files due to the
     nature of pooling, and therefore require less memory than rsync)
     and then look up the links.
   - Method 2: Calculate the md5sum file path of the file to determine
     where it is in the pool, disambiguating among chain duplicates
     where necessary.
   - Method 3: Not possible yet, but it would be if the md5sum
     file paths were appended to compressed backups. This would add very
     little to the storage but would allow you to very easily
     determine the right link: you could just read the link
     path from the file.

  Files with only 1 link (i.e. no hard links) would be tagged for
  straight copying.

2. Then rsync *just* the pool -- this should be no problem since by
   definition there are no hard links within the pool itself.

3. Finally, run through the list generated in #1 to create the new pc
   directory by creating the necessary links (and, for files with no
   hard links, just copying/rsyncing them). A rough sketch of steps 1
   and 3 follows below.
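
To make steps 1 and 3 a bit more concrete, here is a rough, *untested*
sketch of what I have in mind, assuming Method 2 is used to find the
pool links (via the zFile2MD5 routine at the bottom of this mail plus
the standard MD52Path routine from Lib.pm). The manifest format and the
subroutine names (build_manifest, replay_manifest) are just made up for
illustration, and collision chains are glossed over:

use File::Find;
use Digest::MD5;
use BackupPC::Lib;

# Step 1: walk the pc directory and write a manifest with one line per file:
#   "<pc path>\t<pool path>"  for multiply-linked (i.e. pooled) files
#   "<pc path>\tCOPY"         for files with a link count of 1
sub build_manifest
{
    my ($bpc, $pcdir, $manifest, $compress) = @_;
    my $md5 = Digest::MD5->new;
    open(my $out, ">", $manifest) or die "Can't write $manifest: $!";
    find({ no_chdir => 1, wanted => sub {
        return unless -f $_;
        my $nlinks = (lstat($_))[3];
        if ($nlinks < 2) {                 # not pooled -- tag for straight copy
            print $out "$_\tCOPY\n";
            return;
        }
        # Method 2: compute the digest and derive the candidate pool path
        # (collision chains -- the _0, _1, ... suffixes -- would still
        # need to be disambiguated here).
        my $digest = zFile2MD5($bpc, $md5, $_, 0, $compress);
        return if $digest eq "-1";         # couldn't read it -- skip
        my $pool   = $bpc->MD52Path($digest, $compress);
        print $out "$_\t$pool\n";
    }}, $pcdir);
    close($out);
}

# Step 3: on the duplicate server, after the pool itself has been rsynced,
# re-create the hard links listed in the manifest.
sub replay_manifest
{
    my ($manifest) = @_;
    open(my $in, "<", $manifest) or die "Can't read $manifest: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($pcfile, $pool) = split(/\t/, $line);
        next if $pool eq 'COPY';           # copied/rsynced separately
        link($pool, $pcfile)
            or warn "Can't link $pcfile -> $pool: $!\n";
    }
    close($in);
}

Of course replay_manifest assumes the pc directory tree itself has
already been created on the duplicate and that the singly-linked files
were brought over by the straight-copy pass; a production version would
also want to verify the pool link targets rather than trust the digest
blindly.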

The above could also be easily adapted to allow for "incremental" syncing.
Specifically, in #1, you would use rsync to just generate a list of
*changed* files in the pc directory. In #2, you would continue to use
rsync to just sync *changed* pool entries. In #3 you would only act on
the shortened incremental sync list generated in #1.
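
For the list of changed files in #1, something like the following
(again untested, and the rsync options are from memory) could enumerate
the changed pc files without actually transferring anything -- note
that no -H is needed for a dry run, so rsync's memory use stays modest.
The host name and paths are just placeholders:

sub changed_pc_files
{
    my ($srcpc, $desthost, $destpc) = @_;
    # -a archive, -i itemize changes, -n dry run (nothing is transferred)
    my @out = qx{rsync -ain $srcpc/ $desthost:$destpc/};
    chomp @out;
    # Lines starting with ">f" are regular files that would be updated;
    # the file name follows the itemize string and a space.
    return map { (split(' ', $_, 2))[1] } grep { /^>f/ } @out;
}

The resulting (relative) paths would then be run through the same
digest/link logic as build_manifest() above, while #2 stays a plain
(non -H) rsync of the cpool.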

The more I think about it, the more I LIKE the idea of appending the
md5sum file paths to compressed pool files (Method #3), since this
would make the above very fast. (Note that if I were implementing this, I
would also include the chain number in cases where there are multiple
files with the same md5sum path, and of course BackupPC_nightly would
then have to adjust this any time it changed around the chain
numbering.)

Even without the above, Method #1 would still be much less memory
intensive than rsync, and Method #2, while potentially a little slow,
would require very little memory and wouldn't be nearly that bad if
you are doing incremental syncs.

------------------------------------------------------------------
Just as an FYI, if anyone wants to implement Method #2, here is the
routine I use to generate the md5sum digest (from which the pool file
path follows) for a (compressed) file (note that it is based on the
analogous uncompressed routine, File2MD5, in Lib.pm).

use BackupPC::Lib;
use BackupPC::Attrib;
use BackupPC::FileZIO;

use constant _128KB               => 131072;
use constant _1MB                 => 1048576;

# Compute the MD5 digest of a compressed file. This is the compressed-
# file version of the Lib.pm function File2MD5.
# For efficiency we don't use the whole file for big files:
#   - for files <= 256K we use the file size and the whole file.
#   - for files <= 1M we use the file size, the first 128K and
#     the last 128K.
#   - for files > 1M, we use the file size, the first 128K and
#     the 8th 128K (ie: the 128K up to 1MB).
# See the documentation for a discussion of the tradeoffs in
# how much data we use and how many collisions we get.
#
# Returns the MD5 digest (a hex string).
#
# If $filesize < 0 then always recalculate the size of the file by fully
#    decompressing it
# If $filesize = 0 then first try to read the size from the corresponding
#    attrib file (if it exists); if that doesn't work then recalculate
# If $filesize > 0 then use that as the size of the file
#
# Note: printerr() is a logging helper from the surrounding script (not
# shown), and %Conf is the usual BackupPC config hash.
sub zFile2MD5
{
    my ($bpc, $md5, $name, $filesize, $compresslvl) = @_;

    my $fh;
    my $rsize;
    my $totsize;

    $compresslvl = $Conf{CompressLevel} unless defined $compresslvl;
    unless (defined ($fh = BackupPC::FileZIO->open($name, 0, $compresslvl))) {
        printerr "Can't open $name\n";
        return -1;
    }

    my $datafirst = my $datalast = '';
    my @data = ('', '');
    # First try to read up to the first 128K (131072 bytes)
    if ( ($totsize = $fh->read(\$datafirst, _128KB)) < 0 ) {
        printerr "Can't read & decompress $name\n";
        return -1;
    }
    elsif ($totsize == _128KB) { # Read up to the 1st MB
        my $i = 0;
        # Read in up to 1MB (_1MB), 128K at a time, alternating between
        # 2 data buffers
        while ( (($rsize = $fh->read(\$data[(++$i) % 2], _128KB)) == _128KB)
                && ($totsize += $rsize) < _1MB ) {}
        $totsize += $rsize if $rsize < _128KB; # Add back in a partial read
        $datalast = substr($data[($i - 1) % 2], $rsize, _128KB - $rsize)
                  . substr($data[$i % 2], 0, $rsize);
    }
    $filesize = $totsize if $totsize < _1MB; # We already know the size
                                             # because we read it all
    if ($filesize == 0) { # Try to find the size from the attrib file
        $filesize = get_attrib_value($name, "size");
        warn "Can't read size of $name from attrib file so calculating manually\n"
            unless defined $filesize;
    }
    unless ($filesize > 0) { # Continue reading to calculate the size
        while (($rsize = $fh->read(\$data[0], _128KB)) > 0) {
            $totsize += $rsize;
        }
        $filesize = $totsize;
    }
    $fh->close();

    $md5->reset();
    $md5->add($filesize);
    $md5->add($datafirst);
    ($datalast eq '') || $md5->add($datalast);
    return $md5->hexdigest;
}

# Returns the value of attrib $key for $fullfilename (full path).
# If the attrib file is not present, or there is no entry for
# the specified key for the given file, then return 'undef'.
sub get_attrib_value
{
    my ($fullfilename, $key) = @_;
    $fullfilename =~ m{(.+)/f(.+)};  # $1=dir; $2=file

    return undef if read_attrib(my $attr, $1) < 0;
    return $attr->{files}{$2}{$key}; # Note: this returns undef if the key
                                     # is not present
}

# Reads in the attrib file for directory $_[1] (and optional alternative
# attrib file name $_[2]) and stores it in the hashref $_[0] passed to
# the function.
# Returns -1 and a blank $_[0] hash ref if the attrib file doesn't already
# exist (not necessarily an error).
# Dies if the attrib file exists but can't be read in.
# Note: attrib() is a small helper (not shown here) that returns the path
# of the attrib file for the given directory/alternative name.
sub read_attrib
{ # Note: $_[0] = hash reference to attrib object
    $_[0] = BackupPC::Attrib->new({ compress => $Conf{CompressLevel} });
    return -1 unless -f attrib($_[1], $_[2]);  # Not necessarily an error
                                               # because the dir may be empty
    die "Error: Cannot read attrib file: " . attrib($_[1], $_[2]) . "\n"
        unless $_[0]->read($_[1], $_[2]);
    return 1;
}
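
For completeness, calling zFile2MD5 would look something like this
(illustrative only -- the pc path is made up, and a real script would
still have to check the _0, _1, ... collision-chain suffixes against
the candidate pool file):

use BackupPC::Lib;
use Digest::MD5;

my $bpc = BackupPC::Lib->new()
    or die "Can't create BackupPC::Lib object\n";
my %Conf = $bpc->Conf();
my $md5  = Digest::MD5->new;

# Example pc file (mangled path) -- substitute a real one.
my $file = "/var/lib/backuppc/pc/somehost/123/f%2fetc/fpasswd";

# Passing -1 as the size forces it to be recalculated by decompression,
# so the attrib-file fallback (which needs the global %Conf and the
# attrib() helper) isn't exercised in this standalone example.
my $digest = zFile2MD5($bpc, $md5, $file, -1, $Conf{CompressLevel});
die "Couldn't compute the digest for $file\n" if $digest eq "-1";

my $pool = $bpc->MD52Path($digest, $Conf{CompressLevel});
print "$file should be linked to $pool (or one of its chain entries)\n";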
