Subject: Re: [BackupPC-users] Backing up a BackupPC server
From: Les Mikesell <les AT futuresource DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 03 Jun 2009 20:39:07 -0500
Pieter Wuille wrote:
> On Wed, Jun 03, 2009 at 07:36:22PM -0400, Jeffrey J. Kosowsky wrote:
>> Holger Parplies wrote at about 23:45:35 +0200 on Wednesday, June 3, 2009:
>>  > Hi,
>>  > 
>>  > Peter Walter wrote on 2009-06-03 16:15:37 -0400 [Re: [BackupPC-users]
>>  > Backing up a BackupPC server]:
>>  > > [...]
>>  > > My understanding is that, if it were not for the
>>  > > hardlinks, rsync transfers to another server would be more
>>  > > feasible;
>>  > 
>>  > right.
>>  > 
>>  > > that processing the hardlinks requires significant cpu 
>>  > > resources, memory resources, and that access times are very slow, 
>>  > 
>>  > Memory: yes. CPU: I don't think so. Access times very slow? Well, the inodes
>>  > referenced from one directory are probably scattered all over the place, so
>>  > traversing the file tree (e.g. "find $TopDir -ls") is probably slower than
>>  > in "normal" directories. Or do you mean swapping slows down memory accesses
>>  > by several orders of magnitude?
>>  > 
>>  > > compared to processing ordinary files. Is my understanding correct? If 
>>  > > so, then what I would think of doing is (a) shutting down backuppc (b) 
>>  > > creating a "dump" file containing the hardlink metadata (c) backing up 
>>  > > the pooled files and the dump file using rsync (d) restarting backuppc. 
>>  > > I really don't need a live, working copy of the backuppc file system - 
>>  > > just a way to recreate it from a backup if necessary, using an "undump" 
>>  > > program that recreated the hardlinks from the dump file. Is this 
>>  > > approach feasible?
>>  > 
>>  > Yes. I'm just not certain how you would test it. You can undoubtedly
>>  > restore your pool to a new location, but apart from browsing a few random
>>  > files, how would you verify it? Maybe create a new "dump" and compare the
>>  > two ...
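
A quick sanity check that doesn't even need the dump, since the pooling is
all hard links: du counts a multiply-linked file only once per run, so if the
links didn't survive the copy its totals balloon (paths here are examples):

    du -sh $TopDir          # original
    du -sh /new/TopDir      # copy: should come out roughly the same; every
                            # pc/ file whose link to the pool broke gets
                            # counted a second time

Comparing two freshly generated dumps, as suggested, would still be needed to
catch the finer cases (a link pointing at the wrong pool file).
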
>>  > 
>>  > Have you got the resources to try this? I believe I've got most of the code
>>  > we'd need. I'd just need to take it apart ...
>>  > 
>>
>> Holger, one thing I don't understand is that if you create a dump
>> table associating inodes with pool file hashes, aren't we back in the
>> same situation as using rsync -H? I.e., for large pool sizes, the
>> table ends up using all memory and bleeding into swap, which means
>> that lookups start taking forever and the system starts to
>> thrash. Specifically, I would assume that rsync -H is basically
>> constructing a similar table when it deals with hard links, though
>> perhaps there are some savings in this case since we know something
>> about the structure of the BackupPC file data -- i.e., we know that
>> all the hard links have as one of their links a link to a pool file.
>>
> [...]
>> This would allow the entire above algorithm to be done in O(m log m)
>> time, with the only memory-intensive steps being those required to sort
>> the pool and pc tables. However, since sorting is a well-studied
>> problem, we should be able to use memory-efficient algorithms for
>> that.
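
A minimal, untested sketch of that sorted-table idea, using the usual
external-sort tools (GNU find/sort/join assumed, compressed pool, and no
whitespace in any of the paths; the file names are made up):

    # inode -> pool file name (the name encodes the partial md5sum)
    find $TopDir/cpool -type f -printf '%i %P\n' | sort -k1,1 > pool.byinode
    # inode -> path of every multiply-linked file under pc/
    find $TopDir/pc -type f -links +1 -printf '%i %P\n' | sort -k1,1 > pc.byinode
    # join on the inode number: each line maps a pc/ path to the pool file
    # it should be re-linked to on the destination
    join pool.byinode pc.byinode | awk '{print $2, $3}' > linkmap

GNU sort does its work in temp files rather than RAM, which is the point;
restoring is then mostly "ln cpool/$pool pc/$path" line by line, after the
pc/ directory tree and the non-linked files have been copied normally.
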
> 
> You didn't use the knowledge that the files in the pool have names that
> correspond (apart from a few hash chains) to the partial md5sums of the
> data in them, like BackupPC_tarPCcopy does. I've never used/tested this
> tool, but if I understand it correctly, it builds a tar file that contains
> hard-link entries pointing into the pool directory, instead of the actual data.
> This, combined with a verbatim copy of the pool directory itself, should
> suffice to copy the entire topdir in O(m+n) time and O(1) memory (since a
> lookup of which pool file a given hardlinked file in a pc/ dir points to
> can be done in O(1) time and space, except for the occasional hash chain).
> In practice, however, doing the copy at the block level will be significantly
> faster still, because it avoids the constant seeking.
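
For reference, the tarPCcopy route would look something like this (untested;
assumes the same $TopDir path on both machines, a compressed pool, and
"newserver" is just a placeholder - check the exact script name and path in
your BackupPC bin directory):

    # pool first; plain rsync is fine, pool files aren't linked to each other
    rsync -a $TopDir/cpool/ newserver:$TopDir/cpool/
    # then rebuild pc/ on the far side as hard links into that pool
    BackupPC_tarPCcopy $TopDir/pc | ssh newserver "cd $TopDir/pc && tar -xPf -"
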
> 
>> I would be curious to know how, in the real world, the time (and
>> memory usage) to copy over a large (say multi-terabyte)
>> BackupPC topdir varies for the following methods:
>>
>> 1. cp -ad
>> 2. rsync -H
>> 3. Copy using a single table of pool inode numbers
>> 4. Copy using a sorted table of pool inode numbers and pc hierarchy
>>    inode numbers
> Add:
>   5. copy the pooldir and use tarPCcopy for the rest
>   6. copy the blockdevice
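
For whichever of those get tried, wrapping the copy in GNU time gives a rough
picture of both numbers at once (/mnt/copy is just an example destination):

    /usr/bin/time -v cp -ad $TopDir /mnt/copy/TopDir        # method 1
    /usr/bin/time -v rsync -aH $TopDir/ /mnt/copy/TopDir/   # method 2
    # "Elapsed (wall clock) time" and "Maximum resident set size" are the
    # interesting lines in the -v output
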

And for extra points, figure out which variations could be adapted to
incremental updates, as you might want to do to keep an offsite copy in
sync.  The block-device approach would probably require ZFS with its
snapshot send/receive functions.  The tarPCcopy approach would need to
catch all files under directories newer than the previous run - and
maybe track the current directory contents to handle deletions, which is
one of the GNU tar extensions.
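
For the ZFS route, the incremental part is essentially built in - something
along these lines (dataset and host names are made up, and untested):

    # initial full copy
    zfs snapshot tank/backuppc@2009-06-03
    zfs send tank/backuppc@2009-06-03 | ssh offsite "zfs receive backup/backuppc"
    # later runs only ship the blocks changed since the previous snapshot
    zfs snapshot tank/backuppc@2009-06-04
    zfs send -i tank/backuppc@2009-06-03 tank/backuppc@2009-06-04 \
        | ssh offsite "zfs receive backup/backuppc"
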

-- 
    Les Mikesell
     lesmikesell AT gmail DOT com




_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/