Subject: Re: [BackupPC-users] Backing up a BackupPC server
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 03 Jun 2009 23:43:11 -0400
Les Mikesell wrote at about 20:39:07 -0500 on Wednesday, June 3, 2009:
 > Pieter Wuille wrote:
 > > On Wed, Jun 03, 2009 at 07:36:22PM -0400, Jeffrey J. Kosowsky wrote:
 > >> Holger Parplies wrote at about 23:45:35 +0200 on Wednesday, June 3, 2009:
 > >>  > Hi,
 > >>  > 
 > >>  > Peter Walter wrote on 2009-06-03 16:15:37 -0400 [Re: [BackupPC-users]
 > >>  > Backing up a BackupPC server]:
 > >>  > > [...]
 > >>  > > My understanding is that, if it were not for the hardlinks, rsync
 > >>  > > transfers to another server would be more feasible;
 > >>  > 
 > >>  > right.
 > >>  > 
 > >>  > > that processing the hardlinks requires significant CPU and memory
 > >>  > > resources, and that access times are very slow,
 > >>  > 
 > >>  > Memory: yes. CPU: I don't think so. Access times very slow? Well,
 > >>  > the inodes referenced from one directory are probably scattered all
 > >>  > over the place, so traversing the file tree (e.g. "find $TopDir -ls")
 > >>  > is probably slower than in "normal" directories. Or do you mean
 > >>  > swapping slows down memory accesses by several orders of magnitude?
 > >>  > 
 > >>  > > compared to processing ordinary files. Is my understanding correct?
 > >>  > > If so, then what I would think of doing is (a) shutting down
 > >>  > > backuppc, (b) creating a "dump" file containing the hardlink
 > >>  > > metadata, (c) backing up the pooled files and the dump file using
 > >>  > > rsync, and (d) restarting backuppc. I really don't need a live,
 > >>  > > working copy of the backuppc file system - just a way to recreate
 > >>  > > it from a backup if necessary, using an "undump" program that
 > >>  > > recreates the hardlinks from the dump file. Is this approach
 > >>  > > feasible?
 > >>  > 
 > >>  > Yes. I'm just not certain how you would test it. You can undoubtedly
 > >>  > restore your pool to a new location, but apart from browsing a few
 > >>  > random files, how would you verify it? Maybe create a new "dump" and
 > >>  > compare the two ...
 > >>  > 
 > >>  > Have you got the resources to try this? I believe I've got most of
 > >>  > the code we'd need. I'd just need to take it apart ...
 > >>  > 
 > >>
 > >> Holger, one thing I don't understand is that if you create a dump
 > >> table associating inodes with pool file hashes, aren't we back in the
 > >> same situation as using rsync -H? I.e., for large pool sizes, the
 > >> table ends up using all memory and bleeding into swap, which means
 > >> that lookups start taking forever, causing the system to
 > >> thrash. Specifically, I would assume that rsync -H basically
 > >> constructs a similar table when it deals with hard links, though
 > >> perhaps there are some savings in this case since we know something
 > >> about the structure of the BackupPC file data -- i.e., we know that
 > >> all the hard links have as one of their links a link to a pool file.
 > >>
 > > [...]
 > >> This would allow the entire above algorithm to be done in O(m log m)
 > >> time, with the only memory-intensive steps being those required to
 > >> sort the pool and pc tables. However, since sorting is a well-studied
 > >> problem, we should be able to use memory-efficient algorithms for
 > >> that.
 > > 
 > > You didn't use the knowledge that the files in the pool have names that
 > > correspond (apart from a few hashchains) to the partial md5sums of the
 > > data in them, like BackupPC_tarPCcopy does. I've never used/tested this
 > > tool, but if I understand it correctly, it builds a tar file that
 > > contains hardlink references to the pool directory, instead of the
 > > actual data. This, combined with a verbatim copy of the pool directory
 > > itself, should suffice to copy the entire topdir in O(m+n) time and
 > > O(1) memory (since a lookup of what pool file a certain hardlinked
 > > file in a pc/ dir points to can be done in O(1) time and space, except
 > > for a sporadic hash chain). In practice, however, doing the copy at
 > > the block level will be significantly faster still, because no
 > > continuous seeking is required.
 > > 
 > >> I would be curious to know how, in the real world, the time (and
 > >> memory usage) needed to copy over a large (say multi-terabyte)
 > >> BackupPC topdir varies for the following methods:
 > >>
 > >> 1. cp -ad
 > >> 2. rsync -H
 > >> 3. Copy using a single table of pool inode numbers
 > >> 4. Copy using a sorted table of pool inode numbers and pc hierarchy
 > >>    inode numbers
 > > Add:
 > >   5. copy the pooldir and use tarPCcopy for the rest
 > >   6. copy the blockdevice
 > 
 > And for extra points, figure out which variations could be adapted to 
 > incremental updates as you might want to do to keep an offsite copy in 
 > sync.  The blockdevice approach would probably require ZFS with its 
 > snapshot send/receive functions.  The tarPCcopy approach would need to 
 > catch all files under directories newer than the previous run - and 
 > maybe track current directory contents for deletions, which is one of 
 > the GNU tar extensions.

You need to be careful about hash chain renumbering, which could mess
things up if you are only looking at file names and file modification
dates. Would GNU tar handle this properly without having to run through
all the hard links? Some of the simpler methods would fail here without
additional logic.
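
To make that concrete: a renumbered chain member keeps its contents and
digest but changes its name, so a name-based comparison reports a
delete plus an add even though no data changed. Below is a rough,
untested Python sketch of grouping pool files by digest base; the pool
naming assumption (a bare digest name, with _0, _1, ... appended on
collisions) is my reading of the 3.x pool layout, not something lifted
from the BackupPC source:

    import os
    import re
    from collections import defaultdict

    # Assumed 3.x pool naming: chain members share a digest base name,
    # with _0, _1, ... appended on collisions.
    CHAIN_SUFFIX = re.compile(r'_\d+$')

    def chain_base(name):
        """Strip a trailing _<n> chain suffix, if present."""
        return CHAIN_SUFFIX.sub('', name)

    def pool_chains(pool_dir):
        """Map digest base -> paths of all chain members with that digest."""
        chains = defaultdict(list)
        for dirpath, _dirs, names in os.walk(pool_dir):
            for name in names:
                chains[chain_base(name)].append(os.path.join(dirpath, name))
        return chains

Within one digest group, matching files by content identifies members
that merely changed their chain number, which a pure name/mtime
comparison would misreport.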

For the inode table methods that Holger and I have been talking about,
you might be able to do something like the following (a rough Python
sketch follows the list):

1. Run rsync (without the -H) in the --dry-run mode to identify which
   files in the pool and pc hierarchy have changed.

2. Make the deletions in the target pool and pc directories as per the
   output of the dry-run rsync.

3. Generate a table of all the inodes for the changed/added files in
   the output of rsync (both in the pool and pc hierarchies). Add to
   this table any additional inodes in the pc directory that link to
   changed/added files in the pool (this is needed since a renumbering
   of a pool hash chain affects all pc files linked to it)
   [You can keep the pool and pc hierarchy tables either separate or
   combined]

4. Sort the table(s)

5. Copy the pool files and make the links indicated by the table(s).
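
As promised above, here is a rough sketch of steps 1, 3, and 4 in
Python (step 2 is just applying the deletions, and step 5 is plain
copying and linking). It is untested; the rsync flags are real, but
the --itemize-changes parsing is simplified and the paths are
placeholders:

    import os
    import subprocess

    def rsync_dry_run(src, dest):
        """Step 1: dry-run rsync (no -H) listing changed/added files
        and to-be-deleted files."""
        out = subprocess.run(
            ['rsync', '-a', '--delete', '--dry-run', '--itemize-changes',
             src + '/', dest + '/'],
            capture_output=True, text=True, check=True).stdout
        changed, deleted = [], []
        for line in out.splitlines():
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            flags, path = parts
            if flags.startswith('*deleting'):
                deleted.append(path)
            elif flags[0] in '<>c':      # transferred or created entries
                changed.append(path)
        return changed, deleted

    def inode_table(top, paths):
        """Step 3, first half: (inode, path) pairs for changed files."""
        return [(os.lstat(os.path.join(top, p)).st_ino, p) for p in paths]

    changed, deleted = rsync_dry_run('/var/lib/backuppc',
                                     'offsite:/var/lib/backuppc')
    # Step 3, second half, would also walk the pc/ hierarchy and pick up
    # any inode matching a changed pool inode; step 4 just sorts:
    table = sorted(inode_table('/var/lib/backuppc', changed))

Sorting both tables and then walking them in a single merge pass keeps
the memory footprint proportional to the number of changed files
rather than to the whole pool.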

Note that if bandwidth is limiting, you could potentially park the
to-be-deleted pool files (rather than removing them) and calculate
their md5sums, in case the deletion is actually just due to hash chain
renumbering. Then, in step #5, if a pool file to be copied has a
modification date older than the last backup, you can calculate its
md5sum, match it against a parked member of the old pool, and simply
rename that file accordingly (see the sketch below).
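
Sketched below, reusing chain_base() from the earlier snippet. Note
that partial_md5() here is a hypothetical stand-in: the real pool
digest is BackupPC::Lib::File2MD5 in the Perl source, and you would
have to reproduce it exactly for the name matching to work:

    import hashlib
    import os
    import re

    CHAIN_SUFFIX = re.compile(r'_\d+$')

    def chain_base(name):                # as in the earlier snippet
        return CHAIN_SUFFIX.sub('', name)

    def partial_md5(path):
        """Hypothetical stand-in for BackupPC's partial-file pool digest
        (BackupPC::Lib::File2MD5); NOT the real algorithm."""
        h = hashlib.md5()
        with open(path, 'rb') as f:
            h.update(f.read(128 * 1024))
        return h.hexdigest()

    def rename_instead_of_recopy(pool_dir, to_delete, to_copy):
        """Park to-be-deleted pool files, indexed by content digest; when
        a 'new' pool file's name (which encodes its digest) matches a
        parked file, rename it instead of re-transferring the data."""
        parked = {partial_md5(os.path.join(pool_dir, p)):
                  os.path.join(pool_dir, p) for p in to_delete}
        still_to_copy = []
        for rel in to_copy:
            digest = chain_base(os.path.basename(rel))
            old = parked.pop(digest, None)
            if old is not None:
                os.rename(old, os.path.join(pool_dir, rel))  # renumbered
            else:
                still_to_copy.append(rel)                    # truly new
        for leftover in parked.values():  # genuinely deleted pool files
            os.remove(leftover)
        return still_to_copy

The modification-date check from the text would go in front of the
digest computation, to avoid hashing files that cannot be renumbered
chain members.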

[I imagine there are other similar optimizations that I have missed...]

Operationally, this requires:
- A dry-run of rsync (without hard links) on the full pool and pc
  directories (this should be relatively fast)
- Deletions - fast
- Another run through the pc hierarchy to find the inodes linked to
  changed/added pool files
- A sort of the changed files - O(k log k), where k = number of
  changed/added files
- A copy of the changed/added pool files
- Links for the changed pc hierarchy files

I imagine this would be significantly faster than a straightforward
tarPCcopy approach since tracking of the hard links is simplified.
