Subject: Re: [BackupPC-users] Advice on creating duplicate backup server
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: Holger Parplies <wbppc AT parplies DOT de>
Date: Tue, 09 Dec 2008 00:49:27 -0500
Holger Parplies wrote at about 04:10:17 +0100 on Tuesday, December 9, 2008:
 > Hi,
 > 
 > Jeffrey J. Kosowsky wrote on 2008-12-08 09:37:16 -0500 [Re: [BackupPC-users] 
 > Advice on creating duplicate backup server]:
 > > 
 > > It just hit me that given the known architecture of the pool and cpool
 > > directories shouldn't it be possible to come up with a scheme that
 > > works better than either rsync (which can choke on too many hard
 > > links) or 'dd' (which has no notion of incremental and requires you
 > > to resize the filesystem etc.).
 > 
 > yes, that hit someone on the list several years ago (I don't remember the
 > name, sorry). I implemented the idea he sketched (well, more or less, there's
 > some work left to make it really useful).
 > 
 > > My thought is as follows:
 > > 1. First, recurse through the pc directory to create a list of
 > >    files/paths and the corresponding pool links.
 > >    Note that finding the pool links can be done in one of several
 > >    ways:
 > >    - Method 1: Create a sorted list of pool files (which should be
 > >      significantly shorter than the list of all files due to the
 > >      nature of pooling and therefore require less memory than rsync)
 > >      and then look up the links.
 > 
 > Wrong. You need one entry per inode that points to an arbitrary path (the
 > first one you copy). Every file(*) is in the pool, meaning a list of all
 > pool files is exactly what you need. A different way to look at it: every
 > file with a link count > 1 is a pooled file, and it's these files that
 > cause rsync&co problems, not single link files. (Well, yes, rsync pre-3
 > needed a complete list of all files.)
OK. I had assumed (wrongly) that rsync needed to keep track of each
file that is hard-linked, not just one copy.
Still, there are some savings in knowing that you can find your one
copy in the pool, so you don't have to look through the pc tree at all.
 > 
 > (*) Files that are not in the pool:
 >     1.) 0-byte files. They take up no file system blocks, so pooling them
 >         saves only inodes. Not pooling them makes things simpler.
 >     2.) log files (they get appended to; that would make pooling somewhat
 >         difficult; besides, what chance is there of a pool hit?),
 >         backups files (including backups.old)
 >     attrib files are pooled, contrary to popular belief, and that makes
 >     sense, because they are often identical with the same attrib file from
 >     the previous backup(s).
Yes. I am aware of this from the routines I wrote to check/fix pool
consistency and missing links to the pool.
 > 
 > 
 > The algorithm I implemented is somewhat similar:
 > 1.) Walk pool/, cpool/ and pc/, printing information on the files and
 >     directories to a file (which will be quite large; by default I put it
 >     on the destination pool FS, because there should be large amounts of
 >     space there).
 > 2.) Sort the file with the 'sort' command. The lines in the file are
 >     designed such that they will be sorted into a meaningful order:
 >     - directories first, so I can create them and subsequently not worry
 >       about whether the place I want to copy/link a file to already exists
 >       or not
 >     - files next, sorted by inode number, with the (c)pool file preceding
 >       its pc/ links.
 >       The consequence is that I get all references to one inode on adjacent
 >       lines. The first time, I copy the file. For the repetitions, I link to
 >       the first copy. All I need to keep in memory is something like one line
 >       from the file list, one "previous inode number", one "file name of
 >       previous inode".
 >     'sort' handles huge files quite nicely, but it seems to create large
 >     (amounts of) files under /tmp, possibly under $TMPDIR if you set that
 >     (not sure). You need to make sure you've got the space, but if you're
 >     copying a multi-GB/TB pool, you probably have. My guess is that the
 >     necessary amount of space roughly equals the size of the file I'm
 >     sorting.
 > 3.) Walk the sorted file, line by line, creating directories and copying
 >     files (with File::Copy::cp, but I plan to change that to PoolWrite, so
 >     I can add (part of) one pool to an existing second pool, or something
 >     that communicates over TCP/IP, so I can copy to a different machine)
 >     and linking files (with Perl function link()).
 >     In theory, a pool could also be compressed or uncompressed on the fly
 >     (uncompressed for copying to zfs, for instance).

Yes... I was thinking along very similar lines, though you have clearly
fleshed it out and coded it in more detail. I would love to see the code
when it's stable, and to offer feedback or testing if that would be helpful.
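
For what it's worth, here is roughly how I pictured the first pass and the
sort. None of this is your code; the list format, the P/Q tags (so that a
(c)pool name sorts ahead of its pc/ links for the same inode), and the file
locations are all made up for illustration:

    use strict;
    use warnings;
    use File::Find;

    my $topdir = "/var/lib/backuppc";   # adjust to your $TopDir
    open my $out, ">", "/tmp/pool-filelist" or die $!;

    # Emit one sortable line per directory entry: "D <path>" for
    # directories (these sort first), "F <padded inode> <tag> <path>" for
    # files, with tag P for (c)pool and Q for pc/ so the pool name comes
    # first within each inode group.
    for my $area ([ "P", "$topdir/pool" ], [ "P", "$topdir/cpool" ],
                  [ "Q", "$topdir/pc" ]) {
        my ($tag, $root) = @$area;
        next unless -d $root;
        find(sub {
            my @st = lstat $_;
            return unless @st;
            if (-d _) {
                print $out "D $File::Find::name\n";
            } elsif (-f _) {
                printf $out "F %020u %s %s\n", $st[1], $tag, $File::Find::name;
            }
        }, $root);
    }
    close $out;
    # afterwards: sort /tmp/pool-filelist > /tmp/pool-filelist.sorted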
 > 
 > 
 > Once again, because people seem to be determined to miss the point: it's
 > *not* processing by sorted inode numbers in order to save disk seeks that
 > is the point, it's the fact that the 'link' system call takes two paths
 > 
 >     link $source_path, $dest_path; # to use Perl notation
 > 
 > while the 'stat' system call gives you only an inode number. To link a
 > filename to a previously copied inode, you need to know the name you
 > copied it to. A general purpose tool can't know when it will need the
 > information, so it needs to keep information on all inodes with link
 > count > 1 it has encountered. You can keep a mapping of
 > inode_number->file_name in memory for a few thousand files, but not for
 > hundreds of millions. By sorting the list by inode number, I can be sure
 > that I'll never need the info for one inode again once I've reached the
 > next inode, so I only have to keep info for one file in memory, regardless
 > of how many I'm copying. The difficult part is now the 'sort', but, again,
 > the 'sort' command is good at handling huge files - probably without limit
 > to the file size.

Exactly.
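
To make the memory argument concrete, here is a minimal sketch of how I
picture the copy/link pass over a sorted list like the one above (the list
format and paths are my own invention, and ownership/timestamps and the
chown()s you mention are deliberately ignored):

    use strict;
    use warnings;
    use File::Copy qw(cp);
    use File::Path qw(mkpath);

    my ($src_root, $dst_root) = ("/var/lib/backuppc", "/mnt/newpool");
    my ($prev_ino, $prev_dst) = (-1, undef);

    # The first line seen for an inode gets copied; every later line for
    # the same inode becomes a hard link to that first copy. Only one
    # (inode, destination path) pair is ever held in memory.
    open my $in, "<", "/tmp/pool-filelist.sorted" or die $!;
    while (<$in>) {
        chomp;
        if (/^D (.*)$/) {
            (my $dst = $1) =~ s/^\Q$src_root\E/$dst_root/;
            mkpath($dst) unless -d $dst;
        } elsif (/^F (\d+) \S (.*)$/) {
            my ($ino, $src) = ($1, $2);
            (my $dst = $src) =~ s/^\Q$src_root\E/$dst_root/;
            if ($ino == $prev_ino) {
                link $prev_dst, $dst or warn "link $dst: $!";
            } else {
                cp($src, $dst) or warn "copy $src: $!";
                ($prev_ino, $prev_dst) = ($ino, $dst);
            }
        }
    }
    close $in;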
 > 
 > 
 > So, what's the problem? Well, I used it once, because I needed to copy a
 > pool. It seemed to Work For Me (tm), but I'm unsure how to verify the
 > result, aside from randomly looking at a few files and hoping the
 > 99.99999% I didn't look at are ok too.

Why not use rsync to test it (with -n for a dry run, -v to see the changes
it would have made, and --delete to check both directions)? And if your pool
is too big, just test it on a smaller example: unlike rsync, the method you
and I are discussing should be pretty scalable, so if it works on a
reasonably complex 10GB pool, it should also work on a TB one... (hopefully)

 > It's far from complete. Its usefulness is limited, as long as I
 > can't copy to a remote machine. It only handles the pool/, cpool/ and pc/
 > directories, the rest needs to be copied by hand. There is debug output for
 > cases I hope to not encounter but which may be present in other people's
 > pools. I think I'm still missing a chown() or two.

What more is there beyond pool/cpool/pc, other than Trash (which you
presumably don't need to sync) and .ssh (which is small and should
rarely change)?
 > 
 > Let's see if I can find the figures. My pool was 103 GB, 10 million directory
 > entries pointing to 4 million inodes. Copy from local disk to ISCSI target
 > over shared 100 MBit network in 10:45 hours. Of this time, 15 minutes were
 > spent examining cpool (pool was empty), 73 minutes examining pc/, 165 seconds
 > sorting file list (1.1GB) and 9:14 hours copying/linking.
 > rsync might have worked for this pool, but I didn't test. I would be very
 > curious how this scales to a 3TB pool though ;-).
 > 
 > 
 > Question (to a tar expert named Craig or otherwise):
 > Is it possible to create a tar stream with this structure (i.e. lots of
 > directories, then file 1/2/3/123whatever with content, then several links in
 > different directories under pc/ to this file, then next pool file and so on),
 > or does a tar need to be sorted by directories?
 > 
 > If it *is* possible, creating a tar stream instead of copying/linking would
 > not be difficult, and then you could run
 > 
 >     BackupPC_copyPool ... | ssh hostname tar xpf -
 > 
 > (or via netcat, or even store the result on tape). Even merging into an
 > existing pool could be split into a BackupPC_mergePool script which takes a
 > tar stream and does whatever is necessary.
 > 
 > >    - Method 2: Calculate the md5sum file path of the file to figure
 > >      out where it is in the pool. Where necessary, determine the right
 > >      entry among chain duplicates.
 > 
 > That's basically what BackupPC_tarPCCopy does.

So why not use this? Or is it too slow since you need to create the
path md5sum?
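
(For my own reference, the path part is cheap once you have the digest; it
is reading each file to compute the digest that costs. A sketch of the
digest-to-path mapping as I understand the 3.x layout, with the helper name,
example values, and the chain-suffix comment being mine/from memory:)

    use strict;
    use warnings;

    # Map a 32-hex-digit pool digest to its location on disk: three
    # directory levels from the first three hex digits. BackupPC::Lib has
    # its own routine for this; the expensive part is computing the digest
    # itself, which requires reading (part of) each file.
    sub digest_to_pool_path {
        my ($topdir, $compress, $digest) = @_;
        my $pool = $compress ? "cpool" : "pool";
        my ($d1, $d2, $d3) = split //, substr($digest, 0, 3);
        # collision chain members get a _0, _1, ... suffix appended
        return "$topdir/$pool/$d1/$d2/$d3/$digest";
    }

    print digest_to_pool_path("/var/lib/backuppc", 1,
                              "d41d8cd98f00b204e9800998ecf8427e"), "\n";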
 > 
 > >    - Method 3: Not possible yet, but would be possible if the md5sum
 > >      file paths were appended to compressed backups. This would add very
 > >      little to the storage, but it would allow you to very easily
 > >      determine the right link. In that case you could just read the link
 > >      path from the file.
 > 
 > I believe this would speed up BackupPC_tarPCCopy by many orders of magnitude.
 > 
 > > 2. Then rsync *just* the pool -- this should be no problem since by
 > >    definition there are no hard links within the pool itself
 > > 
 > > 3. Finally, run through the list generated in #1 to create the new pc
 > >    directory by creating the necessary links (and for files with no
 > >    hard links, just copy/rsync them)
 > 
 > See BackupPC_tarPCCopy.
 > 
 > > The above could also be easily adapted to allow for "incremental" syncing.
 > > Specifically, in #1, you would use rsync to just generate a list of
 > > *changed* files in the pc directory. In #2, you would continue to use
 > > rsync to just sync *changed* pool entries. In #3 you would only act on
 > > the shortened incremental sync list generated in #1.
 > 
 > While you can limit your pc/ directory traversal to only a subset of all
 > backups of a host (or all hosts, if you give a start date, for example), I
 > don't quite see how syncing the pool should work. Remember that pool files
 > with hash collisions may be renumbered. Is this supposed to be limited to
 > re-creating an identical pool? Even then, renaming a pool file does not
 > affect the pc/ links to it. Overwriting it with different content does.
 > You would need to re-establish the correct links for existing backups too,
 > or figure out how the source pool was changed and replicate the changes to
 > the destination pool (rm foo_2 foo_3; mv foo_4 foo_2). This can be done,
 > but not with rsync, as far as I can tell.

Ahhhh... I see your point. Rsync won't realize that pool files have been
renumbered but will instead treat them as new files.

Maybe do the following:
- Use rsync with the dry-run flag to identify pool files that have changed
  (based on size, name, time). Then sort the list by md5sum path and work
  through it as follows (the grouping step is sketched below):
  - If it is a new md5sum path, just copy it over to the backup pool.
  - If it is an existing md5sum path, we need to deal with the following
    subcases:
    - New hash collision
    - Renumbering
    - New/changed rsync checksum or compression flag (but same file content)
    - Deleted pool entry
    To separate the cases, I would calculate the full-file md5sum of the
    content. Then renumber, then copy over new additions and delete
    deletions. Then check for changes to compression flags & rsync checksums.

The key is that this can all be decoupled from the pc tree since the
link itself is unchanged for any entry in the pc tree that stays.
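
To make the grouping step concrete, a rough sketch (the input file name and
layout are invented; the per-chain decision logic is left as comments):

    use strict;
    use warnings;

    # Take a list of changed pool file paths (e.g. from an rsync dry run),
    # strip the _N collision suffix, and bucket them by digest so that each
    # chain can be handled as a unit.
    my %chains;
    open my $in, "<", "/tmp/changed-pool-files" or die $!;
    while (my $path = <$in>) {
        chomp $path;
        (my $name = $path) =~ s{.*/}{};     # basename
        (my $digest = $name) =~ s/_\d+$//;  # drop any collision suffix
        push @{ $chains{$digest} }, $path;
    }
    close $in;

    for my $digest (sort keys %chains) {
        # per chain: decide between a brand-new digest, renumbering, a new
        # collision, a changed rsync checksum / compression flag, or a
        # deleted pool entry
        printf "%s: %d changed member(s)\n", $digest, scalar @{ $chains{$digest} };
    }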

After this incremental sync of the pool, you proceed to incrementally back
up the pc tree. In general, this is nicely structured, since a backup is
either added in total or deleted in total (though this gets more complicated
and would require a file-by-file examination of changes (or rsync) if you
had used a routine like my BackupPC_deleteFile to modify or delete
individual files from past backups).

- If a backup has been deleted, then simply 'rm -rf' it (along with
  its logs) from the pc tree 
- If a backup has been added, then use the methods described above to
  link in the new files to the already synchronized pool.

Note that a potentially better idea would be to restructure the pool naming
convention to get rid of the need to renumber chains. Instead of giving hash
collisions an ordinal subscript, just give each pool entry a name composed
of the concatenation of the path md5sum plus another random 32-digit number.
Then you would never have to renumber, since there would be essentially a
zero percent chance of collision. In that case you would be able to do a
simple rsync on the pool.
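
Purely as an illustration of the naming idea (not how BackupPC names pool
files today, and using hex digits and rand() for simplicity):

    use strict;
    use warnings;

    # Append a random 32-hex-digit suffix to the path digest so chains
    # never need renumbering. A real implementation would want a stronger
    # randomness source than rand().
    sub collision_free_pool_name {
        my ($digest) = @_;
        my $suffix = join "", map { sprintf "%x", int rand 16 } 1 .. 32;
        return "${digest}_$suffix";
    }

    print collision_free_pool_name("d41d8cd98f00b204e9800998ecf8427e"), "\n";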
 > 
 > > The more I think about it, the more I LIKE the idea of appending the
 > > md5sums file paths to compressed pool files (Method #3)
 > 
 > Yes, but the question is how often this information is needed. We're going
 > to a lot of trouble to *eliminate* redundancy. Adding redundancy for a case
 > every 100th user is going to encounter once in his life may not be
 > warranted. Then again, it's a fixed amount per file and probably not enough
 > to worry about ...
It could be made optional whether or not to append this additional
information (similar to how rsync checksums are optional).

More generally, since the latest rsync protocol uses md5sums anyway (instead
of md4), why not just bite the bullet and start using the full md5sum as the
pool file name, since we will be needing to calculate it anyway for protocol
30 rsync? This would get rid of hash collisions (for all intents and
purposes) and also give us a built-in checksum to use against file corruption
(or changes). If the md5sum is then stored similarly to how the rsync file
digests are now stored, we won't need to add any storage, and you could then
use my Method #3.
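
As a back-of-the-envelope sketch, the per-file work would boil down to
something like this (the example path and helper name are mine):

    use strict;
    use warnings;
    use Digest::MD5;

    # Hash the (uncompressed) file content once; the hex digest serves both
    # as the pool file name and, later, as an integrity check.
    sub full_md5_hex {
        my ($path) = @_;
        open my $fh, "<", $path or die "$path: $!";
        binmode $fh;
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }

    my $digest = full_md5_hex("/etc/hosts");
    print "$digest\n";   # 32 hex chars -> pool/<d1>/<d2>/<d3>/$digest as before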

Hash collisions are really a pita since they introduce a bit of
awkwardness. Getting rid of them would simplify the BackupPC_link
algorithm and would eliminate the need for pool renumbering.

Along these lines, I really wish that BackupPC didn't wait for the
second backup of a file to store the rsync digest information since it
would be nice to have the checksums always there in case you ever need
to verify file integrity.
