Subject: Re: [BackupPC-users] Fairly large backuppc pool (4TB) moved with backuppc_tarpccopy
From: Holger Parplies <wbppc AT parplies DOT de>
To: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
Date: Mon, 3 Oct 2011 19:54:02 +0200
Hi,

Jeffrey J. Kosowsky wrote on 2011-10-02 23:37:07 -0400 [Re: [BackupPC-users]
Fairly large backuppc pool (4TB) moved with backuppc_tarpccopy]:
>  > My method has the downside that you need to sort a huge file (but the
>  > 'sort' command handles huge files rather well). Jeffrey's method has the
>  > downside that you have an individual file per inode with typically
>  > probably only a few hundred bytes of content, which might end up
>  > occupying 4K each - depending on file system. Also, traversing the tree
>  > should take longer, because each file is opened and closed multiple times
>  > - once per link to the inode it describes.
>  > Actually, a single big file has a further advantage. It's rather fast to
>  > look for something (like all non-pooled files) with a Perl script.
>  > Traversing an "ipool" is bound to take a similar amount of time as
>  > traversing pool or cpool will.
> 
> Holger, have you ever compared the time on actual data?

No. Actually, I don't consider my script finished yet (and I haven't tried
yours). I've been developing mine out of interest in the matter, and because
I'd ultimately like to fix some things with one BackupPC installation I'm
responsible for. The problem there is that we needed to start over once or
twice due to file system problems, and I'd like to merge the former backup
history, which I've kept for that purpose, back in (this is more a
proof-of-concept sort of idea; we don't actually anticipate ever needing to
*restore data* from the old backups, so it doesn't have high priority). The
two goals I haven't yet found time to solve are:

1.) Merging of pools. Actually quite easy: when writing a pool file, use
    the logic from PoolWrite::write to match existing pool content or insert
    new pool files as appropriate (sketched below). I say "the logic of"
    because after each pool file I need to create the pc/ links, which means
    I can't wait for a link phase to create the actual pool link; by then it
    might be necessary to replace the new file with a link to a pool file
    created in the meantime.
    For my special case, I also need to handle merging "backups" files and
    possibly renaming backups.

2.) Network capability. I'd like to generate a single data stream in some
    format (tar?), so I can pass it over an arbitrary network transport (ssh,
    netcat, ...) for a remote copy operation (also sketched below). Currently,
    I use File::Copy, which, of course, limits it to local copies.

Together, these goals suggest that having an "incremental mode" would make
this a solution for offsite copies of BackupPC pools :-).
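
To make goal 1 a bit more concrete, the matching logic I have in mind looks
roughly like the untested sketch below. It is *not* the PoolWrite code, just
the idea: walk the collision chain of a digest (<digest>, <digest>_0,
<digest>_1, ...) under the pool tree as, if I remember the 3.x layout
correctly, pool/x/y/z/<digest>; hard-link against the first entry with
identical content, or append a new chain member. The digest itself would come
from BackupPC::Lib (File2MD5), which I'm not reproducing here, and $topdir is
just a placeholder. The compressed pool (cpool) would additionally need
BackupPC's compressed-file handling for the comparison.

    use strict;
    use warnings;
    use File::Compare qw(compare);
    use File::Basename qw(dirname);
    use File::Path qw(make_path);

    # Path of chain member $suffix (undef for the bare digest) in the
    # uncompressed pool, using the first three hex digits as directories.
    sub pool_path {
        my ($topdir, $digest, $suffix) = @_;
        my ($d1, $d2, $d3) = split //, $digest;
        my $name = defined $suffix ? "${digest}_${suffix}" : $digest;
        return "$topdir/pool/$d1/$d2/$d3/$name";
    }

    # Match $candidate (already written below $topdir) against the pool:
    # link it to an identical pool file, or make it a new pool file itself.
    sub match_or_insert {
        my ($topdir, $digest, $candidate) = @_;
        my $suffix;                                # undef, then 0, 1, 2, ...
        while (1) {
            my $pool = pool_path($topdir, $digest, $suffix);
            if (!-e $pool) {                       # end of chain: new pool file
                make_path(dirname($pool));         # ensure pool/x/y/z exists
                link($candidate, $pool) or die "link $pool: $!";
                return $pool;
            }
            if (compare($candidate, $pool) == 0) { # same content: reuse it
                unlink($candidate);
                link($pool, $candidate) or die "link $candidate: $!";
                return $pool;
            }
            $suffix = defined $suffix ? $suffix + 1 : 0;
        }
    }

For goal 2, the transport side could be as simple as the following sketch
(again untested, and the remote command is made up for illustration): open a
pipe to an arbitrary transport command and write the generated stream to it
wherever the code currently calls File::Copy.

    use strict;
    use warnings;

    # ssh here; netcat or anything else that moves bytes would do as well.
    my @transport = ('ssh', 'backuppc@offsite.example',
                     'cat > /var/lib/backuppc/incoming.stream');
    open(my $out, '|-', @transport) or die "cannot start transport: $!";
    binmode($out);

    # The copy code would serialize each file (metadata + contents) to $out
    # instead of calling File::Copy; the framing format is left open here.
    print {$out} "one serialized record\n";

    close($out) or die "transport failed: $!";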

I was almost surprised when I looked at my code yesterday to find out that
local copies should actually work. I didn't remember finishing that part :).

> Just one nit, I do allow for caching the inode pool so frequently
> referenced pool files do not require the corresponding inode pool file
> to be opened repeatedly.

Well, ok, but how well can that work? You're limited to something like 1024
open files per process, I think. Can you do better than LRU? Depending on how
you iterate over the pool, that would tend to give you a low cache hit rate
(files tend to repeat in consecutive backups rather than within a single
backup, I'd guess, and single backups will easily have more than 1024 files).
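
Just to make explicit what I mean by an LRU handle cache and its limits (this
is only a sketch of mine, not your code, and the size limit is an assumption):

    use strict;
    use warnings;

    use constant MAX_OPEN => 512;   # stay well below a typical 1024 fd limit
    my %cache;                      # path => { fh => handle, used => tick }
    my $tick = 0;

    sub cached_open {
        my ($path) = @_;
        if (my $entry = $cache{$path}) {        # hit: refresh LRU position
            $entry->{used} = ++$tick;
            return $entry->{fh};
        }
        if (keys(%cache) >= MAX_OPEN) {         # full: evict oldest handle
            my ($victim) = sort { $cache{$a}{used} <=> $cache{$b}{used} }
                           keys %cache;
            close $cache{$victim}{fh};
            delete $cache{$victim};
        }
        open(my $fh, '<', $path) or die "open $path: $!";
        $cache{$path} = { fh => $fh, used => ++$tick };
        return $fh;
    }

With more files per backup than cache slots, a pool file that is referenced
once per backup will usually have been evicted before it is needed again.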

> Also, I would think that with a large pool, that the file you
> construct would take up a fair bit of memory

Correct. I did try out the index-generating phase (which is still available
without copying via option switches, as is copying without regenerating the
index file), and I got something like 2 GB of data. I'd say something like
100 bytes per directory entry on the pool FS, as a *rough* estimate (at that
rate, 2 GB corresponds to on the order of 20 million directory entries).
What does 'du -s' of your ipool give you?

> and that the O(n log n) to search it might take more time than referencing
> a hierarchically structured pool, especially if the file is paged to disk.

Also correct. But I don't *need* to search it. I construct it in such a way
that the sort operation puts the lines ("records") in an order where I just
have to read the file linearly and act on the lines one at a time. The only
information I need to remember between lines is the inode number and path of
*one single* pool file (the last one encountered). Sorting the 2 GB file took
a matter of minutes.
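
In code, that linear pass amounts to something like the sketch below. The
record format (tab-separated inode, a one-letter type, path) and the sort keys
are only illustrations of the idea, not necessarily what my script writes, and
GNU sort's -T option is what lets the temporary files stay off both pool
disks.

    use strict;
    use warnings;

    my ($indexfile, $tmpdir) = @ARGV;
    die "usage: $0 indexfile tmpdir\n" unless defined $tmpdir;

    # Sort numerically by inode; the reversed second key makes each inode's
    # pool line ("P") come before its pc/ lines ("F"). -T keeps sort's
    # temporary files off the pool disks.
    open(my $in, '-|', 'sort', '-t', "\t", '-k1,1n', '-k2,2r',
                       '-T', $tmpdir, $indexfile)
        or die "sort: $!";

    my ($last_inode, $last_pool) = (-1, undef);
    while (my $line = <$in>) {
        chomp $line;
        my ($inode, $type, $path) = split /\t/, $line, 3;
        if ($type eq 'P') {                   # pool file: remember it ...
            ($last_inode, $last_pool) = ($inode, $path);
        } elsif ($inode == $last_inode) {     # ... pc/ file sharing its inode
            print "link $path -> $last_pool\n";  # a copy would link/write here
        } else {
            print "NOT POOLED: $path\n";      # pc/ file without a pool entry
        }
    }
    close($in) or die "sort failed: $!";

That also covers the "find all non-pooled files" case I mentioned: they are
exactly the pc/ lines without a matching pool line.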

> Of course, the above would depend on things like size of your memory,
> efficiency of file system access, cpu speed vs. disk access. Still would be
> curious though...

That's what I like about the solution. The only step that is likely to depend
on memory size is the sorting. Disk access is the key, as always when
processing a BackupPC pool. I don't see a way around that. But I *can* easily
keep the file off both source and destination pool disks if I want to. The
default is below the destination pool's TopDir, because that is the place
where I can most safely assume a large amount of free space. I could add
logic to check, but I believe this is really best specified manually.

> Finally, did you ever post a working version of your script?

No. I don't consider it tested, really, so I wouldn't want the integrity of
someone's pool copy to depend on whether I had a good day or not :-). For the
relevant question *in this thread*, that doesn't seem to be a problem.
Collecting and sorting the data is straightforward enough, and if the results
turn out to be incorrect, it will only waste a small amount of time (mine,
probably :).
Furthermore, it's not really commented or copyrighted, and the code isn't
cleaned up ... I can't even give you a synopsis or an option description
without looking closely at the code right now :-).

I'll send you a copy off-list, likewise to anyone else really interested, but
I'm not prepared to say "entrust all your pool data to this script" yet, not
even implicitly :-).

Regards,
Holger
