Subject: Re: [BackupPC-users] Problems with hardlink-based backups...
From: David <wizzardx AT gmail DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 19 Aug 2009 12:37:29 +0200

Thanks for the replies.

Firstly, I think I should reiterate a few things I mentioned in the first post.

I haven't actually used BackupPC yet; I've mainly read through its
docs, and I'm trying to judge how well it and its storage system would
work in our environment.

I'm mainly asking questions on this list first, to get an idea of how
well it handles the kind of issues I've experienced so far (with
things like hardlinks to huge filesystems), before I spend more time
playing with BackupPC and looking into migrating our backups to it.

And like I said before, this isn't a BackupPC-specific complaint, more
a general problem with hardlink-based backup systems (as opposed to
rdiffs, or various other schemes). So I'm checking how sysadmins
typically handle these kinds of issues.

Also, I'm not too experienced with backup "best practices",
methodologies, etc. Still learning, and seeing what works best. And
heh, our (relatively small) company didn't even have a real backup
system before, and I'm still the only person here that seems to take
them seriously >_>. Fortunately, the boss has started seeing the light
(after a near disaster in the server room), and acquired some more
hardware. But nobody besides me seems to have time to actually set up
things and make sure they're running. And I'm not even one of the
network admins/tech support, I'm actually a programmer and I was never
actually asked to work on the backups ^^; The actual network
admins/tech support don't really know much about backups D: (or have
time to work with them).

Anyway, hopefully the above will give you a better idea of my angle on
this. I'm not trying to criticize BackupPC, but rather figure out what
kind of backup scheme is going to work here (and be easy to
admin/diagnose/hack/etc), whether it is BackupPC, or something else
(that may or may not use hardlinks).

On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell <lesmikesell AT gmail DOT com> 
wrote:
>
> Why not just exclude the _TOPDIR_ - or the mount point if this is on its
> own filesystem?
>

Because most of the interesting files on the backup server (at least
in my case) are the files being backed up. I'm a lot more interested
in being able to quickly find those files than random stuff under
/etc, /usr, etc.

>
> There's not a good way to figure out which files might be in all of your
> backups and thus not help space-wise when you remove any instance(s) of
> it.  But the per-host, per-run stats where you can see the rate of new
> files being picked up and how much they compress is very helpful.
>

Thanks for this info. At least with per-host stats, it's easier to
narrow down where to run du if I need to, instead of over the entire
backup partition.
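
For example (making up the paths here, and assuming the __TOPDIR__/pc
layout described in the docs), I'm picturing something like:

    # rough per-host sizes, instead of du over the entire pool
    du -sh /var/lib/backuppc/pc/*

    # or drill into a single host's backup history
    du -sh /var/lib/backuppc/pc/somehost/*

Though I guess the numbers would still be skewed by the hardlinks into
the pool, since du only charges each file to whichever directory it
happens to scan first.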

A couple of random questions:

1) How well does BackupPC work when you manually make changes to the
pool behind its back (like removing a host, or some of the host's
history, via the command line)? Can you make it "resync/repair" its
database?

2) Is there a recommended approach for "backing up" BackupPC databases?

In case they get corrupted, and so on. Or is a simple rsync safe?

3) Is it possible to use BackupPC's logic on the command-line, with a
bunch of command-line arguments, without setting up config files?

That would be awesome for scripting and so on, for people who want to
use just parts of its logic (the pooling system, for instance), rather
than the entire backup system. I tend to prefer that kind of
"unix tool" design.

>
> Of course, but you do it by starting with a smaller number of runs than
> you expect to be able to hold.  Then after you see that the space
> consumed is staying stable you can adjust the amount of history to keep.
>

Ah right. I think this is a fundamental difference in approach. With
the backup systems I've used before, space usage keeps growing forever
until you take steps to fix it, either manually or with some kind of
scripting. So far I haven't added any scripting, so I rely on du to
figure out where to manually recover space.

Basically, I was using rdiff-backup for a long time. That tool keeps
all the history, until you run it with a command-line argument to
prune the oldest revisions.
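
(For reference, that's rdiff-backup's --remove-older-than option, if I
have the syntax right:)

    # prune everything older than one year; --force is needed (IIRC) when
    # this would remove more than one increment at a time
    rdiff-backup --force --remove-older-than 1Y /backups/somehost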

And also, I don't see a great need to pro-actively recover space most
of the time. The large majority of servers/users/etc have a relatively
small amount of change. So it's kind of cool to be able to get *any*
of the earlier daily snapshots, for the last few years.

Although ironically, the servers with the largest amount of churn (and
harddrive usage on the backup server) are the ones you'd actually want
to keep old versions for (like yearlies, monthlies, etc). But with
rdiff-backup, that isn't really possible without some major repo
surgery :-). You end up throwing away all the oldest versions when
space runs low.

Also, I'm influenced by revision control tools, like git/svn/etc. I
don't like to throw away old versions, unless it's really necessary.

And if you have a lot of harddrive space on the backup server, then
you may as well actually make use of it, to store as many versions as
possible, and then only remove the oldest versions when needed.

The above backup philosophy (based partly on rdiff-backup limitations)
has served me well so far, but I guess I need to unlearn some of it,
particularly if I want to use a hardlink-based backup system.

>
> One other thing - backuppc only builds a complete tree of links for full
> backups which by default run once a week with incrementals done on the
> other days.  Incremental runs build a tree of directories but only the
> new and changed files are populated, with a notation for deletions.  The
> web browser and restore processes merge the backing full on the fly and
> the expire process knows not to remove fulls until the incrementals that
> depend on it have expired as well.  That, and the file compression might
> take care of most of your problems.

Ah, very interesting info, thanks. I read the info on incrementals in
the docs, and mainly picked up that "rsync is a good thing" :-)

A couple of questions, pardon my noobiness:

If rsync is used, then what is the difference between an incremental
and a full backup?

I.e., do "full" backups copy all the data over (if using rsync), or
just the changed files?

And, what kind of disadvantage is there if you only do (rsync-based)
incrementals and don't ever make full backups?

On Tue, Aug 18, 2009 at 5:49 PM, Jon Craig <cannedspam.cant AT gmail DOT com> 
wrote:
> A personal desire on your part to use a specific tool to get
> information that is presented in other ways hardly constitutes a
> problem with BackupPC.

Again, I'm not criticizing BackupPC specifically. And indeed it seems
that BackupPC has ways to reduce the problem, specifically incremental
backups, as opposed to a large number (hundreds/thousands) of "full"
snapshot directories, each containing a huge number of hardlinks
(possibly millions), for several such servers.

My angle is that Linux sysadmins have certain tools they like to use,
and saying they can't use them effectively due to the backup
architecture is kind of problematic.

I guess though, that the philosophy behind rdiff-backup (keep every
single version, until you want to start removing oldest) isn't really
compatible with BackupPC, or other schemes that keep an actual
filesystem entry for every version of every file, even when there are
no changes in those files.

Probably I need to think more about using a more traditional scheme
(keep a fixed number of backups, X daily, Y weekly, Z monthly, etc),
instead of "keep versions forever, until you need to start recovering
harddrive space".

> The linking structure within BackupPC is the
> "magic" behind deduping files.  That it creates a huge number of
> directory entries with a resulting smaller number of inode entries is
> the whole point.

Yeah, I like that. But the problem I see is this:

(From BackupPC docs)

"Therefore, every file in the pool will have at least 2 hard links
(one for the pool file and one for the backup file below
__TOPDIR__/pc). Identical files from different backups or PCs will all
be linked to the same file. When old backups are deleted, some files
in the pool might only have one link. BackupPC_nightly checks the
entire pool and removes all files that have only a single link,
thereby recovering the storage for that file."

Therefore, if you want to keep tonnes of history (like, every day for
the past 3 years), for a server with lots of files, then it sounds
like you need to actually have a huge number of filesystem entries.
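
That does at least suggest some sanity checks you could run from the
shell (the paths are a guess on my part, assuming the pool and pc
trees live under __TOPDIR__):

    # pool files that no backup links to any more -- as I read the docs,
    # these are what BackupPC_nightly would clean up
    find /var/lib/backuppc/cpool -type f -links 1 | wc -l

    # a feel for how many filesystem entries one host's history costs
    find /var/lib/backuppc/pc/somehost -type f | wc -l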

I think if I wanted to use BackupPC, and still be able to use du and
friends effectively, I'd need to do some combination of:

1) Use incrementals for most of the backups, to limit the number of
hardlinks created, as Les Mikesell described.

2) Stop trying to keep history for every single day for years (rather
keep 1 for the last X days, last Y weeks, Z months, etc).

This would also mean spending less time managing space. Although at
the moment it only comes up every few weeks/months, and it had been
pretty fast with du & xdiskusage, at least until I switched over from
rdiff-backup to a "make a hardlink snapshot every day" process :-(.

> Use the status pages to determine where your space
> is going.  It gives you information about the apparent size (full size
> if you weren't de-duping) and the unique size (that portion of each
> backup that was new).  This information is a whole lot more useful than
> whatever you're gonna get from du.  du takes so long because it's a dumb
> tool that does what it's told, and you are in effect telling it to
> iterate across each server multiple times (1 per retained backup) for
> each server you backup.  If you did this against the actual clients
> the time would be similar to doing it against BackupPC's topdir.

Furthermore, hardlink-based storage causes ambiguous du output, even
if the time it takes to run weren't an issue. That's another thing
about hardlink-based backups that annoys me (compared to when I was
using rdiff-backup), and one of the reasons why I'm currently running
my own very hackish "de-duping" script on our backup server.
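
Just to spell out the ambiguity I mean (with made-up snapshot paths):
du charges each inode only once per invocation, so the per-snapshot
numbers depend on what else you ask about in the same run:

    # one snapshot on its own:
    du -sh /backups/somehost/2009-08-19

    # two snapshots together: files hardlinked between them are only
    # charged to the first one scanned, so the second number shrinks
    du -sh /backups/somehost/2009-08-18 /backups/somehost/2009-08-19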

It's nice that BackupPC maintains these stats separately. Although
it's kind of annoying (imo) that you have to go through its frontend
to see this info, rather than being able to tell from standard Linux
commands (for scripting purposes and so on).

It also bothers me that those kinds of stats can potentially go out
of sync with what's actually on the harddrive (maybe you delete part
of the pool by mistake).

Is there a way to make BackupPC "repair" its database, by re-scanning
its pool? Or some kind of recommended procedure for fixing problems
like this?

>
> As a side note, are you letting available space dictate your retention
> policy?  It sounds like you don't want to fund the retention policy
> you've specified, otherwise you wouldn't be out of disk space.  Buy
> more disk or reduce your retention numbers for backups.
>

More like, there wasn't a backup or retention policy to begin with D:.
I hacked together some scripts that use rdiff-backup and other tools,
and then added them to the backup server crontab.

And since we have a fairly large backup server (compared to the
servers being backed up), I let the older backups build up for a while
to take advantage of the space, and then free a chunk of space
manually when the scripts email me about space issues.

But now I can't "free a chunk of space manually" that easily any more,
since "du" doesn't work :-(.

At least thanks to the discussions in this thread, I have a few more
ideas for my own scripts, even if I don't use BackupPC in the end.

> Look at the Host Summary page.  Those servers with the largest "Full
> Size" or a disproportionate number of retained fulls/incrementals are
> the hosts to focus pruning efforts on. Now select a candidate and

Ah, thanks. This is very useful info. So you can find which
files/transfers/etc caused a given host to use a huge amount of
storage.

> Voila', you've put your system on a diet, but beware, you do this once
> and management will expect you to keep solving their under-resourced
> backup infrastructure by doing it again and again.

Well, the good news is that nobody here seems to care about the
backups much, until the moment they're needed. The fact we have them
at all is kind of a bonus D:. At least I'm starting to get the boss
(we're a pretty small company) on my side. Just that nobody besides
myself has time to work on things like this.

Anyway, thanks again for the replies. This thread has been educational
so far :-)

David.

PS: Random question: Does BackupPC have tools for making offsite,
offline backups? Like copying a subset of the recent BackupPC backups
over to a set of external drives (in encrypted format) and then taking
the drives home or something like that.

Or alternately, are there recommended tools for this? I made a script
for this, but want to see how people here usually handle this.
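
For what it's worth, my current script boils down to roughly this
(device names and paths made up, so treat it as a sketch): unlock a
LUKS-encrypted external drive, rsync the recent backups onto it with
hardlinks preserved, and detach it again.

    # unlock and mount the encrypted external drive
    cryptsetup luksOpen /dev/sdc1 offsite
    mount /dev/mapper/offsite /mnt/offsite

    # -H preserves hardlinks, so a pooled/hardlinked tree doesn't balloon
    # to its apparent size on the external drive
    rsync -aH --delete /backups/recent/ /mnt/offsite/backups/

    umount /mnt/offsite
    cryptsetup luksClose offsite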
