Subject: Re: [BackupPC-users] Problems with hardlink-based backups...
From: Adam Goryachev <mailinglists AT websitemanagers.com DOT au>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 19 Aug 2009 21:58:09 +1000

David wrote:
> On Tue, Aug 18, 2009 at 5:35 PM, Les Mikesell <lesmikesell AT gmail DOT com>
> wrote:
>> Why not just exclude the _TOPDIR_ - or the mount point if this is on its
>> own filesystem?
> Because most of the interesting files on the backup server (at least
> in my case), are the files being backed up. I'm a lot more interested
> in being able to quickly find those files, than random stuff under
> /etc, /usr, etc.

Yes, and this is something I'd like to have in backuppc (please find a
file on any host, in any backup number, with the string abc in its
filename). This isn't possible without using the standard tools like
find, and waiting for them to traverse all the directories and backups
etc.. (well, you could use grep on the logfiles to find it, which would
probably be faster)...
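
If you do go the find route, something along these lines works, just
slowly (the path below is only a guess at your TopDir, and remember
backuppc stores filenames mangled with a leading "f"):

  # rough example only -- walks every backup of every host, so it can
  # take a very long time on a big pool
  find /var/lib/backuppc/pc -name '*abc*'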

>> There's not a good way to figure out which files might be in all of your
>> backups and thus not help space-wise when you remove any instance(s) of
>> it.  But the per-host, per-run stats where you can see the rate of new
>> files being picked up and how much they compress is very helpful.
> Thanks for this info. At least with per-host stats, it's easier to
> narrow down where to run du if I need to, instead of over the entire
> backup partition.
> 
> A couple of random questions:
> 
> 1. How well does BackupPC work when you manually make changes to the
> pool behind its back? (like removing a host, or some of the host's
> history, via the command line). Can you make it "resync/repair" its
> database?

Removing hosts or individual backups doesn't affect the pool, and in my
experience this works just fine, although I would advise against doing
it, simply because you never know exactly what might get stuffed up....

I've had a remote client rename about 10G of images, so I simply did a
cp -al from the previous full backup into the current partial (aborted
full) backup, and then continued the full backup. It then noticed all
the old filenames were gone, found the new filenames were already
downloaded (hardlinked really), and continued on nicely.
I've also deleted individual files (vmware disk image files, dvd images,
etc) and not had a problem.
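
For illustration, the cp -al trick was roughly this (host name, backup
numbers and directory names are made up here; the stored names carry
backuppc's "f" mangling prefix):

  cd /var/lib/backuppc/pc/somehost
  # hard-link the renamed directory from the last complete full (here #120)
  # into the in-progress backup (#123) so rsync finds the data already there
  cp -al 120/fshare/fold_images 123/fshare/fnew_images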

Of course, if you are going to do things like that, you should try to
use the tools that have recently been written to help do this properly.

> 2) Is there a recommended approach for "backing up" BackupPC databases?
> In case they get corrupted and so on. Or is a simple rsync safe?

Stop backuppc, umount the partition, and use dd to copy it to another
partition; or else use RAID1 with three members: stop backuppc, umount,
remove a member, and you have your backup.
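
Something along these lines (device names, mount points and init script
paths are just guesses for your system):

  /etc/init.d/backuppc stop
  umount /var/lib/backuppc
  dd if=/dev/sdb1 of=/dev/sdc1 bs=4M    # raw copy of the pool filesystem
  mount /var/lib/backuppc
  /etc/init.d/backuppc start

  # or, with a 3-member RAID1, fail/remove one member and keep it as the copy:
  mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1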

Rsync *should* work fine for smaller pools/number of files, as long as
you have lots of RAM on both ends.... Eventually, you will get a pool
size (number of files) where it will stop working...
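
If you do try rsync, the -H option is the important (and memory hungry)
part, since that is what preserves the pool hardlinks, e.g. (paths and
hostname are examples only):

  rsync -aH --delete /var/lib/backuppc/ backuphost:/var/lib/backuppc-copy/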

> 3) Is it possible to use BackupPC's logic on the command-line, with a
> bunch of command-line arguments, without setting up config files?

No, not really.

> That would be awesome for scripting and so on, for people who want to
> use just parts of its logic (like the pooled system for instance),
> rather than the entire backup system. I tend to prefer that kind of
> "unix tool" design.

You really sound like a programmer <EG> (yes I have read the rest of
your post already)...

After configuring backuppc, there are some things you can do to
basically cancel out all the automated features of backuppc and just use
its pieces manually. Though I think if you actually used backuppc
normally first, you would be unlikely to want to do this.

>> Of course, but you do it by starting with a smaller number of runs than
>> you expect to be able to hold.  Then after you see that the space
>> consumed is staying stable you can adjust the amount of history to keep.
> 
> Ah right. I think this is a fundamental difference in approach. With
> the backup systems I've used before, space usage is going to keep
> growing forever, until you take steps to fix it. Either manually, or
> by some kind of scripting, and so far I haven't added scripting, so I
> rely on du to know where to manually recover space.
> 
> Basically, I was using rdiff-backup for a long time. That tool keeps
> all the history, until you run it with a command-line argument to
> prune the oldest revisions.

You specify in advance how many incremental and full backups you want,
what period you want to keep them on, etc. Then backuppc *can*
automatically prune the relevant backups to keep what you have asked
for. One specific point is that you can keep your daily (incremental)
backups for the past month, then every second one for two months, and
all fulls (weekly) for the past 6 months, every 4th full for the past
two years, etc...
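
In config.pl terms that sort of schedule looks roughly like this (the
numbers are only an example, not a recommendation):

  $Conf{IncrKeepCnt}    = 30;          # about a month of daily incrementals
  $Conf{FullKeepCnt}    = [26, 0, 18]; # 26 weekly fulls, then every 4th full
                                       # (4-weekly) for roughly the next 18 months
  $Conf{FullKeepCntMin} = 4;
  $Conf{FullAgeMax}     = 730;         # don't keep fulls older than ~2 years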

> And also, I don't see a great need to pro-actively recover space most
> of the time. The large majority of servers/users/etc have a relatively
> small amount of change. So it's kind of cool to be able to get *any*
> of the earlier daily snapshots, for the last few years.

I never recover space on any of my backuppc servers either, but
sometimes I increase the number of backups I want to keep :) Yes, some
things are cool, but they are rarely useful... However, I have one
customer whose backuppc server keeps *every* backup it has ever
completed, and that has been running for over 3 years now.

> Although ironically, the servers with the largest amount of churn (and
> harddrive usage on backup server), are the ones you'd actually want to
> keep old versions for (like yearlies, monthlies, etc). But with
> rdiff-backup, that isn't really possible without some major repo
> surgery :-). You end up throwing away all the oldest versions when
> space runs low.

Which is the problem with those tools. Sometimes you want to keep the
backup from 7 years ago, but you don't really need every daily backup
for the past 7 years! This is where backuppc is quite helpful...

> Also, I'm influenced by revision control tools, like git/svn/etc. I
> don't like to throw away old versions, unless it's really necessary.

When it is necessary, do you always want to throw away the oldest
version, though?

> And, if you have a lot of harddrive space on the backup server, then
> may as well actually make use of it, to store as many versions as
> possible. And then only remove oldest versions where needed.

Again, you might not want to remove the oldest, you might want to remove
some of the in between backups...

> The above backup philosophy (based partly on rdiff-backup limitations)
> has served me well so far, but I guess I need to unlearn some of it,
> particularly if I want to use a hardlink-based backup system.

Or just get more disk space...

> If rsync is used, then what is the difference between an incremental
> and a full backup?

Basically, a full will read every file on the client and the backuppc
server and compare checksums. An incremental skips that full checksum
comparison and trusts the file attributes (size, modification time), so
files that look unchanged aren't re-read.

> ie, do  "full" backups copy all the data over (if using rsync), or
> just the changed files?

No, both full and incremental will only transfer the modified portions
of the modified files (if using rsync).

> And, what kind of disadvantage is there if you only do (rsync-based)
> incrementals and don't ever make full backups?

In the older versions (which my above client started with, and this is
the config I started with), an incremental backup would compare the
remote client with the last *full* backup, so over time you needed to
transfer more and more data over the network. In current versions, an
incremental can be compared against the last incremental of a lower
level (not sure how many levels you can get, but you can do
[0,1,0,0,2,1,1,3,2,2,4,3,3,5,4,4,6] etc.. or whatever you like... not
sure how many entries can be included there).
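
That is the $Conf{IncrLevels} setting in config.pl, e.g. (just an
example list):

  $Conf{IncrLevels} = [1, 2, 3, 4, 5, 6];  # each incremental is relative to the
                                           # most recent backup of a lower level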

After working out how this affected backuppc (along with the huge amount
of extra work to "fill in" the backups in the web interface), I just
configured full backups every 3 days. The only real difference between a
full and incremental is the amount of IO load and CPU load on the client
(and backuppc server), and hence the time it takes to complete a backup.
You really should schedule a regular full backup anyway.
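
ie something like this (values are examples; using slightly less than a
whole number of days keeps the schedule from slipping):

  $Conf{FullPeriod} = 2.97;   # a full roughly every 3 days
  $Conf{IncrPeriod} = 0.97;   # incrementals daily in between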

Also, another reason for regular full backups is so you don't need to
keep every full backup, you can drop every second (or every fourth etc)
backup to recover space...

> My angle is that Linux sysadmins have certain tools they like to use,
> and saying they can't use them effectively due to the backup
> architecture is kind of problematic.

It isn't that they can't be used... they are just slow, and there are
more efficient methods to obtain the same information. I could use find
or grep or du on my massive maildirs, but they suck for that, and there
are other methods to get some of the answers I need; other times, I
still have to use du/find/etc...

> Probably I need to think more about using a more traditional scheme
> (keep a fixed number of backups, X daily, Y weekly, Z monthly, etc),
> instead of "keep versions forever, until you need to start recovering
> harddrive space".

You can still keep versions forever, just set the keepcnt values to very
high values... 15 years, or 50 years, etc... The difference is that with
backuppc you have more flexibility about *which* backups you remove to
recover space... Consider the common case of a growing log file: you
back it up every day, and the file is rotated each month. So you have 30
versions of the same file, yet you don't really need 29 of them, since
all the data is included in the last/30th one... etc.. lots of examples
I'm sure you can think of :)

> But the problem I see is this:
> 
> (From BackupPC docs)
> 
> "Therefore, every file in the pool will have at least 2 hard links
> (one for the pool file and one for the backup file below
> __TOPDIR__/pc). Identical files from different backups or PCs will all
> be linked to the same file. When old backups are deleted, some files
> in the pool might only have one link. BackupPC_nightly checks the
> entire pool and removes all files that have only a single link,
> thereby recovering the storage for that file."
> 
> Therefore, if you want to keep tonnes of history (like, every day for
> the past 3 years), for a server with lots of files, then it sounds
> like you need to actually have a huge number of filesystem entries.

Yes, but is that a problem?

With 5 hosts being backed up, I have 401 full backups, and 3303
incremental backups, using 36TB of storage prior to pooling and
compression. (ie, if we didn't have hardlinks or compression).

We have approx 1.9M unique files in the pool using only 680GB of disk space.

I'm not sure how to calculate the actual number of inodes used... (df -i
doesn't tell us anything useful since we are using reiserfs, which
allocates inodes dynamically; I'm sure you would get major issues doing
this on ext2/3 etc. unless the filesystem was created with plenty of
inodes..)
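
The closest I can suggest is just counting directory entries, which is
slow on a big pool (paths are examples; cpool is the compressed pool):

  find /var/lib/backuppc/cpool -type f | wc -l   # unique pooled files
  find /var/lib/backuppc/pc -type f | wc -l      # hardlinked entries in the backups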

> I think if I wanted to use BackupPC, and still be able to use du and
> friends effectively, I'd need to do some combination of:
> 
> 1) Use incrementals for most of the backups, to limit the number of
> hardlinks created, as Les Mikesell described.
> 
> 2) Stop trying to keep history for every single day for years (rather
> keep 1 for the last X days, last Y weeks, Z months, etc).

or just be more patient with how long those tools take to run, and
realise that they might stop working one day if your pool/etc gets too
big...

> This would also mean having to spend less time managing space.
> Although at the moment it only comes up every few weeks/months, and
> had been pretty fast with du & xdiskusage, at least until I switched
> over from rdiff-backup to a "make a hardlink snapshot every day"
> process :-(.

or just get more disk space :)

> And furthermore, hardlink-based storage does cause ambiguous du
> output, even if the time it took to run wasn't an issue. Which is
> another thing about hardlink-based backups which annoys me (compared
> to when I was using rdiff-backup), and one of the reasons why I'm
> currently running my own very hackish "de-duping" script on our backup
> server.

Or is it that you don't know the right tool for this job which annoys
you (a little sarcasm :)...

> Nice that BackupPC maintains these stats separately. Although kind of
> annoying (imo), that you have to go through its frontend to see this
> info, rather than being able to tell from standard linux commands (for
> scripting purposes and so on).

As far as I know, the format of the files this information is stored in
is well documented, and as such you could write scripts to your heart's
content to read/parse these simple text files, and get any information
you desire...
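
For example, each host has a tab-separated "backups" file under
TopDir/pc/<host>/ summarising every backup; a quick (untested) one-liner
along these lines pulls the per-backup totals out of it (the host name
is made up, and the field meanings are in the documentation):

  awk -F'\t' '{ printf "backup %s (%s): %s files, %s bytes\n", $1, $2, $5, $6 }' \
      /var/lib/backuppc/pc/somehost/backups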

> And also it bothers me that those kind of stats can potentially go out
> of synch with the harddrive (maybe you delete part of the pool by
> mistake).

Ummm, don't make mistakes :) or if you do, fix the stats...

> Is there a way to make BackupPC "repair" its database, by re-scanning
> its pool? Or some kind of recommended procedure for fixing problems
> like this?

I am pretty sure there are no such tools... you either live with it until
the relevant backups are purged, or you manually stuff around,
potentially making the problem even worse (ie, messing it up in a way
that you don't know you have messed it up, as opposed to knowing it is
wrong).

>> As a side note, are you letting available space dictate your retention
>> policy?  It sounds like you don't want to fund the retention policy
>> you've specified, otherwise you wouldn't be out of disk space.  Buy
>> more disk or reduce your retention numbers for backups.
> And since we have a fairly large backup server (compared to the
> servers being backed up), I let the older backups build up for a while
> to take advantage of the space, and then free a chunk of space
> manually when the scripts email me about space issues.
> 
> But now I can't "free a chunk of space manually" that easily any more,
> since "du" doesn't work :-(.

rm -rf TopDir/pc/host/nnn (where nnn is a random incremental backup
number, or a full backup that no remaining incremental relies on) seems
to work pretty well. Though I'd advise adjusting the values in the config
file and letting backuppc purge the backups itself.
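
ie drop the keep counts in config.pl and let the next wakeup/nightly do
the work, something like (example values only):

  $Conf{FullKeepCnt} = [8];    # was higher; the excess fulls get expired
  $Conf{IncrKeepCnt} = 14;     # likewise for incrementals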

> Well, the good news is that nobody here seems to care about the
> backups much, until the moment they're needed. The fact we have them
> at all is kind of a bonus D:. At least I'm starting to get the boss
> (we're a pretty small company) on my side. Just that nobody besides
> myself has time to work on things like this.

Once you lose all the data, everybody will have plenty of time :) You
can't afford not to have good backups! (But hey, *we* all know that....)

One other thing that should be considered: the point of using backuppc
is that lots of other people use it, and have checked that there are no
bugs etc in it. As such, we are somewhat certain that we will get back
the correct data as long as we treat it correctly (don't fiddle with
its storage behind its back)... Home-grown scripts/programs can be
hugely rewarding/etc, but you will never get the same
reliability/certainty about the software. Of course, you also have to
write all the improvements yourself, instead of just downloading the new
version that someone else was nice enough to write for you :)

> PS: Random question: Does backuppc have tools for making offsite,
> offline backups? Like copying a subset of the recent BackupPC backups
> over to a set of external drives (in encrypted format) and then taking
> the drives home or something like that.

Yes, you can archive backups... One of my customers plugs in an eSATA
drive, and crontab runs a script to mount the drive, create the tar files
of the most recent backups onto a staging area (internal raid array),
delete the old files from the external disk, then copy the new tar files
onto the eSATA drive, and finally delete the files from the staging
area...

Lots of checks/etc to make sure we are doing the correct things, and
alerts (or OK's) are reported back to the monitoring system as needed.
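
A very rough sketch of that sort of script (paths, host names and the
BackupPC_tarCreate location are all guesses for your install; run it as
the backuppc user and add your own checks/alerts):

  #!/bin/sh
  set -e
  mount /mnt/esata
  for host in server1 server2; do
      # -h host, -n backup number, -s share to dump; -n -1 should pick the
      # most recent backup (check your version's docs), "." means everything
      /usr/share/backuppc/bin/BackupPC_tarCreate -h $host -n -1 -s / . \
          | gzip > /staging/$host.tar.gz
  done
  rm -f /mnt/esata/*.tar.gz
  cp /staging/*.tar.gz /mnt/esata/
  rm -f /staging/*.tar.gz
  umount /mnt/esata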

> Or alternately, are there recommended tools for this? I made a script
> for this, but want to see how people here usually handle this.

This is where custom scripts/plugins are best utilised. A single program
can't determine the possible needs of every user.... :)

I hope the above information is useful to you, please note it is just my
wordy opinion, and probably hardly worth the electrons used to display
it. Please recycle them thoughtfully...

Regards,
Adam
