BackupPC-users

Re: [BackupPC-users] Problems with hardlink-based backups...

2009-08-22 23:29:00
Subject: Re: [BackupPC-users] Problems with hardlink-based backups...
From: dan <dandenson AT gmail DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sat, 22 Aug 2009 21:24:56 -0600
Unfortunately, every backup option you have has some limitations or imperfections.  Hardlinks have their pros and cons.  Really, there are only a few ways of doing incremental managed backups: hardlinks, diff files, diff file lists, and SQL.  Hardlinks are nice because they are inexpensive; looking at the directory contents of a backup that uses hardlinks requires no extra overhead.  Diff files and diff file lists (the former storing only the changes made to each individual file, the latter storing only the files that have changed) require an algorithm to recurse through the other directories that hold the real data and overlay the backup on the previous one.
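
To make that concrete, here is a rough sketch in Python of the hardlink-snapshot idea (this is not BackupPC's actual code, and the names are made up for illustration): unchanged files become one more hardlink to the previous generation, while changed or new files get a fresh copy.

    import filecmp
    import os
    import shutil

    def snapshot(source, prev_gen, new_gen):
        """Build new_gen from source, hardlinking files unchanged since prev_gen."""
        for dirpath, dirnames, filenames in os.walk(source):
            rel = os.path.relpath(dirpath, source)
            os.makedirs(os.path.join(new_gen, rel), exist_ok=True)
            for name in filenames:
                src = os.path.join(dirpath, name)
                old = os.path.join(prev_gen, rel, name)
                dst = os.path.join(new_gen, rel, name)
                if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                    os.link(old, dst)        # unchanged: just one more hardlink
                else:
                    shutil.copy2(src, dst)   # changed or new: store a fresh copy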

The only option that is more efficient than hardlinks would really be storing files in SQL along with an MD5, then linking the rows in SQL.  Very similar to a hardlink, but instead it's just a row pointer.  This would be many times faster than doing hardlinks in a filesystem because SQL selects in a hierarchy based on significant data.  It would be like backuppc only having one host with one backup on it when you are looking at the web interface; all the other hosts, backups, etc. are already excluded.
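
Roughly, the row-pointer idea would look something like this (just a sketch, with SQLite standing in for MySQL, and the table/column names made up for illustration):

    import sqlite3

    db = sqlite3.connect("backups.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS content (
            md5  TEXT PRIMARY KEY              -- hash of the file data, stored once
        );
        CREATE TABLE IF NOT EXISTS file (
            host   TEXT,
            backup INTEGER,                    -- backup number or date
            path   TEXT,
            md5    TEXT REFERENCES content(md5)  -- the "row pointer"
        );
    """)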

SQL file storage for backuppc has been discussed extensively on this list, and suffice it to say that opinions are very split, and for good reason.  SQL (MySQL specifically, but this applies to all of them) is much, much better at some tasks than a traditional filesystem (searching for data, orders of magnitude faster), but a filesystem is also much, much better at simply storing files.  Some hybrid could take the pros of each, such as storing all of the pointer data in MySQL and storing the actual files under their MD5 names on a filesystem: simply MD5 a file, push the MD5 off to MySQL along with the host, backup date, filename and file path, and write the file to the filesystem.  Incremental backups would MD5 a file and search the database for that MD5; if found, write a pointer to that entry, and if not, write a new entry with the MD5 of the file, the hostname, the file path and file name, and the backup number (or date).  All the files would just be stored under their MD5 names.  Recovering files would be less transparent, but would only require an SQL query to pull the list of files by hostname and backup number, and then pull those files, renamed, into a zip or tar file.
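
As a very rough sketch of that hybrid flow (hypothetical names again, reusing the two tables from the sketch above; MD5-named files go in a "pool" directory, metadata goes in SQL):

    import hashlib
    import os
    import shutil
    import sqlite3
    import tarfile

    POOL = "pool"                          # files stored once, named by their MD5
    os.makedirs(POOL, exist_ok=True)
    db = sqlite3.connect("backups.db")

    def backup_file(host, backup_no, path):
        """Store one file: dedup by MD5, then record a pointer row in SQL."""
        with open(path, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
        pooled = os.path.join(POOL, md5)
        if not os.path.exists(pooled):     # first time we've seen this content
            shutil.copy2(path, pooled)
            db.execute("INSERT OR IGNORE INTO content (md5) VALUES (?)", (md5,))
        db.execute("INSERT INTO file (host, backup, path, md5) VALUES (?, ?, ?, ?)",
                   (host, backup_no, path, md5))
        db.commit()

    def restore(host, backup_no, out_tar):
        """Pull one backup back out: SQL gives the file list, the pool gives the data."""
        with tarfile.open(out_tar, "w:gz") as tar:
            for path, md5 in db.execute(
                    "SELECT path, md5 FROM file WHERE host = ? AND backup = ?",
                    (host, backup_no)):
                tar.add(os.path.join(POOL, md5), arcname=path.lstrip("/"))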



On Mon, Aug 17, 2009 at 5:52 AM, David <wizzardx AT gmail DOT com> wrote:
Hi there.

Firstly, this isn't a backuppc-specific question, but it is of
relevance to backup-pc users (due to backuppc architecture), so there
might be people here with insight on the subject (or maybe someone can
point me to a more relevant project or mailing list).

My problem is as follows... with backup systems based on complete
hardlink-based snapshots, you often end up with a large number of
hardlinks, e.g. at least one per server file, per backup generation.

Now, this is fine most of the time... but there is a problem case that
comes up because of this.

If the servers you're backing up themselves have a huge number of
files (hundreds of thousands, or even millions), that means you end
up making a huge number of hardlinks on your backup server for each
backup generation.

Although inefficient in some ways (using up a large number of inode
entries in the filesystem tables), this can work pretty nicely.

Where the real problem comes in is if admins want to use 'updatedb'
or 'du' on the Linux system. updatedb builds a *huge* database and
uses up tonnes of CPU & RAM (so I usually disable it). And 'du' can
take days to run, and produce multi-GB output files.

Here's a question for backuppc users (and people who use hardlink
snapshot-based backups in general)... when your backup server, which
has millions of hardlinks on it, is running low on space, how do you
correct this?

The most obvious thing is to find which host's backups are taking up
the most space, and then remove some of the older generations.

Normally the simplest method to do this is to run a tool like 'du',
and then perhaps view the output in xdiskusage. (One interesting thing
about 'du' is that it's clever about hardlinks, so it doesn't count
the disk usage twice. I think it must keep an in-memory table of
visited inodes which have a link count of 2 or greater.)
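
Something along these lines (just my guess at the bookkeeping, not
du's actual implementation):

    import os

    def disk_usage(root):
        """Total allocated bytes under root, counting each hardlinked inode once."""
        seen = set()
        total = 0
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                if st.st_nlink > 1:
                    key = (st.st_dev, st.st_ino)
                    if key in seen:
                        continue          # hardlink to data we've already counted
                    seen.add(key)
                total += st.st_blocks * 512  # allocated space, which is what du reports
        return total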

However, with a gazillion hardlinks, du takes forever to run, and
produces massive output. In my case, about 3-4 days, and a 4-5 GB
output file.

My current setup is a basic hardlink snapshot-based backup scheme, but
backuppc (due to its pool structure, where hosts have generations of
hardlink snapshot dirs) would have the same problems.

How do people solve the above problem?

(I also imagine that running "du" to check disk usage of backuppc data
is complicated by the backuppc pool, but at least you can exclude
the pool from the "du" scan to get more usable results.)

My current fix is an ugly hack, where I go through my snapshot backup
generations (from oldest to newest), and remove all redundant hard
links (i.e., ones that point to the same inode as the same path in the
next-most-recent generation). That info goes into a compressed text
file that could be restored from later. And after that, I compare the
next two most-recent generations, and so on.

But yeah, that's a very ugly hack... I want to do it better and not
re-invent the wheel. I'm sure this kind of problem has been solved
before.

fwiw, I was using rdiff-backup before. It's very du-friendly, since
only the differences between each backup generation are stored (rather
than a large number of hardlinks). But I had to stop using it, because
with servers that have a huge number of files it uses up a huge amount
of memory + CPU, and takes a really long time. And the mailing list
wasn't very helpful with trying to fix this, so I had to change to
something new so that I could keep running backups (with history).
That's when I changed over to a hardlink snapshots approach, but that
has other problems, detailed above. And my current hack (removing all
redundant hardlinks and empty dir structures) is kind of similar to
rdiff-backup, but coming from another direction.

Thanks in advance for ideas and advice.

David.

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/