BackupPC-users

Subject: Re: [BackupPC-users] Problems with hardlink-based backups...
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Sun, 30 Aug 2009 01:07:59 -0400
dan wrote at about 21:24:56 -0600 on Saturday, August 22, 2009:
 > Unfortunately, every backup option you have has some limitations or
 > imperfections.  Hardlinks have their pros and cons.  Really, there are
 > only a few ways of doing incremental managed backups: hardlinks, diff
 > files, diff file lists, and SQL.  Hardlinks are nice because they are
 > inexpensive: looking at the directory contents of a backup that uses
 > hardlinks requires no extra overhead.  Diff files and diff file lists
 > (a diff file stores only the changes to each individual file, while a
 > diff file list stores whole copies of only the files that have changed)
 > require an algorithm to recurse through the other directories that hold
 > the real data and overlay the backup on the previous one.
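 > 
 > For illustration, here's a minimal sketch (Python, with made-up paths)
 > of how a hardlink-based incremental snapshot can be built; every
 > unchanged file costs only a new directory entry, not new data blocks:
 > 
 > import filecmp
 > import os
 > import shutil
 > 
 > def snapshot(source, prev_gen, new_gen):
 >     """Build new_gen from source, hardlinking files unchanged
 >     since prev_gen (compared by size/mtime via filecmp)."""
 >     for dirpath, dirnames, filenames in os.walk(source):
 >         rel = os.path.relpath(dirpath, source)
 >         os.makedirs(os.path.join(new_gen, rel), exist_ok=True)
 >         for name in filenames:
 >             src = os.path.join(dirpath, name)
 >             old = os.path.join(prev_gen, rel, name)
 >             new = os.path.join(new_gen, rel, name)
 >             if os.path.exists(old) and filecmp.cmp(src, old):
 >                 os.link(old, new)       # unchanged: share the inode
 >             else:
 >                 shutil.copy2(src, new)  # new/changed: store the data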
 > 
 > The only option that is more efficient than hardlinks would really be
 > storing files in SQL along with an MD5 digest, then linking the rows in
 > SQL.  It is very similar to a hardlink, but instead it is just a row
 > pointer.  This would be many times faster than doing hardlinks in a
 > filesystem, because SQL selects within a hierarchy based on significant
 > data.  It would be like BackupPC only having one host with one backup on
 > it when you are looking at the web interface; all the other hosts and
 > backups are already excluded.
 > 
 > SQL file storage for BackupPC has been discussed extensively on this
 > list, and suffice it to say that opinions are very split, and for good
 > reason.  SQL (MySQL specifically, but this applies to all of them) is
 > much, much better than a traditional filesystem at some tasks (searching
 > for data is orders of magnitude faster), but a filesystem is also much,
 > much better at simply storing files.  Some hybrid could take the pros of
 > each, such as storing all of the pointer data in MySQL and storing the
 > actual files under their MD5 names on a filesystem: simply MD5 a file,
 > push the MD5 off to MySQL with the host, backup date, filename, and file
 > path, and write the file to the filesystem.  Incremental backups would
 > MD5 a file and search the database for that MD5; if found, write a
 > pointer to that entry, and if not, write a new entry with the MD5 of the
 > file, the hostname, the file path and file name, and the backup number
 > (or date).  All the files would just be stored under their MD5 names.
 > Recovering the files would be less transparent, but would only require
 > an SQL query to pull the list of files based on hostname and backup
 > number and then pull those files, renamed, into a zip or tar file.
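 > 
 > A rough sketch of that flow (Python, with SQLite standing in for
 > MySQL; the table layout and pool path are made up for illustration):
 > 
 > import hashlib
 > import os
 > import shutil
 > import sqlite3
 > 
 > POOL = "/var/backups/pool"  # files stored under their MD5 names
 > 
 > db = sqlite3.connect("backups.db")
 > db.execute("""CREATE TABLE IF NOT EXISTS files
 >               (md5 TEXT, host TEXT, path TEXT, backup_num INT)""")
 > 
 > def file_md5(path):
 >     h = hashlib.md5()
 >     with open(path, "rb") as f:
 >         for chunk in iter(lambda: f.read(1 << 20), b""):
 >             h.update(chunk)
 >     return h.hexdigest()
 > 
 > def store(host, backup_num, path):
 >     md5 = file_md5(path)
 >     pooled = os.path.join(POOL, md5)
 >     if not os.path.exists(pooled):   # content not seen before
 >         shutil.copy2(path, pooled)   # write it once, named by MD5
 >     # either way, record a pointer row for this host/backup/path
 >     db.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
 >                (md5, host, path, backup_num))
 >     db.commit()
 > 
 > def backup_listing(host, backup_num):
 >     # restore side: one query yields the (path, md5) pairs to pull
 >     # out of the pool and rename into a tar or zip archive
 >     return db.execute("SELECT path, md5 FROM files"
 >                       " WHERE host = ? AND backup_num = ?",
 >                       (host, backup_num)).fetchall()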
 > 
That is exactly the hybrid that I have been advocating... But as you
mentioned, some like it and some don't...

 > 
 > 
 > On Mon, Aug 17, 2009 at 5:52 AM, David <wizzardx AT gmail DOT com> wrote:
 > 
 > > Hi there.
 > >
 > > Firstly, this isn't a backuppc-specific question, but it is of
 > > relevance to backuppc users (due to the backuppc architecture), so
 > > there might be people here with insight on the subject (or maybe
 > > someone can point me to a more relevant project or mailing list).
 > >
 > > My problem is as follows... with backup systems based on complete
 > > hardlink-based snapshots, you often end up with a large number of
 > > hardlinks, e.g. at least one per file on the server, per backup
 > > generation.
 > >
 > > Now, this is fine most of the time... but there is a problem case that
 > > comes up because of this.
 > >
 > > If the servers you're backing up themselves have a huge number of
 > > files (hundreds of thousands, or even millions), that means you end
 > > up making a huge number of hardlinks on your backup server for each
 > > backup generation.
 > >
 > > Although inefficient in some ways (using up a large number of inode
 > > entries in the filesystem tables), this can work pretty nicely.
 > >
 > > Where the real problem comes in is if admins want to use 'updatedb'
 > > or 'du' on the Linux system. updatedb builds a *huge* database and
 > > uses up tonnes of CPU & RAM (so I usually disable it). And 'du' can
 > > take days to run, and produce multi-GB output files.
 > >
 > > Here's a question for backuppc users (and people who use hardlink
 > > snapshot-based backups in general)... when your backup server, that
 > > has millions of hardlinks on it, is running low on space, how do you
 > > correct this?
 > >
 > > The most obvious thing is to find which host's backups are taking up
 > > the most space, and then remove some of the older generations.
 > >
 > > Normally the simplest method to do this is to run a tool like 'du',
 > > and then perhaps view the output in xdiskusage. (One interesting
 > > thing about 'du' is that it's clever about hardlinks, so it doesn't
 > > count the disk usage twice. I think it must keep an in-memory table
 > > of visited inodes which have a link count of 2 or greater.)
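 > >
 > > (Roughly, I imagine the bookkeeping looks like this sketch in
 > > Python: remember each (device, inode) pair whose link count is
 > > greater than 1, and charge its blocks only once:)
 > >
 > > import os
 > >
 > > def du(root):
 > >     """du-style total that counts multiply-linked inodes once."""
 > >     seen = set()
 > >     total = 0
 > >     for dirpath, dirnames, filenames in os.walk(root):
 > >         for name in filenames:
 > >             st = os.lstat(os.path.join(dirpath, name))
 > >             if st.st_nlink > 1:
 > >                 key = (st.st_dev, st.st_ino)
 > >                 if key in seen:
 > >                     continue  # this inode is already counted
 > >                 seen.add(key)
 > >             total += st.st_blocks * 512  # allocated bytes
 > >     return total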
 > >
 > > However, with a gazillion hardlinks, du takes forever to run and
 > > produces massive output: in my case, about 3-4 days and a 4-5 GB
 > > output file.
 > >
 > > My current setup is a basic hardlink snapshot-based backup scheme,
 > > but backuppc (due to its pool structure, where hosts have generations
 > > of hardlink snapshot dirs) would have the same problems.
 > >
 > > How do people solve the above problem?
 > >
 > > (I also imagine that running "du" to check the disk usage of backuppc
 > > data is complicated by the backuppc pool, but at least you can exclude
 > > the pool from the "du" scan to get more usable results.)
 > >
 > > My current fix is an ugly hack, where I go through my snapshot backup
 > > generations (from oldest to newest) and remove all redundant hardlinks
 > > (i.e. links that point to the same inodes as the corresponding entries
 > > in the next-most-recent generation). The removed paths go into a
 > > compressed text file, from which they could be restored later. After
 > > that, I compare the next two most recent generations, and so on.
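 > >
 > > (For the curious, the core of the hack looks roughly like this in
 > > Python -- paths and the record-file format are simplified, and both
 > > generations are assumed to live on the same filesystem:)
 > >
 > > import gzip
 > > import os
 > >
 > > def prune(old_gen, newer_gen, record):
 > >     """Unlink files in old_gen that share an inode with the
 > >     corresponding file in newer_gen, logging what was removed."""
 > >     with gzip.open(record, "wt") as log:
 > >         for dirpath, dirnames, filenames in os.walk(old_gen):
 > >             rel = os.path.relpath(dirpath, old_gen)
 > >             for name in filenames:
 > >                 old = os.path.join(dirpath, name)
 > >                 new = os.path.join(newer_gen, rel, name)
 > >                 if (os.path.exists(new) and
 > >                         os.lstat(old).st_ino == os.lstat(new).st_ino):
 > >                     log.write(old + "\n")  # so it can be restored
 > >                     os.unlink(old)         # drop the redundant link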
 > >
 > > But yeah, that's a very ugly hack... I want to do it better and not
 > > re-invent the wheel. I'm sure this kind of problem has been solved
 > > before.
 > >
 > > fwiw, I was using rdiff-backup before. It's very du-friendly, since
 > > only the differences between backup generations are stored (rather
 > > than a large number of hardlinks). But I had to stop using it, because
 > > on servers with a huge number of files it uses up a huge amount of
 > > memory and CPU, and takes a really long time. And the mailing list
 > > wasn't very helpful with trying to fix this, so I had to change to
 > > something new so that I could keep running backups (with history).
 > > That's when I changed over to a hardlink snapshot approach, but that
 > > has other problems, detailed above. And my current hack (removing all
 > > redundant hardlinks and empty dir structures) is kind of similar to
 > > rdiff-backup, but coming from another direction.
 > >
 > > Thanks in advance for ideas and advice.
 > >
 > > David.