dan wrote at about 21:24:56 -0600 on Saturday, August 22, 2009:
> Unfortunately, every backup option you have has some limitations or
> imperfections. Hardlinks have their pros and cons. Really, there are
> only a few ways of doing incremental managed backups: hardlinks, diff
> files, diff file lists, and SQL. Hardlinks are nice because they are
> inexpensive: looking at the directory contents of a backup that uses
> hardlinks requires no overhead, because every generation appears as a
> complete tree. Diff files and diff file lists (with diff files, a diff
> is taken of each individual file and only the changes are stored; a
> diff file list stores only the files that have changed) require an
> algorithm that recurses over the directories holding the real data and
> overlays each backup on the previous one.
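> 
> To make that concrete, here is a rough sketch (in Python, with made-up
> names; not actual backuppc code) of the overlay lookup a diff-file-list
> scheme has to do for every read:
> 
>     import os
> 
>     def resolve(pool_root, generations, path):
>         # Each generation directory stores only the files that changed
>         # since the previous one, so walk backward from the newest
>         # generation until we find the most recent copy of `path`.
>         for gen in reversed(generations):
>             candidate = os.path.join(pool_root, gen, path)
>             if os.path.exists(candidate):
>                 return candidate
>         return None  # file was never backed up
> 
>     # e.g. resolve("/backups/host1", ["2009-08-15", "2009-08-16"],
>     #              "etc/passwd")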
>
> The only option that is more efficient than hardlinks would really be
> storing files in SQL along with an MD5 of each, then linking rows in
> SQL. Very similar to a hardlink, but instead it's just a row pointer.
> This would be many times faster than doing hardlinks in a filesystem,
> because an SQL select can narrow the hierarchy down to just the
> significant rows. It would be like backuppc only having one host with
> one backup on it when you are looking at the web interface: all the
> other hosts, backups, etc. are already excluded.
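> 
> As a sketch, the row-pointer idea might look something like this as a
> schema (done with sqlite here just to keep the example self-contained;
> the idea is the same in mysql, and all the names are hypothetical):
> 
>     import sqlite3
> 
>     con = sqlite3.connect("backups.db")
>     con.executescript("""
>     -- One row per unique file body: the "hardlink" target.
>     CREATE TABLE IF NOT EXISTS blobs (
>         md5  TEXT PRIMARY KEY,
>         data BLOB NOT NULL
>     );
>     -- One row per (host, backup, path): a cheap row pointer to a blob.
>     CREATE TABLE IF NOT EXISTS entries (
>         host   TEXT NOT NULL,
>         backup TEXT NOT NULL,
>         path   TEXT NOT NULL,
>         md5    TEXT NOT NULL REFERENCES blobs(md5),
>         PRIMARY KEY (host, backup, path)
>     );
>     """)
>     # Listing one backup of one host never touches any other rows:
>     # SELECT path FROM entries WHERE host = ? AND backup = ?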
>
> SQL file storage for backuppc has been discussed extensively on this
> list, and suffice it to say that opinions are very split, and for good
> reason. SQL (MySQL specifically, but this applies to all of them) is
> much, much better at some tasks than a traditional filesystem
> (searching for data is orders of magnitude faster), but a filesystem
> is also much, much better at simply storing files. Some hybrid could
> take the pros of each, such as storing all of the pointer data in
> MySQL and storing the actual files under their MD5 names on a
> filesystem: simply MD5 a file, push the MD5 off to MySQL with the
> host, backup date, filename, and file path, and write the file to the
> filesystem. Incremental backups would MD5 a file and search the
> database for that MD5; if found, write a pointer to the existing
> entry, and if not, write a new entry with the MD5 of the file, the
> hostname, the file path and name, and the backup number (or date).
> All the files would just be stored under their MD5 names. Recovering
> files would be less transparent, but would only require an SQL query
> to pull the list of files for a given hostname and backup number and
> then pull those files, renamed, into a zip or tar file.
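> 
> A rough sketch of that hybrid (again Python with sqlite standing in
> for mysql, and hypothetical paths; an illustration of the idea, not
> working backuppc code):
> 
>     import hashlib, os, shutil, sqlite3, tarfile
> 
>     POOL = "/var/backups/pool"  # bodies stored under their MD5 names
>     con = sqlite3.connect("/var/backups/index.db")
>     con.execute("""CREATE TABLE IF NOT EXISTS entries (
>         host TEXT, backup TEXT, path TEXT, md5 TEXT,
>         PRIMARY KEY (host, backup, path))""")
> 
>     def store(host, backup, src, rel_path):
>         # Incremental step: MD5 the file; if the body is new, copy it
>         # into the pool once; either way record a pointer row.
>         h = hashlib.md5()
>         with open(src, "rb") as f:
>             for chunk in iter(lambda: f.read(1 << 16), b""):
>                 h.update(chunk)
>         md5 = h.hexdigest()
>         pooled = os.path.join(POOL, md5)
>         if not os.path.exists(pooled):
>             shutil.copyfile(src, pooled)
>         con.execute("INSERT OR REPLACE INTO entries VALUES (?,?,?,?)",
>                     (host, backup, rel_path, md5))
>         con.commit()
> 
>     def restore(host, backup, out_tar):
>         # Recovery: pull the file list for one backup and re-tar the
>         # pooled bodies under their original names.
>         with tarfile.open(out_tar, "w:gz") as tar:
>             for path, md5 in con.execute(
>                     "SELECT path, md5 FROM entries "
>                     "WHERE host=? AND backup=?", (host, backup)):
>                 tar.add(os.path.join(POOL, md5), arcname=path)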
>
That is exactly the hybrid that I have been advocating... But as you
mentioned, some like it and some don't...
>
>
> On Mon, Aug 17, 2009 at 5:52 AM, David <wizzardx AT gmail DOT com> wrote:
>
> > Hi there.
> >
> > Firstly, this isn't a backuppc-specific question, but it is of
> > relevance to backuppc users (due to backuppc's architecture), so there
> > might be people here with insight on the subject (or maybe someone can
> > point me to a more relevant project or mailing list).
> >
> > My problem is as follows... with backup systems based on complete
> > hardlink-based snapshots, you often end up with a very large number
> > of hardlinks: at least one per server file, per backup generation.
> >
> > Now, this is fine most of the time... but there is a problem case that
> > comes up because of this.
> >
> > If the servers you're backing up themselves have a huge number of
> > files (hundreds of thousands, or even millions), you end up making a
> > huge number of hardlinks on your backup server for each backup
> > generation.
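> >
> > (For anyone unfamiliar: each new generation is created by hardlinking
> > every unchanged file from the previous one, the same effect as
> > 'cp -al' or rsync's --link-dest. Roughly, in Python:
> >
> >     import os
> >
> >     def snapshot(prev_gen, new_gen):
> >         # Recreate the directory tree, hardlinking every file from
> >         # the previous generation; changed files are re-copied over
> >         # afterwards (e.g. by rsync). Every file costs one new
> >         # directory entry per generation.
> >         for dirpath, dirnames, filenames in os.walk(prev_gen):
> >             rel = os.path.relpath(dirpath, prev_gen)
> >             os.makedirs(os.path.join(new_gen, rel), exist_ok=True)
> >             for name in filenames:
> >                 os.link(os.path.join(dirpath, name),
> >                         os.path.join(new_gen, rel, name))
> >
> > That per-file cost is the whole problem described below.)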
> >
> > Although inefficient in some ways (each generation adds a full
> > tree's worth of directory entries), this can work pretty nicely.
> >
> > Where the real problem comes in is if admins want to use 'updatedb'
> > or 'du' on the linux system. updatedb builds a *huge* database and
> > uses up tonnes of cpu & ram (so I usually disable it). And 'du' can
> > take days to run and produce multi-GB output files.
> >
> > Here's a question for backuppc users (and people who use hardlink
> > snapshot-based backups in general)... when your backup server, which
> > has millions of hardlinks on it, is running low on space, how do you
> > correct this?
> >
> > The most obvious thing is to find which host's backups are taking up
> > the most space, and then remove some of the older generations.
> >
> > Normally the simplest way to do this is to run a tool like 'du', and
> > then perhaps view the output in xdiskusage. (One interesting thing
> > about 'du' is that it's clever about hardlinks, so it doesn't count
> > the disk usage twice. I think it must keep an in-memory table of
> > visited inodes that have a link count of 2 or greater.)
> >
> > However, with a gazillion hardlinks, du takes forever to run and
> > produces massive output. In my case, about 3-4 days and a 4-5 GB
> > output file.
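> >
> > (What du has to do is essentially this, which is also why it eats so
> > much ram here: the table of seen inodes grows with the number of
> > multiply-linked files. A rough Python sketch:
> >
> >     import os
> >
> >     def du(root):
> >         # Sum disk usage roughly the way du does: count each
> >         # multiply-linked inode only the first time it is seen.
> >         seen = set()  # (device, inode) pairs already counted
> >         total = 0
> >         for dirpath, dirnames, filenames in os.walk(root):
> >             for name in filenames:
> >                 st = os.lstat(os.path.join(dirpath, name))
> >                 if st.st_nlink > 1:
> >                     key = (st.st_dev, st.st_ino)
> >                     if key in seen:
> >                         continue  # hardlink to an already-counted file
> >                     seen.add(key)
> >                 total += st.st_blocks * 512  # blocks are 512 bytes
> >         return total
> >
> > With millions of links, that set alone gets huge.)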
> >
> > My current setup is a basic hardlink snapshot-based backup scheme,
> > but backuppc (due to its pool structure, where hosts have generations
> > of hardlink snapshot dirs) would have the same problems.
> >
> > How do people solve the above problem?
> >
> > (I also imagine that running "du" to check disk usage of backuppc
> > data is complicated by the backuppc pool, but at least you can
> > exclude the pool from the "du" scan to get more usable results.)
> >
> > My current fix is an ugly hack: I go through my snapshot backup
> > generations (from oldest to newest) and remove all redundant
> > hardlinks (i.e. ones that point to the same inodes as the same paths
> > in the next-most-recent generation). The removed names go into a
> > compressed text file that could be restored from later. Then I
> > compare the next two most-recent generations, and so on.
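> >
> > The hack is roughly this in Python (paths and names made up):
> >
> >     import gzip, os
> >
> >     def prune(older_gen, newer_gen, log_path):
> >         # Remove files in the older generation that are hardlinks to
> >         # the same inode as the same path in the next-most-recent
> >         # generation, logging the removed names so they could be
> >         # re-linked on restore.
> >         with gzip.open(log_path, "wt") as log:
> >             for dirpath, dirnames, filenames in os.walk(older_gen):
> >                 rel = os.path.relpath(dirpath, older_gen)
> >                 for name in filenames:
> >                     old = os.path.join(dirpath, name)
> >                     new = os.path.join(newer_gen, rel, name)
> >                     try:
> >                         if os.lstat(old).st_ino == os.lstat(new).st_ino:
> >                             log.write(os.path.join(rel, name) + "\n")
> >                             os.unlink(old)  # newer gen keeps the data
> >                     except FileNotFoundError:
> >                         pass  # path absent from the newer generation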
> >
> > But yeah, that's a very ugly hack... I want to do it better and not
> > re-invent the wheel. I'm sure this kind of problem has been solved
> > before.
> >
> > fwiw, I was using rdiff-backup before. It's very du-friendly, since
> > only the differences between backup generations are stored (rather
> > than a large number of hardlinks). But I had to stop using it,
> > because on servers with a huge number of files it uses a huge amount
> > of memory and cpu, and takes a really long time. The mailing list
> > wasn't very helpful with trying to fix this, so I had to change to
> > something new in order to keep running backups (with history).
> > That's when I changed over to the hardlink snapshot approach, but
> > that has the other problems detailed above. And my current hack
> > (removing all redundant hardlinks and empty dir structures) is kind
> > of similar to rdiff-backup, but coming from the other direction.
> >
> > Thanks in advance for ideas and advice.
> >
> > David.