Subject: Re: [BackupPC-users] Linux backups with rsync vs tar
From: Timothy J Massey <tmassey AT obscorp DOT com>
To: backuppc-users AT lists.sourceforge DOT net
Date: Fri, 2 Sep 2011 10:43:37 -0400
charlesboyo <backuppc-forum AT backupcentral DOT com> wrote on 08/31/2011 05:53:43 AM:

> I'm using BackupPC to take daily backups of a maildir totaling 250
> GB with average file sizes of 500 MB (text mailboxes; these files
> change every day).
> Currently, my setup takes full backups once a week and incremental
> backups every day between the full backups. The servers are directly
> connected with a crossover cable, allowing 100 Mbps.


I have a very similar setup with several servers.  They are often connected at 100Mb/s simply because the clients haven't upgraded to Gb switches.  Also, they back up IBM Lotus Domino servers.  In Domino, each mail user has their own mail database, which is typically gigabytes in size (except with this thing called DAOS, but even then they're still hundreds of MB).  This is pretty comparable to your environment, though my *total* size is not usually 250GB of just mail data...  I have file servers that are bigger, but not mail servers.

(I have some servers that back up Microsoft Exchange servers.  This is even worse:  one monolithic file for the *ENTIRE* mailstore.  U G L Y...  And incrementals *ARE* fulls!  :) )

> However, these backups take about 8 hours to complete, averaging 8
> Mbps, and the BackupPC server is CPU-bound throughout the entire
> process.


Fulls or incrementals or both?  If truly 90% of your files are changing daily, I'm going to assume both.  There will be *very* little difference between a full backup and an incremental.

> Thus I have reason to suspect the rsync overhead as being guilty.
> Note that I have disabled hard links, implemented checksum caching,
> increased the block size to 512 KB and enabled --whole-file, to no avail.


I have done zero tuning of the rsync command:  I use the 100% stock BackupPC command line for it.

> 1. since over 90% of the files change every day and "incremental"
> backups involve transferring the whole file to the BackupPC server,
> won't it make better sense to just run a full backup everyday?


Incremental backups do end up with a whole new file on the server, but rsync does not get there by transferring the whole file:  the rsync protocol sends just the changed parts.  HOWEVER, the whole file is read on *BOTH* ends of the connection, so it doesn't save you a *BIT* of disk I/O:  it only saves you NETWORK I/O.  Seeing as you have only 100Mb/s between them, that will improve performance, but not dramatically, and as you have found, it exacts a CPU hit in order to do so.
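
To make that concrete, here is a minimal Python sketch of the delta-transfer idea (purely illustrative:  this is not rsync's actual code, and real rsync adds a cheap rolling checksum so it can test for a match at every byte offset, which I skip here):

    import hashlib

    BLOCK = 512 * 1024  # 512 KB, the block size you mention

    def block_signatures(old: bytes) -> dict:
        """Receiver side: hash every block of its existing copy.
        It must read the entire file to do this."""
        return {hashlib.md5(old[i:i + BLOCK]).hexdigest(): i
                for i in range(0, len(old), BLOCK)}

    def delta(new: bytes, sigs: dict) -> list:
        """Sender side: scan its entire copy.  Blocks the receiver
        already has become tiny 'copy' references; everything else
        is sent as literal bytes."""
        out, literal, i = [], bytearray(), 0
        while i < len(new):
            digest = hashlib.md5(new[i:i + BLOCK]).hexdigest()
            if digest in sigs:
                if literal:
                    out.append(("literal", bytes(literal)))
                    literal = bytearray()
                out.append(("copy", sigs[digest]))  # no file data crosses the wire
                i += BLOCK
            else:
                literal.append(new[i])  # changed data: must be transmitted
                i += 1
        if literal:
            out.append(("literal", bytes(literal)))
        return out

Notice that both functions touch every byte of their copy of the file:  that is the disk and CPU cost.  Only the "literal" chunks consume network bandwidth.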

You may find that trading CPU for network performance is not a good trade in your case.  Having said that, I run BackupPC on about the slowest systems you can actually buy new today:  VIA EPIA EN 1500 system boards with 512MB RAM.  Terrible performance, but they meet my BackupPC needs just *fine*.

Hard numbers on the nearest Domino server to me:  60GB total backed up for full, 18GB for incremental (this is a DAOS server).  Fulls take about 150 minutes, incrementals take about 40.  1/4 the data, 1/4 the time.  And that's on the miserable hardware I described.

Scaling that up to your sizes, a full would take about 600 minutes, or 10 hours.  So, the 8 hours that you're seeing sounds reasonable.
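
The back-of-the-envelope math, in case you want to plug in your own numbers (this assumes your throughput is similar to that Domino box, which is a big assumption):

    # observed on my hardware: 60 GB full backup in about 150 minutes
    full_gb, full_min = 60, 150
    target_gb = 250
    est_min = full_min * target_gb / full_gb
    print(f"~{est_min:.0f} minutes, or about {est_min / 60:.1f} hours")  # ~625 min, ~10.4 h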

The number one question I have is:  is this really a problem?  If you have a backup window that allows this, I would not worry about it.  If you do *not*, then rsync might not be for you.

To address a couple of things said in other replies:

1) Trying to avoid building a file list is pointless.  It takes my servers just a couple of minutes.  It certainly may use RAM, but that is only an issue if you have millions of files.  And in that case, simply add more RAM.  I'm a glutton for punishment running with 512MB of RAM (and actually, I use 2GB in new servers now:  I just like to twist Les' tail!  :) ).
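
To put rough numbers on the RAM point:  the rsync FAQ has long cited roughly 100 bytes per file in the transfer (treat that as an approximation; BackupPC's own rsync implementation will differ, and rsync 3.x's incremental file lists lower it):

    # rough file-list memory estimate, assuming ~100 bytes per file
    bytes_per_file = 100
    for n_files in (100_000, 1_000_000, 5_000_000):
        mb = n_files * bytes_per_file / 2**20
        print(f"{n_files:>9,} files -> ~{mb:,.0f} MB of RAM")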

2) Les' point about the format of the files (one monolithic file for each mailbox vs. one file per e-mail) is dead on.  That allows 99% of the files to remain untouched once they're backed up *once*.  That will *vastly* reduce the backup times.  (That DAOS thing does a similar thing for Domino by breaking out attachments into individual files, and hashing and pooling them in a manner very similar to a BackupPC pool, BTW.  Before DAOS, my fulls and incrementals were indistinguishable; now they're 4:1 size-wise.  Plus a 50% reduction in total disk usage.  But I digress.)

However, be aware that you then trade the "my backups take a long time and don't pool" problem for a "now I have to manage several *MILLION* files!" problem.  fsck can become a major issue in that case:  with 250GB of e-mail, even ls can be painful!  Both have advantages and disadvantages.  Just be aware that it's not a clear win either way.

And you might not have a choice, making the argument moot.
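
By the way, if you haven't seen the hashing-and-pooling trick before, the core idea is tiny.  A simplified sketch (not BackupPC's or DAOS's actual on-disk layout, which also handles compression, hash collisions, and cleanup):

    import hashlib, os

    def pool_file(path: str, pool_dir: str) -> str:
        """Keep one copy of each unique file content; hard-link duplicates."""
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        pooled = os.path.join(pool_dir, digest)
        if os.path.exists(pooled):
            os.unlink(path)          # duplicate content: drop this copy...
            os.link(pooled, path)    # ...and hard-link to the pooled one
        else:
            os.link(path, pooled)    # first occurrence: add it to the pool
        return pooled

Ten identical 500 MB mailboxes cost 500 MB of disk instead of 5 GB.  But when 90% of your files change every day, almost nothing ever pools.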


Now, for tar.  Take my information with a grain of salt:  I have *never* run tar with BackupPC...

> 2. from Pavel's questions, he observed that BackupPC is unable to
> recover from interrupted tar transfer. Such interruptions simply
> cannot happen in my case. Should I switch to tar?


Is that a trick question?  "This cannot happen.  Should I do this?"  Umm.  No -- GIVEN the conditions you yourself set.  :)

http://en.wikipedia.org/wiki/Tautology_%28logic%29

> And in the
> unlikely event that the transfer does get interrupted, what
> mechanisms do I need to implement to resume/recover from the failure?


To repeat another response:  restart the backup...

> 3. What is the recommended process for switching from rsync to tar -
> since the format/attributes are reportedly incompatible? I would
> like to preserve existing compressed backups as much as possible.


Your old backups should be 100% fine:  they will remain in the pool, etc.  I do not believe that files transferred by rsync will pool with files transferred by tar (due to the attribute issue you mention); however, for you that's a moot point:  90% of your files don't pool anyway.

As an aside, BackupPC (well, the pooling) buys you virtually *nothing* in your application.  With a fast enough network connection, rsync buys almost *nothing*, either.  You are using two tools that have very distinct advantages, but you're using them in an environment that largely negates those advantages.

This is not a *bad* thing.  Every single one of my backup servers is based on BackupPC, and all but maybe two shares are backed up using rsync.  (The only exceptions I can think of are where I'm backing up data on a NAS, and I can't or won't run rsyncd on the NAS, so I have to use SMB.)  Whether it's an advantage or a disadvantage, that's the setup I use.  I vastly prefer consistency over performance.  But I can live with 8-hour backup windows.

If you can't, then you may have to make different decisions.  That's the fun of being the Administrator! :)

Timothy J. Massey
 
Out of the Box Solutions, Inc.
Creative IT Solutions Made Simple!

http://www.OutOfTheBoxSolutions.com
tmassey AT obscorp DOT com
22108 Harper Ave.
St. Clair Shores, MI 48080
Office: (800)750-4OBS (4627)
Cell: (586)945-8796
