Bacula-users

Re: [Bacula-users] Issue with Network error on channel and speed...

2008-05-19 16:21:50
Subject: Re: [Bacula-users] Issue with Network error on channel and speed...
From: Arno Lehmann <al AT its-lehmann DOT de>
To: undisclosed-recipients: ;
Date: Mon, 19 May 2008 22:21:20 +0200

Hi,

18.05.2008 03:19, Javier Gomez wrote:
>     We have a bacula 2.2.8 server environment running on a Fedora 8 
> server.  It has 2 core processors.  2 gigs of memory.  We are currently 
> backing up about 200 servers ranging in size from 10 gigs up to about 
> 600 gigs of used space.  We do not use tapes at our facility.  All 
> backups are performed to File devices.  In general Bacula has proven to 
> be much more stable than any other solution we have used (thank you).

Up till here: A robust backup solution.

> We allow about 35 to 45 backups to run concurrently each night.

This is quite a bit.

>  We have 
> an offsite backup location running the Bacula environment with a 150 Meg 
> point to point connection to our main facility where all of the 
> production servers are located.  We have tested our lines and we do not 
> seem to be maxing out the 150 meg fiber connection.

Just to make sure - this is 150 Mbit/s, right?

>  The connection 
> seems fairly stable (losing a single ping packet every once in a while, 
> otherwise it's within an average of 5 ms from point to point).  We use a 
> Cisco ASA 5520 and a few Cisco switches between the two points for 
> communication (all new equipment).  My issue is that we have what seems 
> like very slow backups (averaging 200 K bytes/second to the max of 
> around 2.5 M/second), but in general all of the servers are sitting 
> around the 500 K bytes/second.  I seem to get this same speed if I am 
> running 40 backups concurrently or just one, so the speed does not seem 
> to be based on the volume across the WAN connection.

Then this looks like the limit of what your Bacula installation 
currently handles - and that is really slow.
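
As a rough sanity check on the figures quoted above, here is a small sketch. It assumes the link really is 150 Mbit/s and that "K bytes" means 1000 bytes; neither is confirmed in the thread, and the variable names are only illustrative.

```python
# Back-of-the-envelope check of the figures from the original post.
# Assumptions (not confirmed in the thread): the link is 150 Mbit/s,
# and 1 K byte = 1000 bytes.

LINK_MBIT_PER_S = 150
PER_JOB_KBYTES_PER_S = 500      # typical per-job rate reported
CONCURRENT_JOBS = 40            # within the 35 to 45 jobs mentioned

link_bytes_per_s = LINK_MBIT_PER_S * 1_000_000 / 8
one_job_bytes_per_s = PER_JOB_KBYTES_PER_S * 1000
all_jobs_bytes_per_s = one_job_bytes_per_s * CONCURRENT_JOBS

print(f"link capacity : {link_bytes_per_s / 1e6:.2f} MB/s")
print(f"one job       : {one_job_bytes_per_s / 1e6:.2f} MB/s "
      f"({100 * one_job_bytes_per_s / link_bytes_per_s:.1f}% of the link)")
print(f"{CONCURRENT_JOBS} jobs       : {all_jobs_bytes_per_s / 1e6:.2f} MB/s "
      f"({100 * all_jobs_bytes_per_s / link_bytes_per_s:.1f}% of the link)")
```

Under these assumptions a single job uses only a few percent of the link, so a lone job running at that rate is not limited by the line itself. It may also be worth checking whether all concurrent jobs really sustain that rate at the same time, since together they would add up to roughly the full line rate.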

>     Then to make matters worse we seem to get a number of the following 
> network errors noted below each night.

Ok, this is probably a different problem.

>  We have network monitoring 
> software watching the data lines and the Cisco equipment on both ends 
> and we don't see any network issues (none that are obvious).  We have 
> had many situations where a number of backups will fail with this same 
> error within the same 3 seconds, which would make me think there was a 
> network connection issue to the backup server.  But at the same time 
> that those backups failed, another 15 were still actively running and 
> completed just fine.  That made me think it was something with the 
> Bacula SD locking it from time to time, but I have not seen any 
> references to any issues.  The failed backup will work if we rerun the 
> backup, so it's not a basic configuration issue.  I have set up the 
> Heartbeat in the SD and the FD configurations to 300 (That helped to 
> deal with the 2 hour timeout issues with most routers), but nothing 
> seems to clean up the nightly errors we get like the one below.
> 
> ------------------------
> 17-May 14:56 bacula001 JobId 8883: Fatal error: Network error with FD 
> during Backup: ERR=Connection reset by peer
> 17-May 14:56 bacula001 JobId 8883: Job ServerA 7.2008-05-16_21.05.34 
> marked to be canceled.
> 17-May 14:56 bacula001 JobId 8883: Fatal error: append.c:259 Network 
> error on data channel. ERR=Connection reset by peer
> 17-May 14:56 bacula001 JobId 8883: Job write elapsed time = 17:45:38, 
> Transfer rate = 374.9 K bytes/second
> 17-May 14:56 bacula001 JobId 8883: Fatal error: No Job status returned 
> from FD.

Ok, this is probably an FD issue.

Which OS do you run on the affected clients? If they all run the same 
OS, you should start looking there (experience shows that Windows 
network drivers for certain hardware can be a bit... tricky. The 
NVidia ones especially...).

> ------------------------
> 
>     Does anyone have any ideas on what I can do to help prevent these 
> types of network errors as well as improve speed?  Or is there any more 
> debugging type settings that I can set which would help me to track 
> these issues down.

Best start looking at these two things as separate issues.

Regarding the network problems, try to verify whether they affect one 
OS only - or perhaps only one patch level, network driver version, or 
type of network hardware. It might have to do with local firewalling, too.

The speed issue is a bit more complicated to start with, I believe.

First try to see what speed you achieve if you only run one full 
backup, and if speed degrades seriously if you run an incremental. 
That alone could explain a lot - incremental backups are more or less 
limited by the I/O performance of the client machines.

Check how much time the clients spend waiting for I/O (vmstat under 
Linux, for example) and what the DIR and SD do during backups. The 
catalog database can limit the speed, too. Assuming you have the 
catalog database on the DIR machine, running top there would show you 
what the database is doing during a backup.
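
If vmstat is not at hand, the same I/O-wait figure can be approximated on a Linux client from the aggregate "cpu" line of /proc/stat. The following is only a sketch: the helper names are made up for illustration, and the two sample lines are invented numbers, not measurements from the systems discussed here.

```python
# Sketch: estimate the share of CPU time spent waiting for I/O on a Linux
# client from two samples of the aggregate "cpu" line in /proc/stat.
# All function names here are illustrative, not part of any Bacula tool.

def parse_cpu_line(line):
    """Return the numeric counters of a 'cpu ...' line from /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]

def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    # Counter order: user nice system idle iowait ... -> iowait is index 4.
    # Guest time may be counted twice in the total, so this is approximate.
    return 100.0 * deltas[4] / total if total else 0.0

def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()

# Invented sample lines, standing in for two reads taken during a backup.
sample_before = "cpu  100 0 50 800 50 0 0 0"
sample_after = "cpu  150 0 70 1000 180 0 0 0"
print(f"iowait: {iowait_percent(sample_before, sample_after):.1f}%")
```

On a real client, one would call read_cpu_line() twice a few seconds apart while a backup is running and pass both results to iowait_percent(). A consistently high value suggests the client's disks, not the network, are holding the job back.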

If neither client I/O, database throughput, nor network performance is 
easily spotted as the bottleneck, you should give us a bit more 
information from simpler test cases:
Start with throughput measurements on the SD machine (preferably 
using btape). If that is much better than what your backups achieve, 
look at the network link (try different network buffer sizes, and 
measure the network throughput without Bacula, for example using dd 
through netcat).
Again, if that doesn't seem to be the problem, run a test setup on the 
DIR machine only, backing up from local disk, to local disk, and see 
if that goes faster.
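
To illustrate the "network throughput without Bacula" measurement, here is a minimal sketch in the spirit of dd through netcat. It is not a replacement for those tools: both ends run on localhost purely for demonstration, the function names are made up, and in a real test the receiving side would run on the SD host and the sending side on a client.

```python
# Sketch: measure raw TCP throughput without Bacula, similar in spirit to
# piping dd through netcat. Both ends run on localhost here for illustration.
import socket
import threading
import time

CHUNK = 64 * 1024

def receive_all(server_sock, result):
    """Accept one connection and count the bytes received until EOF."""
    conn, _ = server_sock.accept()
    with conn:
        total = 0
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    result["bytes"] = total

def measure_throughput(total_bytes, host="127.0.0.1"):
    """Send total_bytes of zeros over TCP; return (bytes_received, bytes/s)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))          # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()

    payload = b"\0" * CHUNK
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()                 # wait until the receiver has seen EOF
    elapsed = time.perf_counter() - start
    server.close()
    return result["bytes"], result["bytes"] / elapsed

received, rate = measure_throughput(16 * 1024 * 1024)
print(f"received {received} bytes at {rate / 1e6:.1f} MB/s")
```

Comparing the rate measured this way across the real link with the rate Bacula reports for a job shows whether the network or something else is the limiting factor.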

I hope this gives you a new starting point...

Arno

>              Thanks for any type of help that can be given...
>                       Javier
> 

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de
