Hi,
18.05.2008 03:19, Javier Gomez wrote:
> We have a bacula 2.2.8 server environment running on a Fedora 8
> server. It has 2 core processors. 2 gigs of memory. We are currently
> backing up about 200 servers ranging in size from 10 gigs up to about
> 600 gigs of used space. We do not use tapes at our facility. All
> backups are performed to File devices. In general Bacula has proven to
> be much more stable than any other solution we have used (thank you).
So far, so good: a robust backup solution.
> We allow about 35 to 45 backups to run concurrently each night.
This is quite a bit.
> We have
> an offsite backup location running the Bacula environment with a 150 Meg
> point to point connection to our main facility where all of the
> production servers are located. We have tested our lines and we do not
> seem to be maxing out the 150 meg fiber connection.
Just to make sure - this is 150 Mbit/s, right?
> The connection
> seems fairly stable (losing a single ping packet every once in a while,
> otherwise it's within an average of 5 ms from point to point). We use a
> Cisco ASA 5520 and a few Cisco switches between the two points for
> communication (all new equipment). My issue is that we have what seems
> like very slow backups (averaging 200 K bytes/second to the max of
> around 2.5 M/second), but in general all of the servers are sitting
> around the 500 K bytes/second. I seem to get this same speed if I am
> running 40 backups concurrently or just one, so the speed does not seem
> to be based on the volume across the WAN connection.
Then this looks like a limit in your Bacula installation rather than
in the WAN link - and it is a really low one.
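
For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes the link really is 150 Mbit/s, uses decimal units, and ignores
protocol overhead; the function name is made up for illustration.

```python
# Rough arithmetic only; assumes 150 Mbit/s and 1 Mbit = 10**6 bits.
LINK_MBIT_PER_S = 150

# Raw link capacity in bytes per second.
link_bytes_per_s = LINK_MBIT_PER_S * 1_000_000 / 8


def link_share(rate_kb_per_s, jobs=1):
    """Fraction of the raw link capacity used by `jobs` streams at this rate."""
    return jobs * rate_kb_per_s * 1000 / link_bytes_per_s


print(f"link capacity: {link_bytes_per_s / 1e6:.2f} MB/s")
print(f"one job at 500 KB/s: {link_share(500):.1%} of the link")
print(f"40 jobs at 500 KB/s: {link_share(500, jobs=40):.1%} of the link")
```

Under these assumptions a single job at 500 KB/s uses only a few
percent of the link, while 40 jobs at that rate would add up to about
the raw capacity. So the single-job case clearly is not limited by
bandwidth, but the concurrent case is worth checking again against
your link monitoring.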
> Then to make matters worse we seem to get a number of the following
> network errors noted below each night.
Ok, this is probably a different problem.
> We have network monitoring
> software watching the data lines and the Cisco equipment on both ends
> and we don't see any network issues (none that are obvious). We have
> had many situations where a number of backups will fail with this same
> error within the same 3 seconds which would make me think there was a
> network connection issue to the backup server, but at the same time
> that those backups failed, another 15 were still actively running and
> completed just fine. That made me think it was something with the
> Bacula SD locking it from time to time, but I have not seen any
> references to any issues. The failed backup will work if we rerun the
> backup so it's not a basic configuration issue. I have set up the
> Heartbeat in the SD and the FD configurations to 300 (That helped to
> deal with the 2 hour timeout issues with most routers), but nothing
> seems to clean up the nightly errors we get like the one below.
>
> ------------------------
> 17-May 14:56 bacula001 JobId 8883: Fatal error: Network error with FD
> during Backup: ERR=Connection reset by peer
> 17-May 14:56 bacula001 JobId 8883: Job ServerA 7.2008-05-16_21.05.34
> marked to be canceled.
> 17-May 14:56 bacula001 JobId 8883: Fatal error: append.c:259 Network
> error on data channel. ERR=Connection reset by peer
> 17-May 14:56 bacula001 JobId 8883: Job write elapsed time = 17:45:38,
> Transfer rate = 374.9 K bytes/second
> 17-May 14:56 bacula001 JobId 8883: Fatal error: No Job status returned
> from FD.
Ok, this is probably an FD issue.
Which OS do you run on the affected clients? If they are all running
the same you should start looking there (experience shows that Windows
network drivers for certain hardware can be a bit... tricky. The
NVidia ones especially...)
> ------------------------
>
> Does anyone have any ideas on what I can do to help prevent these
> types of network errors as well as improve speed? Or is there any more
> debugging type settings that I can set which would help me to track
> these issues down.
It is best to treat these two things as separate issues.
Regarding the network problems, try to verify whether they affect one
OS only - perhaps only one patch level, network driver version, or type
of network hardware. It might have to do with local firewalling, too.
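
One way to look for such a pattern is to pull the failed jobs out of
the job log and count them per client property. This is a minimal
sketch, assuming log lines shaped like the ones quoted above (with the
wrapped lines joined again). The job-to-OS mapping is hypothetical and
would have to come from your own inventory.

```python
import re
from collections import Counter

# Matches lines such as:
# "17-May 14:56 bacula001 JobId 8883: Fatal error: No Job status returned from FD."
FATAL_RE = re.compile(
    r"^(?P<day>\d{2}-\w{3}) (?P<time>\d{2}:\d{2}) \S+ "
    r"JobId (?P<jobid>\d+): Fatal error: (?P<msg>.*)$"
)


def failed_jobs(lines):
    """Return {jobid: (day, time)} for the first fatal error of each job."""
    failed = {}
    for line in lines:
        m = FATAL_RE.match(line.strip())
        if m:
            failed.setdefault(m["jobid"], (m["day"], m["time"]))
    return failed


def failures_by_group(failed, job_to_group):
    """Count failed jobs per group (for example client OS)."""
    return Counter(job_to_group.get(jobid, "unknown") for jobid in failed)


sample = [
    "17-May 14:56 bacula001 JobId 8883: Fatal error: Network error with FD "
    "during Backup: ERR=Connection reset by peer",
    "17-May 14:56 bacula001 JobId 8883: Job write elapsed time = 17:45:38, "
    "Transfer rate = 374.9 K bytes/second",
    "17-May 14:56 bacula001 JobId 8883: Fatal error: No Job status returned from FD.",
]

# Hypothetical mapping; "example-os" is a placeholder, not a real finding.
job_os = {"8883": "example-os"}

failed = failed_jobs(sample)
print(failed)
print(failures_by_group(failed, job_os))
```

If nearly all failures land in one group, that is where to look first.
If they spread evenly, the clients are less likely to be the cause.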
The speed issue is a bit more complicated to start with, I believe.
First try to see what speed you achieve if you only run one full
backup, and whether the speed degrades seriously when you run an incremental.
That alone could explain a lot - incremental backups are more or less
limited by the I/O performance of the client machines.
Check the time the clients spend waiting for I/O (vmstat under Linux,
for example) and what the DIR and SD do during backups. The catalog
database can limit the speed, too. Assuming you have the catalog
database on the DIR machine, running top would show you what it does
during backup.
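
If you would rather log the I/O wait over a whole backup window than
watch vmstat by hand, something like the following could serve as a
starting point. It is only a sketch for Linux clients: it assumes the
usual field order of the aggregate "cpu" line in /proc/stat (user,
nice, system, idle, iowait, irq, softirq, steal).

```python
import os
import time


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat 'cpu' lines."""
    # Use at most the first 8 counters; later ones (guest) are already in 'user'.
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    # Index 4 is iowait in the assumed field order.
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1)  # sampling interval; use a longer one during a real backup
    second = read_cpu_line()
    print(f"iowait: {iowait_percent(first, second):.1f}%")
```

A client that shows a high iowait share while its backup runs is
probably limited by its own disks rather than by the network or the SD.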
If none of client I/O, database throughput, or network performance is
easily spotted as the bottleneck, you should give us a bit more
information from simpler test cases:
Start with throughput measurements on the SD machine (preferably
using btape). If that is much better than what you achieve, look at
the network link (try different network buffer sizes, and measure the
network throughput without bacula, for example using dd through netcat).
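
If dd and netcat are not at hand on every machine, the same kind of
raw TCP measurement can be sketched in a few lines of Python. This is
an illustration only, not anything Bacula provides: the function names
are made up, and the demo runs over loopback so it is self-contained.
For a real test you would run the receiving part on the SD machine and
the sending part on a client across the WAN.

```python
import socket
import threading
import time


def receive_all(server_sock, result):
    """Accept one connection and count bytes until the peer closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def send_test(host, port, total_bytes, sndbuf=None):
    """Send total_bytes of zeros to host:port; return the elapsed seconds."""
    payload = bytes(65536)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if sndbuf is not None:
            # Try different send buffer sizes to see whether they matter.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
        s.connect((host, port))
        start = time.monotonic()
        sent = 0
        while sent < total_bytes:
            n = min(len(payload), total_bytes - sent)
            s.sendall(payload[:n])
            sent += n
    return time.monotonic() - start


def loopback_demo(total_bytes=8 * 1024 * 1024):
    """Run sender and receiver on 127.0.0.1; return (bytes_received, seconds)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()
    elapsed = send_test("127.0.0.1", port, total_bytes)
    receiver.join()
    server.close()
    return result["bytes"], elapsed


if __name__ == "__main__":
    received, seconds = loopback_demo()
    if seconds > 0:
        print(f"{received / seconds / 1e6:.1f} MB/s over loopback")
```

The time is taken on the sending side, so treat the figure as a rough
estimate. If a single stream across the WAN also stays near
500 KB/s, the problem is in the network path and not in Bacula.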
Again, if that doesn't seem to be the problem, run a test setup on the
DIR machine only, backing up from local disk, to local disk, and see
if that goes faster.
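
For a baseline to compare that local test against, it can help to know
roughly how fast the disk accepts sequential writes at all. The sketch
below is a crude stand-in for a dd-style write test. The function name
and the sizes are arbitrary choices, and the result depends heavily on
caching and the file system.

```python
import os
import tempfile
import time


def write_throughput(directory, total_bytes=64 * 1024 * 1024,
                     block_size=1024 * 1024):
    """Write zeros to a temporary file in `directory`; return bytes per second."""
    block = bytes(block_size)
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.monotonic()
        written = 0
        while written < total_bytes:
            n = min(block_size, total_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # include the time needed to reach the disk
        elapsed = time.monotonic() - start
    return written / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Point this at the directory of your File device instead of the temp dir.
    rate = write_throughput(tempfile.gettempdir())
    print(f"sequential write: {rate / 1e6:.1f} MB/s")
```

If this number is far above what the local Bacula test achieves, the
disk itself is unlikely to be the limit.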
I hope this gives you a new starting point...
Arno
> Thanks for any type of help that can be given...
> Javier
>
--
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users