Re: [Bacula-users] Network error with FD during Backup: ERR=Connection reset by peer

2012-10-02 06:20:06
Subject: Re: [Bacula-users] Network error with FD during Backup: ERR=Connection reset by peer
From: "DAHLBOKUM Markus (FPT INDUSTRIAL)" <markus.dahlbokum AT fptindustrial DOT com>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 2 Oct 2012 12:15:47 +0200

> >I did a couple of installations and I never faced this error
> >before. Anyway, never say never again.
> >In the first scenario we were backing up to tape for a few years and
> >then migrated to a disc based solution. Everything worked like a charm.
> >This particular problem first occurred when we migrated the "problem
> >server" from a physical machine to a virtualized one (with VMware
> >converter). As I mentioned in the reply to Josh, there is another
> >virtual server on this host without any problems.
> >
> >Has anyone perhaps had issues with NIC drivers, too? I used a mix of E1000
> >and "flexible" in the VM config.
> >
> >However, can someone tell me where the problem originates? Is it
> >the FD, SD or the Dir? It's not clear to me.
>
> Hi Michael,
>
> I might have a similar problem. We also used Bacula for years and have now
> migrated our main server to VMware.
>
> In the first 3 months everything worked fine, but after the summer
> shutdown I saw the broken pipe error.
>
> My configuration:
>
> On the storage server, a huge disk storage is attached. Here only the
> file daemon is running. (VMware)
>
> On the backup server the director and the storage daemons are running.
> (physical server)
>
> OS is in both cases Ubuntu 12.04 64 bit.
> Kernel: 3.2.0-27
> Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1
>
> We don't use a tape changer, and the weekly full backup needs 2 tapes.
> The job starts on Saturday and normally waits for the second tape,
> which I change on Monday morning.
>
> But since the shutdown the network connection is reset after exactly
> 15 minutes and the job stops with a broken pipe error.
>
> I have added the heartbeat interval on all daemons, but no change.
>
> What is a little suspicious is that when I reschedule the job during
> the week, the job waits for the tape for one, two or three days without
> a problem. When it starts on weekends, error!
>
> In my case it might be an update of our switch's firmware. Some other
> guy from IT updated all switches. Next weekend I will be able to test
> my backup with the old firmware again. Perhaps this was the reason
> in my case.
>
> Did you have any changes in your network environment?
>
>I seem to remember someone on here having this problem previously.
>Bacula daemons all set socket option SO_KEEPALIVE to keep the
>connections from timing out, but a switch in between was not properly
>honoring the TCP keepalive. When the switch times out the connection,
>both FD and DIR then think the other side closed the connection.
>
>However, Michael mentioned that on the second scenario all servers are
>on the same hypervisor and there is no switch. Maybe the place to start
>is to move the failing VM to the other hypervisor and see if it still
>fails. Perhaps there is some difference in the VMWare configs.
>
> I will post if the firmware was the problem.
>
> Regarding your question about which daemon is causing the trouble: is
> there really no output showing which daemon gets the error? In my case
> it's the communication between the FD on the VMware server and the SD.
>
> 25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is waiting. Cannot find any appendable volumes.
> Please use the "label" command to create a new Volume for:
> Storage: "Drive-1" (/dev/nst0)
> Pool: Pool-backup4
> Media type: LTO-4
> 25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error sending 65536 bytes to Storage daemon:160.220.129.201:9103: ERR=Connection timed out
> 25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network send error to SD. ERR=Connection timed out
> 25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error sending -6 bytes to client:160.220.129.203:36643: ERR=Connection reset by peer
> 25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5 (26Jan12):
>
> Regards,
> Markus
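
The 15-minute figure can be read straight off the quoted log: the SD starts waiting at 16:18 and the FD write fails at 16:33. A minimal Python sketch of that arithmetic, assuming the "DD-Mon HH:MM" stamp format seen in the log and an English locale for the month name (the helper names `parse_log_time` and `elapsed` are made up for illustration and are not part of Bacula):

```python
from datetime import datetime, timedelta


def parse_log_time(stamp: str) -> datetime:
    """Parse a 'DD-Mon HH:MM' stamp; the log carries no year, so 1900 is implied."""
    # %b relies on the locale's month abbreviations (English assumed here).
    return datetime.strptime(stamp, "%d-%b %H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two log stamps, assuming both fall in the same year."""
    return parse_log_time(end) - parse_log_time(start)


if __name__ == "__main__":
    # Stamps copied from the quoted log: SD starts waiting, then the FD write fails.
    gap = elapsed("25-Aug 16:18", "25-Aug 16:33")
    print(f"connection dropped after {gap.total_seconds() / 60:.0f} minutes of waiting")
```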

 

 

I was now able to check whether the Bacula FD-to-SD connection timed out because of the network switches. This was not the case; my job still gets cancelled.

 

To check whether the heartbeat is really working, I installed Wireshark and traced my network connections.

I see my traymonitor connecting every 5 sec to dir, sd and fd. But I can’t see any heartbeat between my two servers. There should be something every 5 sec, too.
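
One rough way to confirm this from a capture is to export the packet timestamps for the connection between the two servers and look for silent periods longer than the configured interval. A minimal Python sketch, assuming the timestamps have already been exported as seconds since the start of the capture (the function name `find_silent_gaps` and the sample values are hypothetical, not part of Bacula or Wireshark):

```python
from typing import List, Tuple


def find_silent_gaps(timestamps: List[float], max_gap: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive packets are more than max_gap seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


if __name__ == "__main__":
    # Hypothetical packet times in seconds; a working 5 s heartbeat should
    # leave no gap much longer than 5 s while the connection is open.
    sample = [0.0, 5.0, 10.0, 15.0, 900.0]
    for start, end in find_silent_gaps(sample, max_gap=6.0):
        print(f"no traffic for {end - start:.0f} s (from {start:.0f} s to {end:.0f} s)")
```

An empty result for the FD-to-SD connection would suggest that something is being sent at least every `max_gap` seconds; it does not by itself prove that the packets are Bacula heartbeats.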

 

Can someone tell me how and when the heartbeat should occur? Is it active when no job is running?

In my config I set the following line for dir, sd and fd:

Heartbeat Interval = 5

This should result in a heartbeat every 5 sec?
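
For comparison, the SO_KEEPALIVE option mentioned in the quoted reply works at the TCP level and is a separate mechanism from the Heartbeat Interval directive. On Linux the kernel default is to wait 7200 seconds of idle time before sending the first keepalive probe (net.ipv4.tcp_keepalive_time), so keepalive with default timing may not help against a connection that is dropped after 15 minutes. A minimal Python sketch of enabling and tuning keepalive on a plain socket; this is an illustration only, not Bacula's code, and the helper name `enable_keepalive` and the timing values are assumptions:

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60, interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, tune the probe timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are not available on every platform; skip them if missing.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("SO_KEEPALIVE:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```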

 

I’m grateful for any help I can get.

Regards,
Markus
