> >I have done a couple of installations and never ran into this error
> >before. Anyway, never say never.
> >In the first scenario we backed up to tape for a few years and
> >then migrated to a disk-based solution. Everything worked like a charm.
> >This particular problem first occurred when we migrated the "problem
> >server" from a physical machine to a virtual one (with VMware
> >Converter). As I mentioned in my reply to Josh, there is another
> >virtual server on this host without any problems.
> >
> >Has anyone had issues with NIC drivers, too? I used a mix of E1000
> >and "flexible" in the VM config.
> >
> >However, can someone tell me where the problem originates? Is it
> >the FD, SD or the Dir? It's not clear to me.
>
> Hi Michael,
>
> I might have a similar problem. We also used Bacula for years and have
> now migrated our main server to VMware.
>
> For the first 3 months everything worked fine, but after the summer
> shutdown I saw the broken pipe error.
>
> My configuration:
>
> On the storage server, a large disk storage is attached. Only the
> file daemon runs here. (VMware)
>
> On the backup server, the director and storage daemons are running.
> (physical server)
>
> The OS in both cases is Ubuntu 12.04 64-bit.
>
> Kernel: 3.2.0-27
>
> Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1
>
> We don't use a tape changer, and the weekly full backup needs 2 tapes.
> The job starts on Saturday and normally waits for the second tape,
> which I change on Monday morning.
>
> But since the shutdown the network is reset after exactly 15 minutes
> and the job stops with a broken pipe error.
>
> I have added the heartbeat interval on all daemons, but no change.
>
> What is a little suspicious is that when I reschedule the job during
> the week, the job waits for the tape for one, two or three days without a
> problem. When it starts on a weekend, error!
>
> In my case it might be an update of our switch's firmware. Some other
> guy from IT updated all switches. Next weekend I will be able to test
> my backup with the old firmware again. Perhaps this was the reason
> in my case.
>
> Did you have any changes in your network environment?
>
>
>I seem to remember someone on here having this problem previously.
>Bacula daemons all set socket option SO_KEEPALIVE to keep the
>connections from timing out, but a switch in between was not properly
>honoring the TCP keepalive. When the switch times out the connection,
>both FD and DIR then think the other side closed the connection.
>
>However, Michael mentioned that on the second scenario all servers are
>on the same hypervisor and there is no switch. Maybe the place to start
>is to move the failing VM to the other hypervisor and see if it still
>fails. Perhaps there is some difference in the VMware configs.
>
> I will post if the firmware was the problem.
>
> Regarding your question about which daemon is causing the trouble: is there
> really no output showing which daemon gets the error? In my case it's the
> communication between the FD on the VMware server and the SD.
>
> 25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is
> waiting. Cannot find any appendable volumes.
>
> Please use the "label" command to create a new Volume for:
>
> Storage: "Drive-1" (/dev/nst0)
>
> Pool: Pool-backup4
>
> Media type: LTO-4
>
> 25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error
> sending 65536 bytes to Storage daemon:160.220.129.201:9103:
> ERR=Connection timed out
>
> 25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network
> send error to SD. ERR=Connection timed out
>
> 25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error
> sending -6 bytes to client:160.220.129.203:36643: ERR=Connection reset
> by peer
>
> 25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5
> (26Jan12):
>
> Regards,
> Markus
I have now been able to check whether the Bacula FD-to-SD connection timed out because of the network switches. This was not the case; my job is still cancelled.
Next I checked whether the heartbeat is really working, so I installed Wireshark and traced my network connections.
I see my tray monitor connecting to the Dir, SD and FD every 5 seconds, but I can't see any heartbeat between my two servers. There should be something every 5 seconds, too.
Can someone tell me how and when the heartbeat should occur? Is it active when no job is running?
In my config I set the following line for the Dir, SD and FD:
Heartbeat Interval = 5
Should this result in a heartbeat every 5 seconds?
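For comparison with the SO_KEEPALIVE explanation quoted above, here is a minimal Python sketch of what enabling TCP keepalive on a socket looks like at the OS level. It is not Bacula's code; the helper name enable_keepalive and the 60/10/5 values are assumptions chosen only for illustration. As far as I know, the Linux default idle time before the first keepalive probe is 7200 seconds, so with default kernel settings a connection that gets dropped after 15 idle minutes would never see a probe. That would make the application-level heartbeat the only traffic that could keep the connection alive while the job waits for a tape.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for sock (hypothetical helper, not Bacula code)."""
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are not available on every platform,
    # so each one is guarded.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("SO_KEEPALIVE:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Capturing on port 9103 (the SD port shown in the log above) while a job waits for a tape should show whether any such traffic actually crosses the wire.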
I'm thankful for any help I can get.
Regards,
Markus