Re: [Bacula-users] Network error with FD during Backup: ERR=Connection reset by peer

2012-09-27 11:23:26
From: Josh Fisher <jfisher AT pvct DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Thu, 27 Sep 2012 11:20:41 -0400
On 9/27/2012 4:15 AM, DAHLBOKUM Markus (FPT INDUSTRIAL) wrote:
>
> > I did a couple of installations and never ran into this error
> > before. Anyway, never say never again.
> > In the first scenario we were backing up to tape for a few years and
> > then migrated to a disk-based solution. Everything worked like a charm.
> > This particular problem first occurred when we migrated the "problem
> > server" from a physical machine to a virtualized one (with VMware
> > Converter). As I mentioned in the reply to Josh, there is another
> > virtual server on this host without any problems.
> >
> > Has anyone else had issues with NIC drivers, too? I used a mix of E1000
> > or "flexible" in the VM config.
> >
> > However, can someone tell me where the problem originates? Is it
> > the FD, SD or the Dir? It's not clear to me.
>
> Hi Michael,
>
> I might have a similar problem. We also used Bacula for years and now 
> migrated our main server to VMware.
>
> For the first 3 months everything worked fine, but after the summer
> shutdown I saw the broken pipe error.
>
> My configuration:
>
> On the storage server, a huge disk storage is attached. Here only the 
> file daemon is running. (VMware)
>
> On the backup server the director and the storage daemons are running. 
> (physical server)
>
> OS is in both cases Ubuntu 12.04 64 bit.
>
> Kernel: 3.2.0-27
>
> Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1
>
> We don’t use a tape changer, and the weekly full backup needs 2 tapes.
> The job starts on Saturday and normally waits for the second tape,
> which I change on Monday morning.
>
> But since the shutdown the network is reset after exactly 15 minutes 
> and the job stops with a broken pipe error.
>
> I have added the heartbeat interval on all daemons, but no change.
>
> What is a little suspicious is that when I reschedule the job during
> the week, the job waits for the tape for one, two or three days without
> a problem. When it starts on a weekend, it fails!
>
> In my case it might be an update of our switch’s firmware. Someone else
> from IT updated all switches. Next weekend I will be able to test
> my backup with the old firmware again. Perhaps this was the reason
> in my case.
>
> Did you have any changes in your network environment?
>

I seem to remember someone on here having this problem previously. 
Bacula daemons all set socket option SO_KEEPALIVE to keep the 
connections from timing out, but a switch in between was not properly 
honoring the TCP keepalive. When the switch times out the connection, 
both FD and DIR then think the other side closed the connection.
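
As a rough illustration of what that socket option involves (this is a generic Python sketch, not Bacula's code; the function name and the timer values are made up for the example): on Linux the first keepalive probe is, by default, typically sent only after 7200 seconds of idle time (net.ipv4.tcp_keepalive_time), so enabling SO_KEEPALIVE alone would not necessarily help against a device that drops idle connections after 15 minutes. The per-socket timers can be shortened where the platform exposes them:

```python
import socket


def enable_keepalive(sock, idle=300, interval=60, probes=5):
    """Enable TCP keepalive on sock and shorten its timers where supported.

    idle, interval and probes are illustrative values, not Bacula defaults.
    """
    # Ask the kernel to send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These TCP-level options are platform specific (present on Linux),
    # so each one is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With an idle time below the timeout of whatever sits between the daemons, probes would flow before the connection is considered dead, provided the device in between actually counts them as traffic.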

However, Michael mentioned that in the second scenario all servers are
on the same hypervisor and there is no switch. Maybe the place to start
is to move the failing VM to the other hypervisor and see if it still
fails. Perhaps there is some difference in the VMware configs.

> I will post if the firmware was the problem.
>
> Regarding your question about which daemon is causing the trouble: is
> there really no output showing which daemon gets the error? In my case
> it’s the communication between the FD on the VMware server and the SD.
>
> 25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is waiting. Cannot find any appendable volumes.
> Please use the "label" command to create a new Volume for:
>     Storage:    "Drive-1" (/dev/nst0)
>     Pool:       Pool-backup4
>     Media type: LTO-4
> 25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error sending 65536 bytes to Storage daemon:160.220.129.201:9103: ERR=Connection timed out
> 25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network send error to SD. ERR=Connection timed out
> 25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error sending -6 bytes to client:160.220.129.203:36643: ERR=Connection reset by peer
> 25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5 (26Jan12):
>
> Regards,
> Markus
>
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://ad.doubleclick.net/clk;258768047;13503038;j?
> http://info.appdynamics.com/FreeJavaPerformanceDownload.html
>
>
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users


