Bacula-users

Re: [Bacula-users] jobs fail with various "broken pipe" errors

2012-03-06 07:46:52
From: Silver Salonen <silver AT serverock DOT ee>
To: bacula-users <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 06 Mar 2012 14:44:33 +0200
On Tuesday 28 February 2012 16:07:50 Christopher Hylarides wrote:
> I'm not sure why (I haven't had the need to dig this deep), but with 
> large backups (well all of them, really) bacula-dir connects to the FD, 
> then the FD starts doing stuff while the DIR still maintains the 
> connection.  So it could be timing out after half an hour and then later 
> when the DIR tries to write again it fails.
> 
> This is why I tuned the TCP *keepalive* to 15 seconds from the Solaris 
> default of 2 hours.  This is exactly what happened to me.  I'd start a 
> large backup, and without question it failed at 2.5 hours.
> 
> See also:
> http://leaf.dragonflybsd.org/mailarchive/commits/2008-03/msg00166.html
> 
> What you probably want to do is forcefully enable TCP keepalives and 
> have them go every minute or so.  It may not even be your firewall 
> timing out.  My machines were on the same LAN.

I set TCP keepalive to 15 seconds on my Bacula server, but it did not change a 
thing.
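
For anyone who wants to try the same thing per connection rather than 
system-wide, here is a minimal sketch of enabling keepalive on a single 
socket. It is Python purely for illustration and is not Bacula's code. The 
function name enable_keepalive and the probe interval and count are my own 
assumptions; only the 15-second idle time comes from this thread. The 
timing options are not available on every platform, so the sketch skips 
whatever the local system does not expose or refuses.

```python
import socket


def enable_keepalive(sock, idle=15, interval=15, count=4):
    """Enable TCP keepalive on one socket (hypothetical helper).

    idle     -- seconds of silence before the first probe (15s as in the thread)
    interval -- seconds between probes (assumed value)
    count    -- unanswered probes before the connection is dropped (assumed value)

    Returns a dict of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    wanted = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform's socket module does not expose the option
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            continue  # the OS refused the option; keep the system default
        applied[name] = value
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s))
    s.close()
```

The returned dict makes it easy to see which settings actually took effect 
on a given machine before blaming (or clearing) the firewall.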

Additionally, I downgraded the Bacula server to 5.0, but that did not help 
either (so, fortunately, the problem is not a regression in 5.2).

I was able to solve some problems though.
We have multiple clients in the same environment but in different VLANs, all 
behind a pfSense firewall. Previously the DIR connected to the clients through 
their external addresses and "port reflection" (whatever that means in pfSense). 
When I changed the external addresses to internal ones, the DIR--FD timeouts 
went away.

So I guess the remaining FD--SD timeouts are somehow caused by the pfSense firewall 
too. I'll keep digging.

PS. Please post your replies below the quoted text on mailing lists :)

--
Silver

> On 12-02-27 10:23 AM, Silver Salonen wrote:
> > On Monday 27 February 2012 09:29:13 Christopher Hylarides wrote:
> >> I had a similar issue that was solved by tweaking my TCP-keepalives at
> >> the kernel level that my director was on (in my case Solaris).
> >>
> >> My case was on a LAN, but with over 300GB.  It would fail at exactly the
> >> same time.
> >
> > Hi.
> >
> > Thanks for the information. We use FreeBSD-based PF firewalls, all the 
> > timeout values there are at their defaults, and none of them is less than 15s:
> >
> > tcp.first 120s
> > tcp.opening 30s
> > tcp.established 86400s
> > tcp.closing 900s
> > tcp.finwait 45s
> > tcp.closed 90s
> > tcp.tsdiff 30s
> >
> > Any more guesses? Could it be something hardware-related?
> >
> > --
> > Silver
> >
> >>
> >> On 12-02-25 9:21 AM, Silver Salonen wrote:
> >>> On Thu, 23 Feb 2012 10:49:55 -0500, Josh Fisher wrote:
> >>>> On 2/23/2012 4:11 AM, Silver Salonen wrote:
> >>>>> On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote:
> >>>>>
> >>>>> What's also interesting about these failures are these lines
> >>>>> (similar in all these failing jobs):
> >>>>>      FD Files Written:       381
> >>>>>      SD Files Written:       0
> >>>>>      FD Bytes Written:       391,430,239 (391.4 MB)
> >>>>>      SD Bytes Written:       0 (0 B)
> >>>>>      Last Volume Bytes:      260 (260 B)
> >>>>>
> >>>>> And the actual volume file seems to contain all the data (its size
> >>>>> is 373MB).
> >>>>>
> >>>>> What can we conclude from that?
> >>>>> Does the failure/timeout/whatever occur after the FD--SD connection,
> >>>>> e.g. when the SD tries to communicate with the DIR about the end of the
> >>>>> job or something?
> >>>>
> >>>> Or does the Dir abort the job after a timeout/whatever occurs for the
> >>>> Dir->FD connection? Since the problem started after changing network
> >>>> environment, I suspect a switch or router is timing out the Dir->FD
> >>>> connection, perhaps when the FD is busy compressing a large file or
> >>>> something. Try turning compression off? Just a guess.
> >>>
> >>> Tried it. No changes :(
> >>>
> >>> --
> >>> Silver
