Re: [Bacula-users] jobs fail with various "broken pipe" errors

2012-03-12 10:00:13
From: Silver Salonen <silver AT serverock DOT ee>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 12 Mar 2012 15:58:20 +0200
On Tuesday 06 March 2012 14:44:33 Silver Salonen wrote:
> On Tuesday 28 February 2012 16:07:50 Christopher Hylarides wrote:
> > I'm not sure why (I haven't had the need to dig this deep), but with 
> > large backups (well all of them, really) bacula-dir connects to the FD, 
> > then the FD starts doing stuff while the DIR still maintains the 
> > connection.  So it could be timing out after half an hour and then later 
> > when the DIR tries to write again it fails.
> > 
> > This is why I tuned the TCP *keepalive* to 15 seconds from the Solaris 
> > default of 2 hours.  This is exactly what happened to me.  I'd start a 
> > large backup, and without question it failed at 2.5 hours.
> > 
> > See also:
> > http://leaf.dragonflybsd.org/mailarchive/commits/2008-03/msg00166.html
> > 
> > What you probably want to do is forcefully enable TCP keepalives and 
> > have them go every minute or so.  It may not even be your firewall 
> > timing out.  My machines were on the same LAN.
> 
> I set TCP keepalive to 15 seconds on my Bacula server, but it did not change 
> a thing.
> 
> Additionally I downgraded Bacula server to 5.0, but fortunately it seems it 
> did not help either (meaning the problem is not a regression in 5.2).
> 
> I was able to solve some problems though.
> We have multiple clients in the same environment, but in different VLANs, all 
> behind a pfSense firewall. Previously, DIR connected to the clients through 
> external addresses and "port reflection" (whatever that means in pfSense). When 
> I changed the external addresses to internal ones, the DIR--FD timeouts went away.
> 
> So I guess the remaining FD--SD timeouts are somehow caused by pfSense 
> firewall too. I'll keep digging.
> 
> PS. Please post your replies below the quoted text in mailing lists :)

So I've confirmed that what is to blame here is pfSense's port reflection.

From the forums I've gathered that it supposedly means the port redirection is 
done with netcat instead of PF (which is really hackish, even in the pfSense 
developers' own view), and netcat's TCP timeout is 2000s by default. It does 
not seem possible to disable the timeout.

What is still not clear to me is why DIR has to keep the DIR--FD connection 
open while FD is sending its data to SD. But well, at least the issue is 
worked around now.
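
For anyone who wants to try the keepalive approach from the quoted messages on a
single connection rather than system-wide, here is a minimal Python sketch. It
is not how Bacula does it internally; the function name enable_keepalive and
the timing values are illustrative assumptions, and the TCP_KEEP* options are
applied only where the platform exposes them. One caveat, stated as an
assumption rather than a verified fact: keepalive probes are handled by the
kernel and carry no application data, so a userspace relay such as netcat would
probably not count them as traffic. That would fit with the 15-second keepalive
making no difference here.

```python
import socket


def enable_keepalive(sock, idle=60, interval=15, count=4):
    """Enable TCP keepalive on one socket and return the options applied.

    idle, interval and count are hypothetical example values:
    seconds of silence before the first probe, seconds between probes,
    and the number of unanswered probes before the connection is dropped.
    """
    # SO_KEEPALIVE is the portable switch that turns keepalive on.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    # The per-socket timing options are platform-specific, so each one
    # is set only if this Python build exposes it.
    timing = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in timing:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
        except OSError:
            # Option is defined but rejected by this OS; skip it.
            pass
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s))
    s.close()
```

With these example values a silent connection would see its first probe after
60 seconds, far below the 2000s (about 33 minutes) relay timeout mentioned
above, but only if whatever sits in between actually treats the probes as
activity.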

--
Silver

> > On 12-02-27 10:23 AM, Silver Salonen wrote:
> > > On Monday 27 February 2012 09:29:13 Christopher Hylarides wrote:
> > >> I had a similar issue that was solved by tweaking my TCP-keepalives at
> > >> the kernel level that my director was on (in my case Solaris).
> > >>
> > >> My case was on a LAN, but with over 300GB.  It would fail at exactly the
> > >> same time.
> > >
> > > Hi.
> > >
> > > Thanks for the information. We use FreeBSD-based PF firewalls, and all the 
> > > timeout values there are at their defaults; none of them is less than 15s:
> > >
> > > tcp.first 120s
> > > tcp.opening 30s
> > > tcp.established 86400s
> > > tcp.closing 900s
> > > tcp.finwait 45s
> > > tcp.closed 90s
> > > tcp.tsdiff 30s
> > >
> > > Any more guesses? Could it be something hardware-related?
> > >
> > > --
> > > Silver
> > >
> > >>
> > >> On 12-02-25 9:21 AM, Silver Salonen wrote:
> > >>> On Thu, 23 Feb 2012 10:49:55 -0500, Josh Fisher wrote:
> > >>>> On 2/23/2012 4:11 AM, Silver Salonen wrote:
> > >>>>> On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote:
> > >>>>>
> > >>>>> What's also interesting about these failures are these lines
> > >>>>> (similar in all these failing jobs):
> > >>>>>      FD Files Written:       381
> > >>>>>      SD Files Written:       0
> > >>>>>      FD Bytes Written:       391,430,239 (391.4 MB)
> > >>>>>      SD Bytes Written:       0 (0 B)
> > >>>>>      Last Volume Bytes:      260 (260 B)
> > >>>>>
> > >>>>> And the actual volume file seems to contain all the data (its size
> > >>>>> is 373MB).
> > >>>>>
> > >>>>> What can we conclude from that?
> > >>>>> Does the failure/timeout/whatever occur after the FD--SD connection,
> > >>>>> e.g. when SD tries to communicate with DIR about the end of the job or
> > >>>>> something?
> > >>>>
> > >>>> Or does the Dir abort the job after a timeout/whatever occurs for the
> > >>>> Dir->FD connection? Since the problem started after changing network
> > >>>> environment, I suspect a switch or router is timing out the Dir->FD
> > >>>> connection, perhaps when the FD is busy compressing a large file or
> > >>>> something. Try turning compression off? Just a guess.
> > >>>
> > >>> Tried it. No changes :(
> > >>>
> > >>> --
> > >>> Silver
