Bacula-users

Re: [Bacula-users] bacula watchdog sending kill

2012-10-11 08:49:24
Subject: Re: [Bacula-users] bacula watchdog sending kill
From: Martin Simmons <martin AT lispworks DOT com>
To: Dan Langille <dan AT langille DOT org>
Date: Thu, 11 Oct 2012 13:41:45 +0100
>>>>> On Wed, 10 Oct 2012 19:15:55 -0400, Dan Langille said:
> 
> On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC] 
> wrote:
> 
> > I have Bacula 5.2.10 installed on a RHEL 6 server and it has been running
> > fine, but recently we bumped into a problem. I am backing up our data
> > server, which holds about 26TB. I started a Full backup of this machine;
> > the backup ran for 6 days and then the process was killed by the Watchdog.
> > Here is the information I got from bconsole:
> >  
> > 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size reached.
> > 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume. 
> > Despooling 966,367,832,548 bytes ...
> > 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill after 
> > 518406 secs to thread stalled reading File daemon.
> 
> Yes, that's 6 days (as mentioned below), or close to it: 518400...
> 
> > 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with FD 
> > during Backup: ERR=Interrupted system call
> > 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal append 
> > error on device "Drive-1" (/dev/nst0): ERR=
> > 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status returned from 
> > FD.
> > 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10 (28Jun12):
> >   Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
> >  
> >  
> > I read about the “Max Run Time = time” directive that can be set in the
> > Bacula config file. I also read that, by default, the watchdog thread will
> > kill any Job that has run more than 6 days. Is it true that the maximum
> > watchdog timeout is independent of MaxRunTime and cannot be changed?
> 
> Yes, I am sure that is correct.
> 
> > I am not sure whether I should set this directive in my Bacula config file.
> > Has anybody encountered this issue? If so, how did you solve it?
> >  
> > I would appreciate your help.
> >  
> 
> If I recall correctly, you need to make a code change and recompile.  It is
> a simple patch, and has been posted to this list (or at least referred to on
> this list) in the past month.  Search for 'Watchdog sending kill' and see what
> you find.
> 
> Oh wait, you're with NASA.  OK, here goes.   I like marc.info archives: 
> http://marc.info/?l=bacula-users
> 
> I found the reference I was thinking of: 
> http://marc.info/?l=bacula-users&m=134237429312031&w=2
> 
> I think this is what they were referring to: 
> http://marc.info/?l=bacula-users&m=131707949318181&w=2
> 
> and it looks like src/lib/watchdog.c is your friend.  I looked at that code, 
> but couldn't figure out a solution.  And now I'm out of time.  Sorry.

This 6-day timeout is in src/lib/bnet.c, I think (see init_bsock).

__Martin

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users