
Re: [Bacula-users] bacula watchdog sending kill

2012-10-11 09:51:46
Subject: Re: [Bacula-users] bacula watchdog sending kill
From: Dan Langille <dan AT langille DOT org>
To: <uthra.r.rao AT nasa DOT gov>
Date: Thu, 11 Oct 2012 09:49:23 -0400
On 2012-10-11 08:41, Martin Simmons wrote:
>>>>>> On Wed, 10 Oct 2012 19:15:55 -0400, Dan Langille said:
>>
>> On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET 
>> SYSTEMS INC] wrote:
>>
>> > I have bacula 5.2.10 installed on a RHEL 6 server and it has been 
>> running fine, but recently we have run into a problem. I am backing 
>> up our data server, which is about 26TB. I started a Full backup of 
>> this machine; the backup ran for 6 days and then the process was 
>> killed by the watchdog. Here is the information I got from bconsole:
>> >
>> > 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size 
>> reached.
>> > 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume. 
>> Despooling 966,367,832,548 bytes ...
>> > 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill 
>> after 518406 secs to thread stalled reading File daemon.
>>
>> Yes, that's 6 days (as mentioned below), or close to it: 518400...
>>
>> > 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with 
>> FD during Backup: ERR=Interrupted system call
>> > 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal 
>> append error on device "Drive-1" (/dev/nst0): ERR=
>> > 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status 
>> returned from FD.
>> > 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10 
>> (28Jun12):
>> >   Build OS:               x86_64-unknown-linux-gnu redhat 
>> Enterprise release
>> >
>> >
>> > I read about the “Max Run Time = time” directive that can be set in 
>> the bacula config file. I also read that, by default, the watchdog 
>> thread will kill any Job that has run more than 6 days. Is it true 
>> that the maximum watchdog timeout is independent of MaxRunTime and 
>> cannot be changed?
>>
>> Yes, I am sure that is correct.
>>
>> >  I am not sure whether I should set this directive in my bacula 
>> config file. Has anybody encountered this issue? If so, how did you 
>> solve it?
>> >
>> > I would appreciate your help.
>> >
>>
>> If I recall correctly, you need to make a code change and 
>> recompile.  It is a simple patch, and has been posted to this list (or 
>> at least referred to on this list) in the past month.  Search for 
>> 'Watchdog sending kill' and see what you find.
>>
>> Oh wait, you're with NASA.  OK, here goes.   I like marc.info 
>> archives: http://marc.info/?l=bacula-users
>>
>> I found the reference I was thinking of: 
>> http://marc.info/?l=bacula-users&m=134237429312031&w=2
>>
>> I think this is what they were referring to: 
>> http://marc.info/?l=bacula-users&m=131707949318181&w=2
>>
>> and it looks like src/lib/watchdog.c is your friend.  I looked at 
>> that code, but couldn't figure out a solution.  And now I'm out of 
>> time.  Sorry.
>
> This 6-day timeout is in src/lib/bnet.c, I think (see init_bsock).

Thank you.

Found it.

Look for this:

    /*
     * ****FIXME**** reduce this to a few hours once
     *   heartbeats are implemented
     */
    bsock->timeout = 60 * 60 * 6 * 24;   /* 6 days timeout */


Bump up the timeout value, recompile, and you're good to go.
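
For anyone checking the arithmetic before patching: 60 * 60 * 6 * 24 is 518400 seconds, which matches the "518406 secs" in the log above to within a few seconds. Here is a minimal sketch of the calculation, not the actual Bacula source. The struct demo_bsock, its field type, and both helper functions are hypothetical stand-ins of mine; in the real tree you would only change the constant on the line quoted above.

```c
#include <stdint.h>

/* Hypothetical stand-in for Bacula's BSOCK; only the field discussed
 * in this thread is modelled, and its type here is an assumption. */
struct demo_bsock {
    int64_t timeout;   /* timeout in seconds */
};

/* Convert a number of days to seconds. */
static int64_t days_to_seconds(int64_t days)
{
    return days * 24 * 60 * 60;
}

/* Same assignment as the quoted line, with the day count made a
 * parameter so a larger value is easy to try. */
static void demo_set_timeout(struct demo_bsock *bsock, int64_t days)
{
    bsock->timeout = days_to_seconds(days);
}
```

With days set to 6 this gives the current 518400 seconds; a value such as 30 (picked purely for illustration) gives 2592000 seconds. Choose whatever comfortably exceeds your longest expected job.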




-- 
Dan Langille - http://langille.org/
