Re: [Bacula-users] bacula watchdog sending kill

2012-10-23 12:49:24
Subject: Re: [Bacula-users] bacula watchdog sending kill
From: Dan Langille <dan AT langille DOT org>
To: "Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]" <uthra.r.rao AT nasa DOT gov>
Date: Tue, 23 Oct 2012 18:08:11 +0200
Thank you.  This will be useful for others seeking the same solution.

On Oct 22, 2012, at 7:47 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC] 
wrote:

> I changed the timeout value from 6 days to 60 days in src/lib/bnet.c and 
> bsock.c. I also added "Heartbeat Interval = 120" to bacula-dir.conf, 
> bacula-sd.conf, bacula-fd.conf and bconsole.conf.
> 
> bsock->timeout = 60 * 60 * 60 * 24;   /* 60 days timeout */
> 
> I re-compiled bacula and ran a full backup of 26TB. It completed successfully 
> after 9 days.
> 
> Thank you all for your help.
> 
> Uthra
> 
> -----Original Message-----
> From: Dan Langille [mailto:dan AT langille DOT org] 
> Sent: Thursday, October 11, 2012 9:49 AM
> To: Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]
> Cc: Martin Simmons; bacula-users AT lists.sourceforge DOT net
> Subject: Re: [Bacula-users] bacula watchdog sending kill
> 
> On 2012-10-11 08:41, Martin Simmons wrote:
>>>>>>> On Wed, 10 Oct 2012 19:15:55 -0400, Dan Langille said:
>>> 
>>> On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS 
>>> INC] wrote:
>>> 
>>>> I have bacula 5.2.10 installed on a RHEL 6 server and it has been
>>>> running fine, but recently we have bumped into a problem. I am
>>>> backing up our data server, which is about 26TB. I started a full
>>>> backup of this machine; the backup ran for 6 days and then the
>>>> process was killed by the watchdog. Here is the information I got
>>>> from bconsole:
>>>> 
>>>> 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size reached.
>>>> 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume. Despooling 966,367,832,548 bytes ...
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill after 518406 secs to thread stalled reading File daemon.
>>> 
>>> Yes, that's 6 days (as mentioned below), or close to it: 518400...
>>> 
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with FD during Backup: ERR=Interrupted system call
>>>> 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal append error on device "Drive-1" (/dev/nst0): ERR=
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status returned from FD.
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10 (28Jun12):
>>>>  Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
>>>> 
>>>> 
>>>> I read about the “Max Run Time = time” directive that can be set in
>>>> the bacula config file. I also read that, by default, the watchdog
>>>> thread will kill any Job that has run for more than 6 days, and that
>>>> the maximum watchdog timeout is independent of MaxRunTime and cannot
>>>> be changed. Is that right?
>>> 
>>> Yes, I am sure that is correct.
>>> 
>>>> I am not sure if I should set this directive in my bacula config
>>>> file. Has anybody encountered this issue? If so, how did you solve
>>>> the problem?
>>>> 
>>>> I would appreciate your help.
>>>> 
>>> 
>>> If I recall correctly, you need to make a code change and recompile.  
>>> It is a simple patch, and has been posted to this list (or at least 
>>> referred to on this list) in the past month.  Search for 'Watchdog 
>>> sending kill' and see what you find.
>>> 
>>> Oh wait, you're with NASA.  OK, here goes.   I like marc.info 
>>> archives: http://marc.info/?l=bacula-users
>>> 
>>> I found the reference I was thinking of: 
>>> http://marc.info/?l=bacula-users&m=134237429312031&w=2
>>> 
>>> I think this is what they were referring to: 
>>> http://marc.info/?l=bacula-users&m=131707949318181&w=2
>>> 
>>> and it looks like src/lib/watchdog.c is your friend.  I looked at 
>>> that code, but couldn't figure out a solution.  And now I'm out of 
>>> time.  Sorry.
>> 
>> This 6-day timeout is in src/lib/bnet.c, I think (see init_bsock).
> 
> Thank you.
> 
> Found it.
> 
> Look for this:
> 
>    /*
>     * ****FIXME**** reduce this to a few hours once
>     *   heartbeats are implemented
>     */
>    bsock->timeout = 60 * 60 * 6 * 24;   /* 6 days timeout */
> 
> 
> Bump up the timeout value, recompile, and you're good to go.
> 
> 
> 
> 
> --
> Dan Langille - http://langille.org/

-- 
Dan Langille - http://langille.org

