Thank you. This will be useful for others seeking the same solution.
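
For anyone checking the arithmetic behind the change quoted below, here is a minimal sketch in C. The helper names (days_to_seconds, fits_in_int32, check_watchdog_arithmetic) are hypothetical and are not part of Bacula; only the two timeout expressions are taken from this thread. It assumes nothing about the actual type of bsock->timeout and only checks that the larger value would still fit in a 32-bit signed integer.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; none of this is Bacula code. */
#define SECONDS_PER_DAY (60L * 60L * 24L)

/* Convert a whole number of days to seconds. */
long days_to_seconds(long days)
{
    return days * SECONDS_PER_DAY;
}

/* Return 1 when the value fits in a 32-bit signed integer, else 0. */
int fits_in_int32(long seconds)
{
    return seconds >= 0 && seconds <= 2147483647L;
}

/* Check the two expressions quoted in this thread against the helper. */
void check_watchdog_arithmetic(void)
{
    /* Stock value: 60 * 60 * 6 * 24 seconds is 6 days. */
    assert(60L * 60L * 6L * 24L == days_to_seconds(6));

    /* Patched value: 60 * 60 * 60 * 24 seconds is 60 days. */
    assert(60L * 60L * 60L * 24L == days_to_seconds(60));

    /* The 60-day value is still far below the 32-bit signed limit. */
    assert(fits_in_int32(days_to_seconds(60)));
}
```

days_to_seconds(6) is 518400, which lines up with the "518406 secs" reported in the watchdog message, and days_to_seconds(60) is 5184000, comfortably longer than the 9 days the 26TB job needed.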
On Oct 22, 2012, at 7:47 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]
wrote:
> I changed the timeout value from 6 days to 60 days in src/lib/bnet.c and
> bsock.c. I also added "Heartbeat Interval = 120" to bacula-dir.conf,
> bacula-sd.conf, bacula-fd.conf and bconsole.conf.
>
> bsock->timeout = 60 * 60 * 60 * 24; /* 60 days timeout */
>
> I re-compiled bacula and ran a full backup of 26TB. It completed successfully
> after 9 days.
>
> Thank you all for your help.
>
> Uthra
>
> -----Original Message-----
> From: Dan Langille [mailto:dan AT langille DOT org]
> Sent: Thursday, October 11, 2012 9:49 AM
> To: Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]
> Cc: Martin Simmons; bacula-users AT lists.sourceforge DOT net
> Subject: Re: [Bacula-users] bacula watchdog sending kill
>
> On 2012-10-11 08:41, Martin Simmons wrote:
>>>>>>> On Wed, 10 Oct 2012 19:15:55 -0400, Dan Langille said:
>>>
>>> On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS
>>> INC] wrote:
>>>
>>>> I have Bacula 5.2.10 installed on a RHEL 6 server and it has been
>>> running fine, but recently we ran into a problem. I am
>>> backing up our data server, which holds about 26TB. I started a Full
>>> backup of this machine; the backup ran for 6 days and then the
>>> process was killed by the watchdog. Here is the information I got from
>>> bconsole:
>>>>
>>>> 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size
>>> reached.
>>>> 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume.
>>> Despooling 966,367,832,548 bytes ...
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill
>>> after 518406 secs to thread stalled reading File daemon.
>>>
>>> Yes, that's 6 days (as mentioned below), or close to it: 518400...
>>>
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with
>>> FD during Backup: ERR=Interrupted system call
>>>> 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal
>>> append error on device "Drive-1" (/dev/nst0): ERR=
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status
>>> returned from FD.
>>>> 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10
>>> (28Jun12):
>>>> Build OS: x86_64-unknown-linux-gnu redhat
>>> Enterprise release
>>>>
>>>>
>>>> I read about the “Max Run Time = time” directive that can be set in
>>> the Bacula config file. I also read that, by default, the watchdog
>>> thread will kill any Job that has run more than 6 days, and that the
>>> maximum watchdog timeout is independent of MaxRunTime and cannot be
>>> changed. Is that right?
>>>
>>> Yes, I am sure that is correct.
>>>
>>>> I am not sure if I should set this directive in my Bacula config
>>> file. Has anybody encountered this issue? If so, how did you solve
>>> it?
>>>>
>>>> I would appreciate your help.
>>>>
>>>
>>> If I recall correctly, you need to make a code change and recompile.
>>> It is a simple patch, and has been posted to this list (or at least
>>> referred to on this list) in the past month. Search for 'Watchdog
>>> sending kill' and see what you find.
>>>
>>> Oh wait, you're with NASA. OK, here goes. I like marc.info
>>> archives: http://marc.info/?l=bacula-users
>>>
>>> I found the reference I was thinking of:
>>> http://marc.info/?l=bacula-users&m=134237429312031&w=2
>>>
>>> I think this is what they were referring to:
>>> http://marc.info/?l=bacula-users&m=131707949318181&w=2
>>>
>>> and it looks like src/lib/watchdog.c is your friend. I looked at
>>> that code, but couldn't figure out a solution. And now I'm out of
>>> time. Sorry.
>>
>> This 6-day timeout is in src/lib/bnet.c, I think (see init_bsock).
>
> Thank you.
>
> Found it.
>
> Look for this:
>
> /*
> * ****FIXME**** reduce this to a few hours once
> * heartbeats are implemented
> */
> bsock->timeout = 60 * 60 * 6 * 24; /* 6 days timeout */
>
>
> Bump up the timeout value, recompile, and you're good to go.
>
>
>
>
> --
> Dan Langille - http://langille.org/
--
Dan Langille - http://langille.org