Bacula-users

Re: [Bacula-users] bacula watchdog sending kill

2012-10-10 19:18:32
From: Dan Langille <dan AT langille DOT org>
To: "Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC]" <uthra.r.rao AT nasa DOT gov>
Date: Wed, 10 Oct 2012 19:15:55 -0400
On Oct 10, 2012, at 5:51 PM, Rao, Uthra R. (GSFC-672.0)[ADNET SYSTEMS INC] 
wrote:

> I have bacula 5.2.10 installed on a RHEL 6 server and it has been running 
> fine, but recently we have run into a problem. I am backing up our data 
> server, which holds about 26TB. I started a Full backup of this machine; 
> the backup ran for 6 days and then the process was killed by the Watchdog. 
> Here is the information I got from bconsole:
>  
> 10-Oct 16:41 lindy-sd JobId 2458: User specified spool size reached.
> 10-Oct 16:41 lindy-sd JobId 2458: Writing spooled data to Volume. Despooling 
> 966,367,832,548 bytes ...
> 10-Oct 16:43 lindy-dir JobId 2458: Error: Watchdog sending kill after 518406 
> secs to thread stalled reading File daemon.

Yes, that's 6 days (as mentioned below), or close to it: 6 days is 518400 
seconds, and the watchdog fired 6 seconds past that.
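
For anyone who wants to check the arithmetic, here is a minimal sketch in 
Python. The 518406 figure is copied from the log line above; the function and 
variable names are made up for illustration and have nothing to do with 
Bacula's own code.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400 seconds in one day


def describe_elapsed(seconds: int) -> str:
    """Split a duration in seconds into whole days plus leftover seconds."""
    days, remainder = divmod(seconds, SECONDS_PER_DAY)
    return f"{days} days + {remainder} s"


# Six days expressed in seconds (the limit discussed in this thread).
six_day_limit = 6 * SECONDS_PER_DAY

# Elapsed time reported in the "Watchdog sending kill" log line.
reported = 518406

print(six_day_limit)               # 518400
print(describe_elapsed(reported))  # 6 days + 6 s
print(reported - six_day_limit)    # 6
```

So the job was killed within a few seconds of crossing the 6-day mark, which 
is consistent with a fixed limit rather than a coincidental network stall.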

> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: Network error with FD during 
> Backup: ERR=Interrupted system call
> 10-Oct 16:43 lindy-sd JobId 2458: Fatal error: spool.c:301 Fatal append error 
> on device "Drive-1" (/dev/nst0): ERR=
> 10-Oct 16:43 lindy-dir JobId 2458: Fatal error: No Job status returned from 
> FD.
> 10-Oct 16:43 lindy-dir JobId 2458: Error: Bacula lindy-dir 5.2.10 (28Jun12):
>   Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
>  
>  
> I read about the “Max Run Time = time” directive that can be set in the 
> bacula config file. I also read that, by default, the watchdog thread will 
> kill any Job that has run more than 6 days, and that the maximum watchdog 
> timeout is independent of MaxRunTime and cannot be changed. Is that right?

Yes, I am sure that is correct.

> I am not sure whether I should set this directive in my bacula config 
> file. Has anybody encountered this issue? If so, how did you solve it?
>  
> I would appreciate your help.
>  

If I recall correctly, you need to make a code change and recompile.  It is a 
simple patch, and it has been posted to this list (or at least referred to on 
this list) in the past month.  Search for 'Watchdog sending kill' and see what 
you find.

Oh wait, you're with NASA.  OK, here goes.   I like marc.info archives: 
http://marc.info/?l=bacula-users

I found the reference I was thinking of: 
http://marc.info/?l=bacula-users&m=134237429312031&w=2

I think this is what they were referring to: 
http://marc.info/?l=bacula-users&m=131707949318181&w=2

and it looks like src/lib/watchdog.c is your friend.  I looked at that code, 
but couldn't figure out a solution.  And now I'm out of time.  Sorry.

BTW, are you using PostgreSQL for Bacula there?  :)

-- 
Dan Langille - http://langille.org

