Re: [Bacula-users] Backup of all jobs fail if host unavailable

2008-09-02 16:18:14
Subject: Re: [Bacula-users] Backup of all jobs fail if host unavailable
From: Dan Langille <dan AT langille DOT org>
To: "Botha, Jacques (FNB)" <JacquesB AT fnb.co DOT za>
Date: Tue, 02 Sep 2008 16:18:00 -0400
Botha, Jacques (FNB) wrote:
> On Tue, 2008-09-02 at 16:01 -0400, Dan Langille wrote:
>> Botha, Jacques (FNB) wrote:
>>> On Tue, 2008-09-02 at 15:46 -0400, Dan Langille wrote:
>>>> Botha, Jacques (FNB) wrote:
>>>>> Hi 
>>>>>
>>>>> All my backups are scheduled for the same time, queue with the same
>>>>> priority, and run one at a time as each previous job finishes.
>>>>>
>>>>> Today I've got a machine that is unavailable due to a hardware fault.
>>>>> Naturally the backup for this machine failed, but so did every backup
>>>>> that was queued after it for all the other machines!
>>>>>
>>>>> Please help !
>>>>>
>>>>> I'm running bacula 2.4.2 on CentOS 5.
>>>> Perhaps if you supplied the failure messages...
>>>>
>>>
>>> Sure
>>>
>>>
>>> 2008-09-02 20:15:19 Bacula_Director JobId 175: Fatal error: Max wait time
>>> exceeded. Job canceled.
>>> 2008-09-02 20:15:19 Bacula_Director JobId 176: Fatal error: Max wait time
>>> exceeded. Job canceled.
>>> 2008-09-02 20:15:19 Bacula_Director JobId 177: Fatal error: Max wait time
>>> exceeded. Job canceled.
>>>
>>> And so forth until the last job.
>>>
>>>
>>>
>>> Some more config information which might be useful:
>>>
>>> Maximum Concurrent Jobs = 1
>>>
>>> Each job has Max Wait Time = 10 minutes defined.
>>>
>>>
>>> So my understanding is that the unavailable machine would have blocked
>>> all other backups for 10 minutes until it timed out, but then they
>>> should have continued, not be cancelled as well.
>>>
>>> Where am I going wrong ?
>> Max Wait Time is perhaps not what you want.  Remove it or reconsider its
>> use.
>>
> 
> According to the Bacula Manual: 
> 
> Max Wait Time = <time> The time specifies the maximum allowed
> time that a job may block waiting for a resource (such as waiting
> for a tape to be mounted, or waiting for the storage or file daemons
> to perform their duties), counted from when the job starts (not
> necessarily the same as when the job was scheduled).
> 
> So the unavailable machine could block other jobs for 10 minutes.  Why
> did the other jobs time out as well?  They had not been started yet,
> only scheduled.
> 
> If Max Wait Time is not what I am after, could you please point me in
> the right direction?

I don't know the answers.  I was too short in my reply.  Sorry.  What I
meant was: stop using Max Wait Time in the short term, to get your jobs
running.  Hopefully someone else can help.

But offhand, I think Max Wait Time is doing the wrong thing here.  Post
your entire job definition and we'll see.
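
To make the question concrete, here is a rough Python sketch of the two
possible readings of Max Wait Time. It is only a toy model, not Bacula's
actual scheduler: the Job class, the simulate() function, the job
durations and JobId 174 for the unreachable host are all made up for
illustration. It assumes every job is queued at the same moment and that
only one job holds the slot at a time (Maximum Concurrent Jobs = 1).

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int        # hypothetical JobIds, not taken from a real catalog
    reachable: bool    # False models the host with the hardware fault
    duration: float    # minutes the job holds the single slot when it runs


def simulate(jobs, max_wait, clock_from_queue):
    """Run jobs one at a time and report 'ok' or 'canceled' per JobId.

    clock_from_queue=True  -> ASSUMPTION: the wait clock starts when the
                              job is queued, so queue time counts as waiting.
    clock_from_queue=False -> ASSUMPTION: the wait clock starts only when
                              the job gets the slot.
    """
    now = 0.0          # minutes since all jobs were queued together
    outcome = {}
    for job in jobs:
        # How much of its wait budget the job has used before it gets the slot.
        already_waited = now if clock_from_queue else 0.0
        if already_waited >= max_wait:
            outcome[job.job_id] = "canceled"
            continue
        if not job.reachable:
            # The job blocks the slot until its wait budget runs out.
            now += max_wait - already_waited
            outcome[job.job_id] = "canceled"
            continue
        now += job.duration
        outcome[job.job_id] = "ok"
    return outcome


jobs = [
    Job(174, reachable=False, duration=0.0),   # the unavailable machine
    Job(175, reachable=True, duration=5.0),
    Job(176, reachable=True, duration=5.0),
    Job(177, reachable=True, duration=5.0),
]

from_queue = simulate(jobs, max_wait=10.0, clock_from_queue=True)
from_slot = simulate(jobs, max_wait=10.0, clock_from_queue=False)
print("clock starts at queue time:", from_queue)
print("clock starts at slot time: ", from_slot)
```

Under the first reading every queued job is canceled once the dead host
has used up the 10 minutes, which would match the identical timestamps
in the log above. Under the second reading only the dead host's job is
canceled. Which of the two Bacula 2.4.2 really does is exactly the open
question, so treat this as a way to frame it, not as an answer.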

-- 
Dan Langille

BSDCan - The Technical BSD Conference : http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference:     http://www.pgcon.org/
