Bacula-users

Re: [Bacula-users] Missing nfs share blocks job [fd: 2.2.8]

2008-11-13 08:35:50
From: Ronald Buder <rbuder AT proficom-ag DOT de>
To: bacula-users AT lists.sourceforge DOT net
Date: Thu, 13 Nov 2008 14:32:38 +0100
Arno Lehmann wrote:
> Hi,
>
> 13.11.2008 12:05, Ronald Buder wrote:
>   
>> Hi,
>>
>> We have noticed a blocker that may already be resolved in later versions 
>> of the file daemon; if not, I will file it as a bug. If, for whatever 
>> reason, a network share that is (implicitly) included in the fileset 
>> becomes unavailable, the job will stall.
>>     
>
> This is normal NFS behaviour - if an NFS server doesn't respond, the 
> processes accessing it wait in an uninterruptible state. They also are 
> not notified of the problem by a signal.
>   
That's what I was afraid of...
> That said, newer NFS client implementations allow that behaviour to be 
> changed - under Linux, the NFS mount options "soft" and "intr" can 
> be used to allow client processes to be notified of unavailable NFS 
> shares.
>   
>   
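To see which clients would be affected, one could list the NFS mounts that lack the "soft" option. The following is only a minimal sketch, assuming a Linux client whose mount table has the usual /proc/mounts layout (device, mount point, type, comma-separated options). The function name hard_nfs_mounts is made up for illustration and is not part of Bacula.

```python
def hard_nfs_mounts(mounts_text):
    """Return the mount points of NFS file systems mounted without 'soft'.

    mounts_text is expected to look like the content of /proc/mounts:
    one mount per line, fields separated by whitespace.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type not in ("nfs", "nfs4"):
            continue  # only NFS mounts are of interest here
        if "soft" not in options.split(","):
            result.append(mount_point)
    return result
```

On a client, it could be fed with open("/proc/mounts").read(). Whether switching a share to "soft" is acceptable is a separate decision, since soft mounts return I/O errors to applications once the retries are exhausted.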
>> At this very moment I am waiting for four backup 
>> jobs. I have tried to cancel them without any success. The jobs have 
>> been running for some 8 hours now; the cancellation attempt was roughly 3 
>> hours ago. As the rest of the system is still up and running and doing 
>> backups and migration, I do not want to restart the director.
>>     
>
> You will have to either restart the clients that mount the NFS shares, 
> or make the NFS server responsive again.
>   
Is there no way at all to make a job that has stalled due to 
filesystem "restrictions" time out? I wonder if other (network) 
filesystems or even storage devices might show similar behaviour.
>   
>> Running Jobs:
>> Console connected at 13-Nov-08 10:16
>>  JobId Level   Name                       Status
>> ======================================================================
>>  41637 Increme  PLATON-W0001_System.2008-11-13_04.00.21 has been canceled
>>  41641 Increme  PLATON-W0003_System.2008-11-13_04.00.25 has been canceled
>>  41643 Increme  PLATON-W0004_System.2008-11-13_04.00.27 has been canceled
>>  41645 Increme  PLATON-W0005_System.2008-11-13_04.00.29 has been canceled
>>
>> Due to a server failure the NFS shares are not available anymore. I 
>> would like to see some sort of timeout at least, if that is at all 
>> possible.
>>     
>
> That's not possible inside Bacula - the FD simply can't terminate file 
> system accesses that are stalled due to NFS problems.
>
> The best thing to do is often a restart of the NFS server.
>   
In this case the users of the affected systems let us know that the 
share may be unmounted, as it has not been working for a while and 
is no longer needed. As we are not in charge of the affected systems and 
only step in when there is trouble, we have no means of detecting and 
monitoring such problems. We only run into the consequences, 
unfortunately. Once it is unmounted, the backup jobs would continue as if 
nothing had happened. I still wonder if there is some sort of timeout for 
filesystem operations, not limited to NFS but generally.
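Lacking such a timeout, one way to at least detect the condition from outside would be to probe a mount point with a time limit, for instance from a script run before the job. The following is only a rough sketch of that idea, not something Bacula provides. The name probe_path and its timeout parameter are made up for illustration. The stat runs in a child process so that a hung mount blocks the child and not the caller.

```python
import subprocess
import sys


def probe_path(path, timeout=5.0):
    """Return "ok", "error" or "timeout" for a stat() of path."""
    # Run the stat in a separate Python process; only that process
    # blocks if the file system does not answer.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck on a hard mount may not die
        # until the server answers, so we do not wait for it here.
        child.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"
```

A call such as probe_path("/mnt/share", timeout=10) (the path is hypothetical) would then report "timeout" instead of hanging. Note that a stat may be answered from the client's attribute cache, so an "ok" does not prove the server is reachable.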
> Arno
>   
Ronald
