Bacula-users

Re: [Bacula-users] bacula watchdog killing tape-loading

2008-08-07 17:23:08
Subject: Re: [Bacula-users] bacula watchdog killing tape-loading
From: Nils Blanck-Wehde <nils.blanck-wehde AT backofficeservice DOT biz>
Date: Thu, 07 Aug 2008 23:22:35 +0200
Hi!

Yes, those commands are executed ok. I can even repeat the mtx-changer command that just failed and it will terminate normally. I wonder if it is possible to deactivate the watchdog functionality completely. My theory is that everything will work then. I think it's quite strange that timeouts occur only seconds after the command was issued...
Maybe it's the watchdog, that's gone crazy....

Nils


T. Horsnell schrieb:
Hi Nils,
Maybe something is failing when the mt driver tries to put the tape 
online. Once the tape is loaded into the drive, (after the timeout 
termination) are you able to do anything with it using mt?

Like:
mt -f /dev/nst0 status
mt -f /dev/nst0 fsf

and suchlike?

T.




  
Hi Terry,

you are correct. I did not change anything (unless there are mechanisms 
within CentOS 5.2 that change things on their own). SE-Linus is 
completely disabled on this system.

Backup to tape went fine for a while then a job terminated with the 
following output:

06-Aug 16:32 company_bacula-dir JobId 207: Recycled volume "CNV761"
06-Aug 16:32 company_bacula-sd JobId 207: 3301 Issuing autochanger "loaded? drive 0" command.
06-Aug 16:32 company_bacula-sd JobId 207: 3302 Autochanger "loaded? drive 0", result: nothing loaded.
06-Aug 16:32 company_bacula-sd JobId 207: 3304 Issuing autochanger "load slot 1, drive 0" command.
06-Aug 16:37 company_bacula-sd JobId 207: Fatal error: 3992 Bad autochanger "load slot 1, drive 0": ERR=Child died from signal 15: Termination.
Results=Drive 0 Full (Storage Element 1 loaded)
Program killed by Bacula watchdog (timeout)

06-Aug 16:35 company-appsrv-fd JobId 207: Fatal error: ../../filed/job.c:1817 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data

Since then all autochanger-related commands die with: "ERR=Child died 
from signal 15: Termination."
I think it's worth mentioning that BEFORE the error is reported all 
autochanger commands were executed correctly.
For example btape test tells me:

3993 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3 load 
1 /dev/nst0 0
3993 result="Loading media from Storage Element 1 into drive 0...done
Program killed by Bacula watchdog (timeout)
": ERR=Child died from signal 15: Termination

allthough the load command was executed correct. From monitoring the 
autochanger I cannot see ANY problems whith the communication with 
bacula. If bacula wants to change tapes, the autochanger changes tapes. 
But even after bacula prints "...done" after a command it tells me that 
something was killed by bacula watchdog. At this time, the 
autochanger-operation already completed successfully.
I have tried several mtx (load / unload) commands and I got no errors so 
I assume the autochanger is working ok.

Maybe another interesting thing is that those timeouts occur seconds 
after I issue the command allthough I have set all timeouts to ten 
minutes as it was suggested today. It doesn't look like changing 
timeouts in the storage definition has any impact on the problem.

Here is the definition:

Autochanger {
  Name = QS3DLT
  Device = DLT-Drive-1
  Changer Command = "/usr/lib/bacula/mtx-changer %c %o %S %a %d"
  Changer Device = /dev/sg3
}

Device {
  Name = DLT-Drive-1                      #
  Drive Index = 0
  Media Type = DLT-VS1
  Archive Device = /dev/nst0
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = no;
  RemovableMedia = yes;
  RandomAccess = no;
  AutoChanger = yes
  Maximum Changer Wait = 10
  Maximum Rewind Wait = 10
  Maximum Open Wait = 10
}

I am out of ideas right now...

Nils


T. Horsnell schrieb:

    
So, clutching at straws, is it definitely the case that you have made 
*no* changes to the system since things last worked properly? No 
upgrades, no messing with selinux, no messing with mysql (assuming that 
is your db system) no messing with firewalls etc. And do you still have 
space on the filesystem which contains Bacula's workspace and catalogue? 
At one point in your stuff below, you got:

      
3302 Issuing autochanger "unload 1 0" command.
unload status=Bad 134217743
              
I wonder what the status of 134217743 means....

Cheers,
Terry.

 

      
What groups is the bacula user in and what are the perms on /dev/sg3?
			Thanks -Jason

On Thu, 2008-08-07 at 14:02 +0200, Nils Blanck-Wehde wrote:

   

        
Hi Terry,

sorry, I forgot to mention: OS is CentOS 5.2, bacula-version is 2.4.2.
Problems started all out of a sudden with 2.4.1 after running fine for
weeks.

Calling mtx and mtx-changer both as root as well as user bacula works
flawlessly (slot 1 is currently loaded):

[root@company-Backupserver ~]# mtx -f /dev/sg3 load 1 0
Drive 0 Full (Storage Element 1 loaded)
[root@company-Backupserver ~]# su - bacula
-bash-3.2$ /usr/sbin/mtx -f /dev/sg3 load 1 0
Drive 0 Full (Storage Element 1 loaded)

[root@company-Backupserver ~]# /usr/lib/bacula/mtx-changer /dev/sg3
loaded 1 /dev/nst0 0
1
[root@company-Backupserver ~]# su - bacula
-bash-3.2$ /usr/lib/bacula/mtx-changer /dev/sg3 loaded 1 /dev/nst0 0
1
-bash-3.2$

I can see from the webadmin of the autochanger (Quantum Superloader 3
DLT) that autochanger commands are being executed correctly, still
bacula reports "ERR=Child died from signal 15: Termination".

Nils


T. Horsnell schrieb: 

     

          
Assuming your O/S is a Unix/Linux of some sort, have you tried the
basic mtx command on it? 
Something like 

mtx -f /dev/sg3 load 1 0 

(See 'man mtx') 

Cheers, 
Terry 



       

            
Hi all, 
bacula won't work with our autochanger anymore. I can't find the
source of the problems. 
Here is the output of the autochanger-test of: "btape
-c /etc/bacula/bacula-sd.conf /dev/nst0": 

=== Autochanger test === 

3301 Issuing autochanger "loaded" command. 
Slot 1 loaded. I am going to unload it. 
3302 Issuing autochanger "unload 1 0" command. 
unload status=Bad 134217743 
3992 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3
unload 1 /dev/nst0 0 
3992 result="Unloading drive 0 into Storage Element 1...done 
Program killed by Bacula watchdog (timeout) 
": ERR=Child died from signal 15: Termination 
3303 Issuing autochanger "load 1 0" command. 
3993 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3
load 1 /dev/nst0 0 
3993 result="Loading media from Storage Element 1 into drive
0...done 
Program killed by Bacula watchdog (timeout) 
": ERR=Child died from signal 15: Termination 
You must correct this error or the Autochanger will not work. 

This is the storage-definition: 

Autochanger { 
Name = QS3DLT 
Device = DLT-Drive-1 
Changer Command = "/usr/lib/bacula/mtx-changer %c %o %S %a %d" 
Changer Device = /dev/sg3 
} 

Device { 
Name = DLT-Drive-1                      # 
Drive Index = 0 
Media Type = DLT-VS1 
Archive Device = /dev/nst0 
AutomaticMount = yes;               # when device opened, read
it 
AlwaysOpen = no; 
RemovableMedia = yes; 
RandomAccess = no; 
AutoChanger = yes 
Maximum Changer Wait = 10 
Maximum Rewind Wait = 10 
Maximum Open Wait = 10 
} 

When I look at the webadmin of the autochanger I see the
autochanger and the drive perform exactly the requested operations
at the usual speed (~1:30min for an unload operation, ~3:50 for an
unload/load operation). 

Still I get lots of killing / timeout problems. 

I start to wonder if the autochanger is somewhat defective... 

If any of you guys can help I would greatly appreciate it. 

All the best, Nils 



Nils Blanck-Wehde schrieb: 


         

              
Hi John, 

thanks for your help. I am quite new to bacula and it seems to
take some time to fully understand it :-) 
I am not sure whether it really is a timeout problem. 
I increased all timeout values to 10 minutes and the killing
still occurs (after 2:20min): 

07-Aug 12:20 company_bacula-dir JobId 217: Start Backup JobId
217,
Job=Fileserver_Lexware_Exchange_to_Tape.2008-08-07_12.20.03 
07-Aug 12:20 company_bacula-dir JobId 217: Using Device
"DLT-Drive-1" 
07-Aug 12:20 company_bacula-sd JobId 217: 3301 Issuing
autochanger "loaded? drive 0" command. 
07-Aug 12:20 company_bacula-sd JobId 217: 3302 Autochanger
"loaded? drive 0", result: nothing loaded. 
07-Aug 12:20 company_bacula-sd JobId 217: 3304 Issuing
autochanger "load slot 8, drive 0" command. 
*messages 
07-Aug 12:22 company_bacula-sd JobId 217: Fatal error: 3992 Bad
autochanger "load slot 8, drive 0": ERR=Child died from signal
15: Termination. 
Results=Loading media from Storage Element 8 into drive
0...done 
Program killed by Bacula watchdog (timeout) 

07-Aug 12:20 company-appsrv-fd JobId 217: Fatal
error: ../../filed/job.c:1817 Bad response to Append Data
command. Wanted 3000 OK data, got 3903 Error append data 


I doublechecked the time needed by the autochanger for an
unload/load operation: I did a couple of operations and they all
terminated in less than four minutes. 
I think there is a general problem with the autochanger because
btape test throws the following error when testing the
autochanger: 

3301 Issuing autochanger "loaded" command. 
3991 Bad autochanger
command: /usr/lib/bacula/mtx-changer /dev/sg3 loaded 1 /dev/nst0
0 
3991 result="": ERR=Child died from signal 15: Termination 
You must correct this error or the Autochanger will not work. 

When I run this command "/usr/lib/bacula/mtx-changer /dev/sg3
loaded 1 /dev/nst0 0" manually, both as root and as user bacula,
it returns "1": 

[root@company-Backupserver ~]# su - bacula 
-bash-3.2$ /usr/lib/bacula/mtx-changer /dev/sg3 loaded
1 /dev/nst0 0 
1 
-bash-3.2$ 

Could it be a permission problem? What Uid Gid should
mtx-changer have? Could some automatic mechanism on my CentOS
installation have changed the permissions of the autochanger
device to a lower level? 

I am a little stuck here. 

Thanks for all help! 

Nils 


John Drescher schrieb: 


           

                
On Wed, Aug 6, 2008 at 1:42 PM, Nils Blanck-Wehde 
<nils.blanck-wehde AT backofficeservice DOT biz>
<mailto:nils.blanck-wehde AT backofficeservice DOT biz> wrote: 



             

                  
Hello list, 

I just encountered this issue: 

06-Aug 19:31 company_bacula-sd JobId 210: Fatal error: 3992
Bad 
autochanger "load slot 8, drive 0": ERR=Child died from
signal 15: 
Termination. 
Results=Loading media from Storage Element 8 into drive
0...done 
Program killed by Bacula watchdog (timeout) 


Earlier today I got this message: 

06-Aug 17:10 company_bacula-dir JobId 208: Using Device
"DLT-Drive-1" 
06-Aug 17:10 company_bacula-sd JobId 208: 3301 Issuing
autochanger "loaded? drive 0" command. 
06-Aug 17:10 company_bacula-sd JobId 208: 3302 Autochanger
"loaded? drive 0", result: nothing loaded. 
06-Aug 17:10 company_bacula-sd JobId 208: 3304 Issuing
autochanger "load slot 1, drive 0" command. 
06-Aug 17:13 company-appsrv-fd JobId 208: Fatal
error: ../../filed/job.c:1817 Bad response to Append Data
command. Wanted 3000 OK data 
, got 3903 Error append data 

06-Aug 17:15 company_bacula-sd JobId 208: Fatal error: 3992
Bad autochanger "load slot 1, drive 0": ERR=Child died from
signal 15: Termination. 
Results=Loading media from Storage Element 1 into drive
0...done 
Program killed by Bacula watchdog (timeout) 

 
               
                    
The default maximum changer wait is 5 minutes. If the changer
does not 
complete in 5 minutes bacula will kill the mtx-changer
script. 

See Maximum Changer Wait in 

http://bacula.org/en/rel-manual/Storage_Daemon_Configuratio.html 

John 

             
                  

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url="">
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
    

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url="">
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users


!DSPAM:489b4fd2120211987424007!

  

--

 

BackOfficeService - Beratung und Service für Ihre IT -

 

Anschrift:

Niederkastenholzer Str. 40

53881 Euskirchen

Telefon:   +49 2255 953204

Fax:          +49 2255 953208

Mobil:      +49 177 3397547

Bankverbindung:

Raiffeisenbank Rheinbach Voreifel eG

Kto. Nr. 340286014 (BLZ 370 696 27)

Online:

info@backofficeservice.biz

www.backofficeservice.biz

 

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users