Re: [Bacula-users] bacula watchdog killing tape-loading
2008-08-07 17:23:08
Hi!
Yes, those commands are executed ok. I can even repeat the mtx-changer
command that just failed and it will terminate normally. I wonder if it
is possible to deactivate the watchdog functionality completely. My
theory is that everything will work then. I think it's quite strange
that timeouts occur only seconds after the command was issued...
Maybe it's the watchdog, that's gone crazy....
Nils
T. Horsnell schrieb:
Hi Nils,
Maybe something is failing when the mt driver tries to put the tape
online. Once the tape is loaded into the drive, (after the timeout
termination) are you able to do anything with it using mt?
Like:
mt -f /dev/nst0 status
mt -f /dev/nst0 fsf
and suchlike?
T.
Hi Terry,
you are correct. I did not change anything (unless there are mechanisms
within CentOS 5.2 that change things on their own). SE-Linus is
completely disabled on this system.
Backup to tape went fine for a while then a job terminated with the
following output:
06-Aug 16:32 company_bacula-dir JobId 207: Recycled volume "CNV761"
06-Aug 16:32 company_bacula-sd JobId 207: 3301 Issuing autochanger "loaded? drive 0" command.
06-Aug 16:32 company_bacula-sd JobId 207: 3302 Autochanger "loaded? drive 0", result: nothing loaded.
06-Aug 16:32 company_bacula-sd JobId 207: 3304 Issuing autochanger "load slot 1, drive 0" command.
06-Aug 16:37 company_bacula-sd JobId 207: Fatal error: 3992 Bad autochanger "load slot 1, drive 0": ERR=Child died from signal 15: Termination.
Results=Drive 0 Full (Storage Element 1 loaded)
Program killed by Bacula watchdog (timeout)
06-Aug 16:35 company-appsrv-fd JobId 207: Fatal error: ../../filed/job.c:1817 Bad response to Append Data command. Wanted 3000 OK data
, got 3903 Error append data
Since then all autochanger-related commands die with: "ERR=Child died
from signal 15: Termination."
I think it's worth mentioning that BEFORE the error is reported all
autochanger commands were executed correctly.
For example btape test tells me:
3993 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3 load
1 /dev/nst0 0
3993 result="Loading media from Storage Element 1 into drive 0...done
Program killed by Bacula watchdog (timeout)
": ERR=Child died from signal 15: Termination
allthough the load command was executed correct. From monitoring the
autochanger I cannot see ANY problems whith the communication with
bacula. If bacula wants to change tapes, the autochanger changes tapes.
But even after bacula prints "...done" after a command it tells me that
something was killed by bacula watchdog. At this time, the
autochanger-operation already completed successfully.
I have tried several mtx (load / unload) commands and I got no errors so
I assume the autochanger is working ok.
Maybe another interesting thing is that those timeouts occur seconds
after I issue the command allthough I have set all timeouts to ten
minutes as it was suggested today. It doesn't look like changing
timeouts in the storage definition has any impact on the problem.
Here is the definition:
Autochanger {
Name = QS3DLT
Device = DLT-Drive-1
Changer Command = "/usr/lib/bacula/mtx-changer %c %o %S %a %d"
Changer Device = /dev/sg3
}
Device {
Name = DLT-Drive-1 #
Drive Index = 0
Media Type = DLT-VS1
Archive Device = /dev/nst0
AutomaticMount = yes; # when device opened, read it
AlwaysOpen = no;
RemovableMedia = yes;
RandomAccess = no;
AutoChanger = yes
Maximum Changer Wait = 10
Maximum Rewind Wait = 10
Maximum Open Wait = 10
}
I am out of ideas right now...
Nils
T. Horsnell schrieb:
So, clutching at straws, is it definitely the case that you have made
*no* changes to the system since things last worked properly? No
upgrades, no messing with selinux, no messing with mysql (assuming that
is your db system) no messing with firewalls etc. And do you still have
space on the filesystem which contains Bacula's workspace and catalogue?
At one point in your stuff below, you got:
3302 Issuing autochanger "unload 1 0" command.
unload status=Bad 134217743
I wonder what the status of 134217743 means....
Cheers,
Terry.
What groups is the bacula user in and what are the perms on /dev/sg3?
Thanks -Jason
On Thu, 2008-08-07 at 14:02 +0200, Nils Blanck-Wehde wrote:
Hi Terry,
sorry, I forgot to mention: OS is CentOS 5.2, bacula-version is 2.4.2.
Problems started all out of a sudden with 2.4.1 after running fine for
weeks.
Calling mtx and mtx-changer both as root as well as user bacula works
flawlessly (slot 1 is currently loaded):
[root@company-Backupserver ~]# mtx -f /dev/sg3 load 1 0
Drive 0 Full (Storage Element 1 loaded)
[root@company-Backupserver ~]# su - bacula
-bash-3.2$ /usr/sbin/mtx -f /dev/sg3 load 1 0
Drive 0 Full (Storage Element 1 loaded)
[root@company-Backupserver ~]# /usr/lib/bacula/mtx-changer /dev/sg3
loaded 1 /dev/nst0 0
1
[root@company-Backupserver ~]# su - bacula
-bash-3.2$ /usr/lib/bacula/mtx-changer /dev/sg3 loaded 1 /dev/nst0 0
1
-bash-3.2$
I can see from the webadmin of the autochanger (Quantum Superloader 3
DLT) that autochanger commands are being executed correctly, still
bacula reports "ERR=Child died from signal 15: Termination".
Nils
T. Horsnell schrieb:
Assuming your O/S is a Unix/Linux of some sort, have you tried the
basic mtx command on it?
Something like
mtx -f /dev/sg3 load 1 0
(See 'man mtx')
Cheers,
Terry
Hi all,
bacula won't work with our autochanger anymore. I can't find the
source of the problems.
Here is the output of the autochanger-test of: "btape
-c /etc/bacula/bacula-sd.conf /dev/nst0":
=== Autochanger test ===
3301 Issuing autochanger "loaded" command.
Slot 1 loaded. I am going to unload it.
3302 Issuing autochanger "unload 1 0" command.
unload status=Bad 134217743
3992 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3
unload 1 /dev/nst0 0
3992 result="Unloading drive 0 into Storage Element 1...done
Program killed by Bacula watchdog (timeout)
": ERR=Child died from signal 15: Termination
3303 Issuing autochanger "load 1 0" command.
3993 Bad autochanger command: /usr/lib/bacula/mtx-changer /dev/sg3
load 1 /dev/nst0 0
3993 result="Loading media from Storage Element 1 into drive
0...done
Program killed by Bacula watchdog (timeout)
": ERR=Child died from signal 15: Termination
You must correct this error or the Autochanger will not work.
This is the storage-definition:
Autochanger {
Name = QS3DLT
Device = DLT-Drive-1
Changer Command = "/usr/lib/bacula/mtx-changer %c %o %S %a %d"
Changer Device = /dev/sg3
}
Device {
Name = DLT-Drive-1 #
Drive Index = 0
Media Type = DLT-VS1
Archive Device = /dev/nst0
AutomaticMount = yes; # when device opened, read
it
AlwaysOpen = no;
RemovableMedia = yes;
RandomAccess = no;
AutoChanger = yes
Maximum Changer Wait = 10
Maximum Rewind Wait = 10
Maximum Open Wait = 10
}
When I look at the webadmin of the autochanger I see the
autochanger and the drive perform exactly the requested operations
at the usual speed (~1:30min for an unload operation, ~3:50 for an
unload/load operation).
Still I get lots of killing / timeout problems.
I start to wonder if the autochanger is somewhat defective...
If any of you guys can help I would greatly appreciate it.
All the best, Nils
Nils Blanck-Wehde schrieb:
Hi John,
thanks for your help. I am quite new to bacula and it seems to
take some time to fully understand it :-)
I am not sure whether it really is a timeout problem.
I increased all timeout values to 10 minutes and the killing
still occurs (after 2:20min):
07-Aug 12:20 company_bacula-dir JobId 217: Start Backup JobId
217,
Job=Fileserver_Lexware_Exchange_to_Tape.2008-08-07_12.20.03
07-Aug 12:20 company_bacula-dir JobId 217: Using Device
"DLT-Drive-1"
07-Aug 12:20 company_bacula-sd JobId 217: 3301 Issuing
autochanger "loaded? drive 0" command.
07-Aug 12:20 company_bacula-sd JobId 217: 3302 Autochanger
"loaded? drive 0", result: nothing loaded.
07-Aug 12:20 company_bacula-sd JobId 217: 3304 Issuing
autochanger "load slot 8, drive 0" command.
*messages
07-Aug 12:22 company_bacula-sd JobId 217: Fatal error: 3992 Bad
autochanger "load slot 8, drive 0": ERR=Child died from signal
15: Termination.
Results=Loading media from Storage Element 8 into drive
0...done
Program killed by Bacula watchdog (timeout)
07-Aug 12:20 company-appsrv-fd JobId 217: Fatal
error: ../../filed/job.c:1817 Bad response to Append Data
command. Wanted 3000 OK data, got 3903 Error append data
I doublechecked the time needed by the autochanger for an
unload/load operation: I did a couple of operations and they all
terminated in less than four minutes.
I think there is a general problem with the autochanger because
btape test throws the following error when testing the
autochanger:
3301 Issuing autochanger "loaded" command.
3991 Bad autochanger
command: /usr/lib/bacula/mtx-changer /dev/sg3 loaded 1 /dev/nst0
0
3991 result="": ERR=Child died from signal 15: Termination
You must correct this error or the Autochanger will not work.
When I run this command "/usr/lib/bacula/mtx-changer /dev/sg3
loaded 1 /dev/nst0 0" manually, both as root and as user bacula,
it returns "1":
[root@company-Backupserver ~]# su - bacula
-bash-3.2$ /usr/lib/bacula/mtx-changer /dev/sg3 loaded
1 /dev/nst0 0
1
-bash-3.2$
Could it be a permission problem? What Uid Gid should
mtx-changer have? Could some automatic mechanism on my CentOS
installation have changed the permissions of the autochanger
device to a lower level?
I am a little stuck here.
Thanks for all help!
Nils
John Drescher schrieb:
On Wed, Aug 6, 2008 at 1:42 PM, Nils Blanck-Wehde
<nils.blanck-wehde AT backofficeservice DOT biz>
<mailto:nils.blanck-wehde AT backofficeservice DOT biz> wrote:
Hello list,
I just encountered this issue:
06-Aug 19:31 company_bacula-sd JobId 210: Fatal error: 3992
Bad
autochanger "load slot 8, drive 0": ERR=Child died from
signal 15:
Termination.
Results=Loading media from Storage Element 8 into drive
0...done
Program killed by Bacula watchdog (timeout)
Earlier today I got this message:
06-Aug 17:10 company_bacula-dir JobId 208: Using Device
"DLT-Drive-1"
06-Aug 17:10 company_bacula-sd JobId 208: 3301 Issuing
autochanger "loaded? drive 0" command.
06-Aug 17:10 company_bacula-sd JobId 208: 3302 Autochanger
"loaded? drive 0", result: nothing loaded.
06-Aug 17:10 company_bacula-sd JobId 208: 3304 Issuing
autochanger "load slot 1, drive 0" command.
06-Aug 17:13 company-appsrv-fd JobId 208: Fatal
error: ../../filed/job.c:1817 Bad response to Append Data
command. Wanted 3000 OK data
, got 3903 Error append data
06-Aug 17:15 company_bacula-sd JobId 208: Fatal error: 3992
Bad autochanger "load slot 1, drive 0": ERR=Child died from
signal 15: Termination.
Results=Loading media from Storage Element 1 into drive
0...done
Program killed by Bacula watchdog (timeout)
The default maximum changer wait is 5 minutes. If the changer
does not
complete in 5 minutes bacula will kill the mtx-changer
script.
See Maximum Changer Wait in
http://bacula.org/en/rel-manual/Storage_Daemon_Configuratio.html
John
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url="">
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url="">
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
!DSPAM:489b4fd2120211987424007!
--
BackOfficeService
-
Beratung und Service für Ihre IT -
Anschrift:
Niederkastenholzer
Str. 40
53881
Euskirchen
|
Telefon:
+49
2255 953204
Fax:
+49
2255 953208
Mobil: +49
177 3397547
|
Bankverbindung:
Raiffeisenbank
Rheinbach Voreifel eG
Kto.
Nr. 340286014 (BLZ 370 696 27)
|
Online:
info@backofficeservice.biz
www.backofficeservice.biz
|
|
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|
|
|