Bacula-users

[Bacula-users] RunAfterJob to eject tape works on successful jobs, but hangs bacula forever on cancelled jobs

2012-10-01 13:34:38
Subject: [Bacula-users] RunAfterJob to eject tape works on successful jobs, but hangs bacula forever on cancelled jobs
From: "Jesse Osby" <josby AT allegroconsultants DOT com>
To: <bacula-users AT lists.sourceforge DOT net>
Date: Mon, 1 Oct 2012 13:18:44 -0400

I’m actually having this same problem on two different bacula setups for two different clients.  Details are:

 

Bacula 5.0.3 on CentOS 5.3 x86 -- HP StorageWorks 8-slot LTO3 changer

Bacula 5.2.12 on CentOS 5.8 x86 -- Dell PV-124T 16-slot LTO3 changer

 

I’ll give specifics on the 1st setup, though the identical problem is occurring on both.  We want all tapes ejected back into the changer after each job so that they are available for the person who performs the tape rotations.  Both RunAfterJob and RunAfterFailedJob directives on the job are "/etc/bacula/unmount-tape", the contents of which are:

 

/usr/sbin/bconsole -c /etc/bacula/bconsole.conf < /etc/bacula/unmount-tape.internal-commands

 

The text in that commands file reads “release StorageWorks”.    When the backup job succeeds, I’ll see something like this in the report email I get:

 

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: shell command: run AfterJob "/etc/bacula/unmount-tape"

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: Connecting to Director bacula:9101

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: 1000 OK: bacula.servers.richmond.alleg Version: 5.0.3 (30 August 2010)

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: Enter a period to cancel a command.

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: release StorageWorks

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: Automatically selected Catalog: MyCatalog

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: Using Catalog "MyCatalog"

27-Sep 18:47 bacula.servers.richmond.alleg JobId 13844: AfterJob: 3307 Issuing autochanger "unload slot 1, drive 0" command.

27-Sep 18:49 bacula.servers.richmond.alleg JobId 13844: AfterJob: 3002 Device "LTO-3" (/dev/nst0) unmounted.

27-Sep 18:49 bacula.servers.richmond.alleg JobId 13844: AfterJob: You have messages

 

Everything works as expected.  However, if the backup job gets cancelled, either manually or due to exceeding its max runtime, I never receive the email at all.  A week may go by before I notice and log in to the server to see what’s up.  What I find if I run “messages” in bconsole is something like this:

 

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: shell command: run AfterJob "/etc/bacula/unmount-tape"

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Connecting to Director bacula:9101

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: 1000 OK: bacula.servers.richmond.alleg Version: 5.0.3 (30 August 2010)

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Enter a period to cancel a command.

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: release StorageWorks

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Automatically selected Catalog: MyCatalog

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Using Catalog "MyCatalog"

*

 

And then no more messages, indefinitely.  If I look at the process list for bacula processes, I see this:

 

30729 ?        S      0:00 /bin/bash /etc/bacula/unmount-tape

30730 ?        Sl     0:00 /usr/sbin/bconsole -c /etc/bacula/bconsole.conf

 

And it’s just stuck there forever.  Oddly, if I do “kill 30730” and then keep watching the process list, I’ll see something like this go by:

 

1861 ?        S      0:00 /bin/sh /usr/lib/bacula/mtx-changer /dev/sg2 unload 4

1867 ?        S      0:00 /usr/sbin/mtx -f /dev/sg2 unload 4 0

 

And then tape finally ejects and the job failure notification email finally goes out to me.  The last few lines of that email will be like:

 

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Automatically selected Catalog: MyCatalog

01-Oct 12:00 bacula.servers.richmond.alleg JobId 13906: AfterJob: Using Catalog "MyCatalog"

01-Oct 12:38 bacula.servers.richmond.alleg JobId 13906: Error: Runscript: AfterJob returned non-zero status=1. ERR=Child exited with code 1

 

So it seems the storage daemon is in fact getting the word that it needs to unload the changer, but it  never actually does it until I kill that bconsole process that had told it to do so.

 

Here are the lines from my bacula-sd.conf for the tape changer:

 

Device {

  Name = LTO-3

  Archive Device = /dev/nst0

  Device Type = Tape

  Media Type = LTO-3

  AutoChanger = yes

  RemovableMedia = yes;

  RandomAccess = no;

  Requires mount = no;

  Maximum File Size = 4GB

  Drive Index = 0

}

Autochanger {

  Name = StorageWorks

  Device = LTO-3

  Changer Command = "/usr/lib/bacula/mtx-changer %c %o %S %a %d"

  Changer Device = /dev/sg2

}

 

 

Any ideas?  The fact that it works fine whenever the backup job completes normally is so odd.  I’m pretty stumped.

 

Thanks

 

------------------------------------------------------------------------------
Got visibility?
Most devs has no idea what their production app looks like.
Find out how fast your code is with AppDynamics Lite.
http://ad.doubleclick.net/clk;262219671;13503038;y?
http://info.appdynamics.com/FreeJavaPerformanceDownload.html
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
<Prev in Thread] Current Thread [Next in Thread>
  • [Bacula-users] RunAfterJob to eject tape works on successful jobs, but hangs bacula forever on cancelled jobs, Jesse Osby <=