[Bacula-users] Jobs not completing, but not erroring?

2009-02-18 16:06:57
From: Mingus Dew <shon.stephens AT gmail DOT com>
To: bacula-users <bacula-users AT lists.sourceforge DOT net>
Date: Wed, 18 Feb 2009 16:04:04 -0500
Hi all,
     I've been using Bacula 2.4.2 on Solaris 10 x86 for almost 2 years now. Recently, tape backups have been entering a state that I can only describe as "limbo".

If I check the status of the director, I may see something like this:

Running Jobs:
 JobId Level   Name                       Status
======================================================================
 22649 Increme  RMAN_A_Lvl1_Tape.2009-02-17_13.30.36 is running
 22650 Increme  RMAN_B_Lvl1_Tape.2009-02-17_13.30.38 is waiting on max Storage jobs
 22651 Increme  RMAN_PROD_Lvl1_Tape.2009-02-17_14.00.40 is waiting on max Storage jobs
 22652 Increme  RMAN_BI_Lvl1_Tape.2009-02-17_14.00.42 is waiting on max Storage jobs
 22653 Increme  RMAN_COG_Lvl1_Tape.2009-02-17_14.00.44 is waiting on max Storage jobs

If I check the status of the running jobid or the tape device, it will show this:

Used Volume status:
B00046 on device "Ultrium-TD3" (/dev/rmt/0cbn)
    Reader=0 writers=0 devres=0 volinuse=1
====

Data spooling: 0 active jobs, 0 bytes; 80 total jobs, 47,799,329,608 max bytes/job.
Attr spooling: 0 active jobs, 0 bytes; 80 total jobs, 40,616 max bytes.

Basically, the tape is mounted and reserved and the job shows an "is running" status, but nothing is happening. Because I have no monitoring of how long jobs have been running, these jobs have sat for as long as 3 days without changing status, erroring, or completing. This holds up the subsequent jobs that are waiting for the tape device.
The only commonality I've seen is that they are all tape jobs. Other than that, the level, fileset, etc. are different.
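
A minimal sketch of the kind of run-time check that is missing here, assuming the timestamp suffix on each job name (e.g. 2009-02-17_13.30.36) is close enough to the job's start time to serve as a proxy. The function name stale_jobs and the 12-hour threshold are hypothetical, chosen only for illustration; the script parses the "Running Jobs" table as pasted above and flags jobs that have reported "is running" for too long.

```python
import re
from datetime import datetime, timedelta

# Matches lines of the director's "Running Jobs" table, e.g.
#  22649 Increme  RMAN_A_Lvl1_Tape.2009-02-17_13.30.36 is running
JOB_LINE = re.compile(
    r"^\s*(?P<jobid>\d+)\s+\S+\s+"
    r"(?P<name>\S+)\.(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}\.\d{2}\.\d{2})\s+"
    r"(?P<status>.+?)\s*$"
)


def stale_jobs(status_text, now, max_age=timedelta(hours=12)):
    """Return (jobid, name, age) for jobs shown as "is running" longer than max_age.

    Assumption: the timestamp embedded in the job name approximates the start time.
    """
    stale = []
    for line in status_text.splitlines():
        m = JOB_LINE.match(line)
        if not m or m.group("status") != "is running":
            continue  # skip headers, separators, and jobs that are only waiting
        started = datetime.strptime(m.group("stamp"), "%Y-%m-%d_%H.%M.%S")
        age = now - started
        if age > max_age:
            stale.append((int(m.group("jobid")), m.group("name"), age))
    return stale


if __name__ == "__main__":
    import sys

    # Read the director status text from standard input and report stale jobs.
    for jobid, name, age in stale_jobs(sys.stdin.read(), datetime.now()):
        print(f"JobId {jobid} ({name}) has been running for {age}")
```

One way to use it would be to pipe the output of a "status dir" command from bconsole into the script from cron and mail any lines it prints; the threshold would need tuning to the longest legitimate backup.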

On one occasion, when I cancelled one of these long-running jobs, I got this error:

Hostname    : BUG!
Date    : 2009-02-11 14:00:30
Severity    : err

unregister_watchdog_unlocked called before start_watchdog


Hostname    : BUG!
Date    : 2009-02-11 14:00:30
Severity    : err

bacula-dir[20200]: [ID 702911 daemon.error] backup4.director: ABORTING due to ERROR in watchdog.c:206

If anyone has any advice on what might be happening, I would really appreciate your responses.

Thank you,
Shon


