[Bacula-users] Bacula 3.0.3 deadlock : Job is waiting for execution

2010-01-07 10:38:41
Subject: [Bacula-users] Bacula 3.0.3 deadlock : Job is waiting for execution
From: Renaud Marquet <rmarquet AT gmail DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Thu, 07 Jan 2010 16:34:28 +0100
Hi,

I'm using Bacula 3.0.3, and the director's job queue got stuck after
running the first job. The remaining jobs waited indefinitely for
execution. After restarting the director, I could again run only one
job before the queue got stuck, and so on.

Googling around, I found these 2 posts, neither with a satisfying answer:
http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/upgrade-to-3-0-3-job-is-waiting-for-execution-102156/
http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/job-is-waiting-for-execuition-101508/

I then looked at the code and found there is a deadlock happening in
message handling.

The problem is located in the close_msg(JCR *) function in message.c.
When it encounters an error while sending an e-mail, it calls the macro
Jmsg1 (line 485) to report it. This macro calls dispatch_message, which
tries to acquire fides_mutex (line 738). Unfortunately, this mutex was
already acquired in close_msg (line 431), so the thread blocks on a lock
it holds itself and deadlocks (as stated in the mutex documentation for
the PTHREAD_MUTEX_INITIALIZER kind).
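
To illustrate the pattern described above, here is a minimal sketch in
plain C. It is not Bacula's code: demo_mutex, report_error and
close_with_error are hypothetical stand-ins for fides_mutex,
dispatch_message and close_msg. The reporting function uses
pthread_mutex_trylock instead of a blocking lock, so the demo returns
EBUSY where the real code would hang.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the mutex discussed in the post. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the reporting path. It needs the same mutex as its caller.
 * trylock is used so the demo returns instead of hanging:
 * 0 means the lock was free, EBUSY means it is already held. */
static int report_error(void)
{
    int rc = pthread_mutex_trylock(&demo_mutex);
    if (rc == 0) {
        rc = pthread_mutex_unlock(&demo_mutex);
        assert(rc == 0);
    }
    return rc;
}

/* Stand-in for the closing path: takes the lock, then reports an
 * error while still holding it. */
static int close_with_error(void)
{
    int rc = pthread_mutex_lock(&demo_mutex);
    assert(rc == 0);
    int seen = report_error();   /* a blocking lock here would never return */
    rc = pthread_mutex_unlock(&demo_mutex);
    assert(rc == 0);
    return seen;
}
```

Calling close_with_error() yields EBUSY because the calling thread
already owns the mutex, while calling report_error() on its own yields 0.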

This problem affected me because the mail daemon was not properly
configured on my server.

It might be worth reviewing these parts of the code to avoid such
situations.
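
One possible shape for such a review, given only as a sketch and not as
what Bacula does, is to record the error while the lock is held and
report it only after the lock has been released. All names here
(queue_mutex, send_report, close_then_report, reports_sent) are
hypothetical, and the message text is made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical mutex shared by the closing and reporting paths. */
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static int reports_sent = 0;

/* Reporting path: takes the mutex itself, so callers must not hold it. */
static void send_report(const char *text)
{
    int rc = pthread_mutex_lock(&queue_mutex);
    assert(rc == 0);
    fprintf(stderr, "error: %s\n", text);
    reports_sent++;
    rc = pthread_mutex_unlock(&queue_mutex);
    assert(rc == 0);
}

/* Closing path: remembers the failure under the lock and reports it
 * only after unlocking. Returns the number of reports sent so far. */
static int close_then_report(int mail_failed)
{
    char pending[128] = "";

    int rc = pthread_mutex_lock(&queue_mutex);
    assert(rc == 0);
    if (mail_failed) {
        snprintf(pending, sizeof pending, "mail program failed");
    }
    rc = pthread_mutex_unlock(&queue_mutex);
    assert(rc == 0);

    if (pending[0] != '\0') {
        send_report(pending);   /* safe: the lock is no longer held */
    }
    return reports_sent;
}
```

The trade-off is that the report is emitted slightly later and the
message has to be copied into a local buffer first.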

However, I wrote a quick patch for lockmgr.c that simply upgrades
mutexes to the PTHREAD_MUTEX_ERRORCHECK_NP kind, which resolves this
error.
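
For reference, a standalone sketch of what an error-checking mutex does
on a second lock from the same thread. It uses the portable name
PTHREAD_MUTEX_ERRORCHECK (the _NP spelling is the older glibc alias)
and calls pthread_mutexattr_init before setting the type, which POSIX
requires. The helper names init_errorcheck_mutex and relock_result are
made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Initialise *m as an error-checking mutex. Returns 0 on success. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);  /* attr must be initialised first */
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) {
        rc = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock twice from the same thread and return the second lock's result. */
static int relock_result(void)
{
    pthread_mutex_t m;
    int rc = init_errorcheck_mutex(&m);
    assert(rc == 0);

    rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    int second = pthread_mutex_lock(&m);  /* fails with EDEADLK, does not block */

    rc = pthread_mutex_unlock(&m);
    assert(rc == 0);
    pthread_mutex_destroy(&m);
    return second;
}
```

Note that the second lock fails rather than succeeds, so the caller
still has to check the return value and must not unlock twice.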

Hope this helps someone,
Renaud

Patch:

diff -rupN bacula-3.0.3.vanilla/src/lib/lockmgr.c bacula-3.0.3.patched/src/lib/lockmgr.c
--- bacula-3.0.3.vanilla/src/lib/lockmgr.c      2009-10-18 11:10:16.000000000 +0200
+++ bacula-3.0.3.patched/src/lib/lockmgr.c      2009-12-31 18:05:59.000000000 +0100
@@ -616,6 +616,15 @@ void lmgr_cleanup_main()
  */
 int lmgr_mutex_lock(pthread_mutex_t *m, const char *file, int line)
 {
+   /* Patch to avoid deadlock if mutex is locked more than once */
+   /* There's some performance hit which makes it probably not acceptable */
+   /* for large system usage. */
+   if(*m == PTHREAD_MUTEX_INITIALIZER) {
+      pthread_mutexattr_t attr;
+      pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_ERRORCHECK_NP );
+      pthread_mutex_init( m, &attr );
+   }
+
    int ret;
    lmgr_thread_t *self = lmgr_get_thread_info();
    self->pre_P(m, file, line);


