Bacula-users

Re: [Bacula-users] [Bacula-devel] Bacula 3.0.3 deadlock : Job is waiting for execution

2010-01-09 14:23:28
Subject: Re: [Bacula-users] [Bacula-devel] Bacula 3.0.3 deadlock : Job is waiting for execution
From: Renaud Marquet <rmarquet AT gmail DOT com>
To: Kern Sibbald <kern AT sibbald DOT com>
Date: Sat, 09 Jan 2010 20:20:01 +0100
Kern,

although I searched for a possible workaround, I didn't find the ones
you mention. But your statement is not quite correct: pointing to a valid
smtp server is not a proper workaround. If, for some reason, the *valid*
smtp server is down, the problem will still occur, and I bet users will
not figure out the reason.

That's why I came up with this patch. It correctly fixes the problem, but
I recognize it could affect performance, so it should certainly not be
put in the trunk. It will probably even be useless since, as you pointed
out, the problem is already fixed in the development version.

That said, I didn't know the lock manager should be turned off in a
production environment. Moreover, I'm not sure I understand your point:
although I haven't read all the code, it seems pretty strange to me that
a multithreaded application would not use any mutexes in a production
environment.

Regards,
Renaud

On Saturday, 09 January 2010 at 00:03 +0100, Kern Sibbald wrote:
> Hello Arno and Renaud,
> 
> I can believe that there might be a bug in the lock manager software, but I
> am very surprised that it is turned on. It should only be turned on for
> developers, and thus though this patch may be correct (I don't think so, but
> Eric can answer more definitively), it should never be needed in a production
> system, and won't work in a production system because of the lock manager
> being turned off.
> 
> Can you explain why the lock manager code is turned on?
> 
> If this is a problem with a misconfigured mail daemon, then it is very likely 
> that this problem has already shown up and has a very different solution.  
> The problem I just mentioned is fixed in the current development version, and 
> the workaround for version 3.0.x is to ensure that either email is turned off 
> or you point to a valid smtp server.
> 
> Regards,
> 
> Kern
> 
> On Friday 08 January 2010 21:32:18 Arno Lehmann wrote:
> > Hello,
> >
> > this is just forwarding your mail to bacula-devel, where it's more
> > likely to be picked up, looked at, and perhaps integrated into the
> > code base :-)
> >
> > Cheers, and thanks for not only analyzing the problem, but also
> > providing a possible fix!
> >
> > Arno
> >
> > 07.01.2010 16:34, Renaud Marquet wrote:
> > > Hi,
> > >
> > > I'm using bacula 3.0.3 and the director's job queue was stuck after
> > > running the first job. The others were waiting indefinitely for
> > > execution. If the director was restarted, I could run only one job, and
> > > so on.
> > >
> > > > Googling around, I found these 2 posts without satisfying answers:
> > > > http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/upgrade-to-3-0-3-job-is-waiting-for-execution-102156/
> > > > http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/job-is-waiting-for-execuition-101508/
> > >
> > > I then looked at the code and found there is a deadlock happening in
> > > message handling.
> > >
> > > > The problem is located in the close_msg(JCR *) function in message.c. When
> > > > it encounters an error while sending an e-mail, it calls the macro Jmsg1
> > > > (line 485) to report it. This macro calls dispatch_message, which tries
> > > > to acquire fides_mutex (line 738). Unfortunately, this mutex was already
> > > > acquired in close_msg (line 431), thus resulting in a deadlock (as
> > > > stated in the mutex documentation for the PTHREAD_MUTEX_INITIALIZER kind).
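
A minimal sketch of the relocking behaviour described above, assuming a POSIX threads implementation. It is not Bacula code: the helper name relock_errorcheck_mutex is hypothetical, and it uses the portable PTHREAD_MUTEX_ERRORCHECK type rather than the _NP variant named in the patch. With an error-checking mutex, a second lock attempt by the owning thread is expected to fail with EDEADLK instead of blocking. Note that POSIX requires the attribute object to be initialized with pthread_mutexattr_init before pthread_mutexattr_settype is called.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not from Bacula): locks an error-checking mutex
 * twice from the same thread and returns the result of the second lock.
 * Returns -1 if the mutex could not be set up. */
int relock_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    /* The attribute object must be initialized before its type is set. */
    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_mutex_lock(&m) != 0) {   /* first acquisition */
        pthread_mutex_destroy(&m);
        return -1;
    }
    rc = pthread_mutex_lock(&m);         /* same thread again: an error, not a hang */

    pthread_mutex_unlock(&m);            /* release the single lock actually held */
    pthread_mutex_destroy(&m);
    return rc;
}
```

Calling relock_errorcheck_mutex() from a small main() and comparing the result with EDEADLK shows the difference from a mutex created with PTHREAD_MUTEX_INITIALIZER, where the same second lock would typically never return.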
> > >
> > > > This problem was affecting me because the mail daemon was not properly
> > > > configured on my server.
> > > >
> > > > It could be worthwhile to review these parts of the code to avoid such
> > > > a situation.
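
One possible shape for such a review, sketched under assumptions: record the failure while the lock is held and report it only after the lock has been released, so the reporting path can take the same mutex safely. Everything here is hypothetical and merely stands in for the real code: list_mutex plays the role of fides_mutex, report_error the role of the Jmsg1/dispatch_message path, and close_destinations the role of close_msg. This is not how the Bacula development version fixed the problem; the thread does not say how that was done.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for fides_mutex. */
static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static int reports_sent = 0;

/* Hypothetical stand-in for the reporting path: it takes the lock itself,
 * so it must never be called while list_mutex is already held. */
static void report_error(const char *msg)
{
    pthread_mutex_lock(&list_mutex);
    reports_sent++;
    fprintf(stderr, "error: %s\n", msg);
    pthread_mutex_unlock(&list_mutex);
}

/* Hypothetical stand-in for the closing routine. Returns the total number
 * of error reports sent so far. */
int close_destinations(int mail_failed)
{
    const char *pending = NULL;

    pthread_mutex_lock(&list_mutex);
    if (mail_failed)
        pending = "mail program terminated in error";  /* remember, do not report yet */
    pthread_mutex_unlock(&list_mutex);

    if (pending != NULL)
        report_error(pending);  /* lock is no longer held here */
    return reports_sent;
}
```

The trade-off is that the error is reported slightly later than it is detected, in exchange for never calling back into the locking code while the lock is held.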
> > >
> > > > However, I wrote a quick patch for lockmgr.c which simply upgrades
> > > > mutexes to the PTHREAD_MUTEX_ERRORCHECK_NP kind and resolves this error.
> > > >
> > > > Hope this helps someone,
> > > Renaud
> > >
> > > patch :
> > >
> > > > diff -rupN bacula-3.0.3.vanilla/src/lib/lockmgr.c bacula-3.0.3.patched/src/lib/lockmgr.c
> > > > --- bacula-3.0.3.vanilla/src/lib/lockmgr.c        2009-10-18 11:10:16.000000000 +0200
> > > > +++ bacula-3.0.3.patched/src/lib/lockmgr.c        2009-12-31 18:05:59.000000000 +0100
> > > > @@ -616,6 +616,15 @@ void lmgr_cleanup_main()
> > > >   */
> > > >  int lmgr_mutex_lock(pthread_mutex_t *m, const char *file, int line)
> > > >  {
> > > > +   /* Patch to avoid deadlock if mutex is locked more than once */
> > > > +   /* There's some performance hit which makes it probably not acceptable */
> > > > +   /* for large system usage. */
> > > +   if(*m == PTHREAD_MUTEX_INITIALIZER) {
> > > +      pthread_mutexattr_t attr;
> > > +      pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_ERRORCHECK_NP );
> > > +      pthread_mutex_init( m, &attr );
> > > +   }
> > > +
> > >     int ret;
> > >     lmgr_thread_t *self = lmgr_get_thread_info();
> > >     self->pre_P(m, file, line);
> > >
> > >
> > >
> > > -------------------------------------------------------------------------
> > >----- This SF.Net email is sponsored by the Verizon Developer Community
> > > Take advantage of Verizon's best-in-class app development support A
> > > streamlined, 14 day to market process makes app distribution fast and
> > > easy Join now and get one step closer to millions of Verizon customers
> > > http://p.sf.net/sfu/verizon-dev2dev
> > > _______________________________________________
> > > Bacula-users mailing list
> > > Bacula-users AT lists.sourceforge DOT net
> > > https://lists.sourceforge.net/lists/listinfo/bacula-users