Bacula-users

Re: [Bacula-users] [Bacula-devel] Bacula 3.0.3 deadlock : Job is waiting for execution

2010-01-09 16:25:09
From: Renaud Marquet <rmarquet AT gmail DOT com>
To: Kern Sibbald <kern AT sibbald DOT com>
Date: Sat, 09 Jan 2010 22:22:05 +0100
On Saturday 09 January 2010 at 21:25 +0100, Kern Sibbald wrote:
> Hello,
> 
> On Saturday 09 January 2010 20:20:01 Renaud Marquet wrote:
> > Kern,
> >
> > although I searched for a possible workaround, I didn't find the ones
> > you talk about. But your statement is not correct, as pointing to a valid
> > smtp server is not a proper workaround. Actually, if for some reason
> > the *valid* smtp server is down, the problem will occur and I bet users
> > will not figure out the reason.
> 
> I never claimed that my suggestion was a "proper" workaround nor that it was
> a fix.  It is a workaround.

Never mind then ;)

> 
> If you want, you can backport the fixes (applied 23 October 2009), but since 
> we are close to release, and we have a workaround, we are not planning to 
> backport them.

No need to backport. This is not a 'blocker' problem, I just mailed here
in case someone else runs into the same problem, because there wasn't any
answer when googling. Bacula now runs perfectly fine on my system, so I
can wait for the upcoming release without any trouble.

> 
> >
> > That's why I came up with this patch. It correctly fixes the problem but
> > I recognize this could affect performance, so it should certainly not be
> > put in the trunk. It will even probably be useless as you pointed out
> > it's already fixed in the development version.
> 
> Unfortunately your patch does not fix the problem -- it masks the problem.  I 
> didn't look at your patch in detail, but I believe that it will make all 
> locks recursive, which is not really what we want and may lead to some 
> surprises.  
> 
> Bacula does have recursive locks, but we use them only in situations where 
> they need to be used and they are portable.  I am not so much worried about 
> the performance consequences of your patch, but your code is Linux only if I 
> am not mistaken (i.e. not portable), and as I said, the lock manager is not 
> production code.  It is development code and should only be turned on by
> developers for debugging.
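
To make the distinction between recursive and error-checking mutexes concrete, here is a small sketch using the portable POSIX type names (PTHREAD_MUTEX_RECURSIVE and PTHREAD_MUTEX_ERRORCHECK, rather than the _NP variants). It is not Bacula code; init_typed_mutex() and relock_result() are names made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Initialise *m with the given POSIX mutex type; returns 0 on success. */
static int init_typed_mutex(pthread_mutex_t *m, int type)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);   /* attr must be initialised first */
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, type);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock a fresh mutex of the given type twice from the calling thread and
 * return the result of the second attempt (-1 if the mutex could not be
 * created). Every lock that was actually acquired is released again. */
static int relock_result(int type)
{
    pthread_mutex_t m;
    if (init_typed_mutex(&m, type) != 0)
        return -1;

    int first = pthread_mutex_lock(&m);
    assert(first == 0);                       /* uncontended, must succeed */

    int second = pthread_mutex_lock(&m);      /* relock by the owner */
    if (second == 0)
        pthread_mutex_unlock(&m);             /* recursive: undo second lock */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

With these helpers, relock_result(PTHREAD_MUTEX_ERRORCHECK) should return EDEADLK (the second lock is refused, not granted), while relock_result(PTHREAD_MUTEX_RECURSIVE) should return 0 and needs one unlock per lock.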

As I said in another mail, I didn't do anything to activate this lock
manager, so I guess it's not active. I think the confusion comes from the
fact that mutexes are handled through some functions in lockmgr.c (through
a macro), I think even with the lock manager deactivated.
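
As an illustration of how lock calls can be routed through a wrapper function by a macro, here is a hypothetical sketch. Only the (mutex, file, line) parameter shape is taken from the lmgr_mutex_lock() signature in the patch quoted further down; LOCK, UNLOCK and traced_mutex_lock() are invented names, and this is not Bacula's implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Call site of the most recent lock request. Plain globals keep the
 * sketch short; they are NOT thread-safe and are for illustration only. */
static const char *last_lock_file = NULL;
static int last_lock_line = 0;

/* Hypothetical wrapper: record where the lock was requested, then lock. */
static int traced_mutex_lock(pthread_mutex_t *m, const char *file, int line)
{
    assert(m != NULL);
    last_lock_file = file;
    last_lock_line = line;
    return pthread_mutex_lock(m);
}

/* Callers write LOCK(mutex); the macro supplies the call site. */
#define LOCK(m)   traced_mutex_lock(&(m), __FILE__, __LINE__)
#define UNLOCK(m) pthread_mutex_unlock(&(m))
```

With such a macro, every lock in the program passes through the wrapper whether or not any extra checking is enabled inside it, which would match the behaviour described above.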

> 
> >
> > That said, I didn't know lock manager should be turned off in production
> > environment. Moreover, I'm not sure I understand your point because,
> > although I didn't read all the code, it seems pretty strange to me that
> > a multithreaded application should not use any mutexes in a production
> > environment.
> 
> We use mutexes in production as in development.  The lock manager "watches" 
> our lock usage and blows up Bacula if it detects a problem (deadlock, out of 
> order locks, ...).  It is a debug tool and not meant or sufficiently tested
> for production use.  Use it at your own risk.
> 
> That said, you were very clever to figure out the problem. Not many users 
> could do so.

Thank you,
Regards.

> 
> Regards,
> 
> Kern
> 
> >
> > Regards,
> > Renaud
> >
> > On Saturday 09 January 2010 at 00:03 +0100, Kern Sibbald wrote:
> > > Hello Arno and Renaud,
> > >
> > > I can believe that there might be a bug in the lock manager software, but
> > > I am very surprised that it is turned on. It should only be turned on for
> > > developers, and thus though this patch may be correct (I don't think so,
> > > but Eric can answer more definitively), it should never be needed in a
> > > production system, and won't work in a production system because of the
> > > lock manager being turned off.
> > >
> > > Can you explain why the lock manager code is turned on?
> > >
> > > If this is a problem with a misconfigured mail daemon, then it is very
> > > likely that this problem has already shown up and has a very different
> > > solution. The problem I just mentioned is fixed in the current
> > > development version, and the workaround for version 3.0.x is to ensure
> > > that either email is turned off or you point to a valid smtp server.
> > >
> > > Regards,
> > >
> > > Kern
> > >
> > > On Friday 08 January 2010 21:32:18 Arno Lehmann wrote:
> > > > Hello,
> > > >
> > > > this is just forwarding your mail to bacula-devel, where it's more
> > > > likely to be picked up, looked at, and perhaps integrated into the
> > > > code base :-)
> > > >
> > > > Cheers, and thanks for not only analyzing the problem, but also
> > > > providing a possible fix!
> > > >
> > > > Arno
> > > >
> > > > 07.01.2010 16:34, Renaud Marquet wrote:
> > > > > Hi,
> > > > >
> > > > > I'm using bacula 3.0.3 and the director's job queue was stuck after
> > > > > running the first job. The others were waiting indefinitely for
> > > > > execution. If the director was restarted, I could run only one job,
> > > > > and so on.
> > > > >
> > > > > > Googling around I found these 2 posts without satisfying answers:
> > > > > > http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/upgrade-to-3-0-3-job-is-waiting-for-execution-102156/
> > > > > > http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/bacula-25/job-is-waiting-for-execuition-101508/
> > > > >
> > > > > I then looked at the code and found there is a deadlock happening in
> > > > > message handling.
> > > > >
> > > > > > The problem is located in the close_msg(JCR *) function in
> > > > > > message.c. When it encounters an error while sending an e-mail, it
> > > > > > calls the macro Jmsg1 (line 485) to report it. This macro calls
> > > > > > dispatch_message, which tries to acquire fides_mutex (line 738).
> > > > > > Unfortunately, this mutex was already acquired in close_msg (line
> > > > > > 431), thus resulting in a deadlock (as stated in the mutex
> > > > > > documentation for the PTHREAD_MUTEX_INITIALIZER kind).
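
The locking pattern described above can be reproduced in a few lines. The following is a minimal sketch, not Bacula code: msg_mutex, report_error() and close_messages() are hypothetical stand-ins for fides_mutex, the Jmsg1/dispatch_message path and close_msg(). It uses a POSIX error-checking mutex so that the nested lock returns EDEADLK instead of hanging, which is what a default mutex would typically do here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t msg_mutex;   /* hypothetical message lock */
static int last_lock_rc;            /* result of the reporter's lock attempt */

/* Create msg_mutex as an error-checking mutex; returns 0 on success. */
static int setup_msg_mutex(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&msg_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Stand-in for the error reporter: it needs the message lock itself. */
static void report_error(void)
{
    last_lock_rc = pthread_mutex_lock(&msg_mutex);
    if (last_lock_rc == 0) {
        /* ... the message would be dispatched here ... */
        pthread_mutex_unlock(&msg_mutex);
    }
}

/* Stand-in for the closing routine: it holds the lock and, on a mail
 * failure, calls the reporter while still holding it. */
static void close_messages(int mail_failed)
{
    int rc = pthread_mutex_lock(&msg_mutex);
    assert(rc == 0);
    if (mail_failed)
        report_error();             /* nested lock attempt, same thread */
    pthread_mutex_unlock(&msg_mutex);
}
```

After setup_msg_mutex(), calling close_messages(1) should leave last_lock_rc set to EDEADLK: the report is refused because the calling thread already owns the lock. Calling report_error() on its own should succeed with 0.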
> > > > >
> > > > > This problem was affecting me because mail daemon was not properly
> > > > > configured on my server.
> > > > >
> > > > > > It could be interesting to review these parts of the code to avoid
> > > > > > such a situation.
> > > > >
> > > > > However I wrote a quick patch for lockmgr.c which simply upgrades
> > > > > mutexes to PTHREAD_MUTEX_ERRORCHECK_NP kind and resolves this error.
> > > > >
> > > > > > Hope this helps someone,
> > > > > Renaud
> > > > >
> > > > > patch :
> > > > >
> > > > > > diff -rupN bacula-3.0.3.vanilla/src/lib/lockmgr.c bacula-3.0.3.patched/src/lib/lockmgr.c
> > > > > > --- bacula-3.0.3.vanilla/src/lib/lockmgr.c    2009-10-18 11:10:16.000000000 +0200
> > > > > > +++ bacula-3.0.3.patched/src/lib/lockmgr.c    2009-12-31 18:05:59.000000000 +0100
> > > > > > @@ -616,6 +616,15 @@ void lmgr_cleanup_main()
> > > > > >   */
> > > > > >  int lmgr_mutex_lock(pthread_mutex_t *m, const char *file, int line)
> > > > > >  {
> > > > > > +   /* Patch to avoid deadlock if mutex is locked more than once */
> > > > > > +   /* There's some performance hit which makes it probably not acceptable */
> > > > > > +   /* for large system usage. */
> > > > > > +   if(*m == PTHREAD_MUTEX_INITIALIZER) {
> > > > > > +      pthread_mutexattr_t attr;
> > > > > > +      pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_ERRORCHECK_NP );
> > > > > > +      pthread_mutex_init( m, &attr );
> > > > > +   }
> > > > > +
> > > > >     int ret;
> > > > >     lmgr_thread_t *self = lmgr_get_thread_info();
> > > > >     self->pre_P(m, file, line);
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > >---- ----- This SF.Net email is sponsored by the Verizon Developer
> > > > > Community Take advantage of Verizon's best-in-class app development
> > > > > support A streamlined, 14 day to market process makes app
> > > > > distribution fast and easy Join now and get one step closer to
> > > > > millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev
> > > > > _______________________________________________
> > > > > Bacula-users mailing list
> > > > > Bacula-users AT lists.sourceforge DOT net
> > > > > https://lists.sourceforge.net/lists/listinfo/bacula-users
> > >
> > > -------------------------------------------------------------------------
> > >----- This SF.Net email is sponsored by the Verizon Developer Community
> > > Take advantage of Verizon's best-in-class app development support A
> > > streamlined, 14 day to market process makes app distribution fast and
> > > easy Join now and get one step closer to millions of Verizon customers
> > > http://p.sf.net/sfu/verizon-dev2dev
> > > _______________________________________________
> > > Bacula-users mailing list
> > > Bacula-users AT lists.sourceforge DOT net
> > > https://lists.sourceforge.net/lists/listinfo/bacula-users