Re: [Bacula-users] [Bacula-devel] Storage Daemon crash backtrace

2010-07-02 11:18:46
From: Kern Sibbald <kern AT sibbald DOT com>
To: Robert LeBlanc <robert AT leblancnet DOT us>
Date: Fri, 2 Jul 2010 17:17:08 +0200
Hello Robert,

Eric and I "finished" Bacula Enterprise version 4.0.0 today, a bit faster than 
I expected, so I am now running all the final tests, which gave me some time 
to look at the problem.

I downloaded the zlib source code, and I don't immediately see anything in the 
file that would cause problems -- of course it is quite complicated code.

I did look through the Bacula TLS code, and I noticed that the author did not 
properly set error conditions in Bacula when the code finds an error on the 
comm line.  This could allow Bacula to continue running and to make 
subsequent calls to OpenSSL subroutines when there is no valid data, and 
thus cause the seg fault.  I still must test the changes I made.
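
To make the failure mode concrete, here is a minimal sketch of a channel that records an error condition on the first comm line failure and refuses any further I/O. It is not the Bacula TLS code or the change described above: `GuardedChannel`, `CommError`, `FlakyTransport`, and the `errors` and `terminated` attributes are hypothetical names chosen for illustration, and the transport is a stand-in rather than a real OpenSSL connection.

```python
class CommError(Exception):
    """Raised when the comm line fails or is used after a failure."""


class GuardedChannel:
    """Hypothetical wrapper that latches the first I/O failure."""

    def __init__(self, transport):
        self._transport = transport  # any object with recv(nbytes) -> bytes
        self.errors = 0              # error condition visible to callers
        self.terminated = False      # set once the line is known to be dead

    def recv(self, nbytes):
        if self.terminated:
            # Never call into the lower layer again: there is no valid data.
            raise CommError("channel already terminated")
        try:
            data = self._transport.recv(nbytes)
        except OSError as exc:       # covers BrokenPipeError and TimeoutError
            self._fail()
            raise CommError(str(exc)) from exc
        if not data:                 # empty read: peer closed the connection
            self._fail()
            raise CommError("connection closed by peer")
        return data

    def _fail(self):
        # Record the error so that later calls are rejected up front.
        self.errors += 1
        self.terminated = True


class FlakyTransport:
    """Stand-in for a TLS socket: one good read, then a broken pipe."""

    def __init__(self):
        self.calls = 0

    def recv(self, nbytes):
        self.calls += 1
        if self.calls == 1:
            return b"data"
        raise BrokenPipeError("Broken pipe")
```

With this guard, a second `recv` after the broken pipe raises `CommError` without touching the transport again. The unguarded version is the risky one: it would keep calling into the lower layer with a dead connection.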

It is rather a long shot, but if you see that every time the SD crashes there 
is a disrupted comm line, then that could well be the problem -- of course, 
if one has a good solid network, there should never be any "broken pipe" 
errors, which is possibly why we cannot see the problem.
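
One rough way to check for that correlation is to pull the comm line errors out of the job log and compare them with the crash times. The sketch below assumes only that the log is plain text and that the error text contains "Broken pipe" or "Connection timed out", the two messages mentioned in this thread. The function and constant names are hypothetical, and no particular Bacula log format is assumed.

```python
# Substrings taken from the errors discussed in this thread, lowercased.
COMM_ERROR_MARKERS = ("broken pipe", "connection timed out")


def comm_error_lines(lines):
    """Return the log lines that mention a disrupted comm line."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in COMM_ERROR_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits
```

Passing an open log file to `comm_error_lines` returns the matching lines in order. If every SD crash is preceded by one of them, that supports the comm line theory; crashes with no such line nearby would point elsewhere.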

Having said this, I cannot rule out a problem in OpenSSL at this point.


Best regards,

Kern

On Friday 02 July 2010 17:00:46 Robert LeBlanc wrote:
> On Fri, Jul 2, 2010 at 2:59 AM, Kern Sibbald <kern AT sibbald DOT com> wrote:
> > > The question that I have is am I missing some debug symbols in other
> > > packages like open-ssl that would help? I'm not a programmer so
> > > backtraces are pretty much a wall of text to me. I want to give helpful
> > > info so that others may not run into the same problem in the future.
> > >
> > > If this is not helpful, I'm not sure what else to do, so I'll give up
> > > and just create a cron job that will restart bacula-sd if it crashes or
> > > modify btraceback to restart bacula-sd.
> >
> > The dump does not clearly show what is going on.  I suspect this is
> > because you are not following the advice in the manual (e.g. you should
> > not use "set logging"...) as it seems to only partially show what is
> > going on.
> >
> > However, if I am interpreting what you show above and what is in the log
> > file as being all the same output, it looks like the problems are coming
> > because, either by the operator or by a directive, a cancel command has
> > been sent to the SD.
> >
> > In Bacula 5.0.2, cancelling jobs is known to occasionally crash the
> > Director and the SD.  Perhaps it happens more frequently when TLS is
> > running.  My best guess is that the libz routines have a signal bug, or
> > perhaps there is a problem in the Bacula code -- I am not sure.
> >
> > I do know that we have a number of fixes for the cancel command in Bacula
> > 5.0.3, which will probably be released near the end of the month.  Most
> > if not all of the fixes are in the SourceForge bacula repo under
> > Branch-5.0.
> >
> > In the meantime, you should try to find out why Bacula is attempting to
> > cancel the job and make sure that does not happen.  Perhaps it is a max
> > runtime or something that is set too short or a rogue operator :-)
> >
> > I believe that your bug is a duplicate of bug #1568, which is a bug in
> > zlib that causes it to crash when a signal is received. You will notice
> > that the tracebacks look very similar to yours.
> >
> > You might want to talk to Frank Sweetzer about how he is resolving the
> > problem.  He is also at a University ...
>
> I think this is helpful for me. Debian does run bacula under bacula.tape,
> so I'll change it to run under root.root and see if that helps with the
> automated backtrace. I do think there is some sort of error in the SSL, and
> the problem may be compounded by the cancel bug. Here's why:
>
> I was able to test this on a machine that was not able to get a good
> backup. When running a TLS job, the connection is established and the FD
> starts transferring data to the SD. I watch as the spool size increments
> and when it stops, I look on the client and the SEND-Q in netstat for the
> connection to the SD starts incrementing. 30 minutes later, I get
> "Connection timed out", and then the job is canceled (not put in error
> state). (Disabling TLS allowed the client to complete the backup on the
> first try).
>
> When I get a "Broken pipe", then bacula puts the job in error state, but a
> job that hits "Connection timed out" is always canceled. I think this may
> be triggering the crash. I'll pull head and see if it runs into the same
> problem. I'm afraid that you might be right about the SSL bug and it is
> definitely out of your hands. I'll see what I can do to submit a bug to
> OpenSSL about it.
>
> Robert LeBlanc
> Life Sciences & Undergraduate Education Computer Support
> Brigham Young University



_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users