Re: [Bacula-users] [Bacula-devel] Storage Daemon crash backtrace

2010-07-02 11:03:38
Subject: Re: [Bacula-users] [Bacula-devel] Storage Daemon crash backtrace
From: Robert LeBlanc <robert AT leblancnet DOT us>
To: Kern Sibbald <kern AT sibbald DOT com>
Date: Fri, 2 Jul 2010 09:00:46 -0600
On Fri, Jul 2, 2010 at 2:59 AM, Kern Sibbald <kern AT sibbald DOT com> wrote:
> The question that I have is am I missing some debug symbols in other
> packages like open-ssl that would help? I'm not a programmer so backtraces
> are pretty much a wall of text to me. I want to give helpful info so that
> others may not run into the same problem in the future.
>
> If this is not helpful, I'm not sure what else to do, so I'll give up and
> just create a cron job that will restart bacula-sd if it crashes or modify
> btraceback to restart bacula-sd.
>
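
The cron-based restart mentioned above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the process is named bacula-sd, pgrep is available, and a Debian-style init script exists at /etc/init.d/bacula-sd. The function names are hypothetical and are not part of Bacula.

```python
import subprocess


def sd_is_running():
    """True if a process named exactly 'bacula-sd' exists (pgrep exits 0)."""
    result = subprocess.run(["pgrep", "-x", "bacula-sd"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_sd():
    # Assumption: a Debian-style init script at this path; adjust for your system.
    subprocess.run(["/etc/init.d/bacula-sd", "start"], check=True)


def ensure_running(is_running=sd_is_running, start=start_sd):
    """Start the daemon only if it is not running.

    Returns True if a start was issued, False if the daemon was already up.
    """
    if is_running():
        return False
    start()
    return True
```

A script that calls ensure_running() could be run from root's crontab every few minutes. Note that this only hides the crash; it does nothing about the underlying cause.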


The dump does not clearly show what is going on.  I suspect this is because
you are not following the advice in the manual (e.g. you should not use "set
logging"...) as it seems to only partially show what is going on.

However, if I am interpreting what you show above and what is in the log file
as being all the same output, it looks like the problems arise because a
cancel command has been sent to the SD, either by the operator or by a
directive.

In Bacula 5.0.2, cancelling jobs is known to occasionally crash the Director
and the SD.  Perhaps it happens more frequently when TLS is running.  My best
guess is that the libz routines have a signal bug, or perhaps there is a
problem in the Bacula code -- I am not sure.

I do know that we have a number of fixes for the cancel command in Bacula
5.0.3, which will probably be released near the end of the month.  Most if
not all of the fixes are in the SourceForge bacula repo under Branch-5.0.

In the meantime, you should try to find out why Bacula is attempting to
cancel the job and make sure that does not happen.  Perhaps it is a max
runtime or something that is set too short or a rogue operator :-)

I believe that your bug is a duplicate of bug #1568, which is a bug in zlib
that causes it to crash when a signal is received. You will notice that the
tracebacks look very similar to yours.

You might want to talk to Frank Sweetzer about how he is resolving the
problem.  He is also at a University ...


I think this is helpful for me. Debian does run Bacula as bacula.tape, so I'll change it to run as root.root and see if that helps with the automated backtrace. I do think there is some sort of error in SSL, and the problem may be compounded by the cancel bug. Here's why:

I was able to test this on a machine that could not get a good backup. When running a TLS job, the connection is established and the FD starts transferring data to the SD. I watch the spool size increment, and when it stops, I look on the client and see the Send-Q value in netstat for the connection to the SD start incrementing. Thirty minutes later I get "Connection timed out", and then the job is canceled (not put in an error state). Disabling TLS allowed the client to complete the backup on the first try.
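
For anyone who wants to watch the same symptom, here is a rough sketch of pulling the Send-Q value for the SD connection out of `netstat -tn` output on the client. It assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and the default SD port 9103. The function names and the sample text are made up for illustration and are not output from the machine described above.

```python
import subprocess

SD_PORT = 9103  # Bacula's default Storage Daemon port; adjust if yours differs


def send_queues(netstat_output, port=SD_PORT):
    """Return (remote_address, send_q_bytes) for TCP connections to `port`."""
    results = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue  # skips the banner and header lines
        remote = fields[4]
        if remote.rsplit(":", 1)[-1] == str(port):
            results.append((remote, int(fields[2])))
    return results


def current_send_queues(port=SD_PORT):
    """Run `netstat -tn` on the client and parse its output."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True,
                         text=True, check=True).stdout
    return send_queues(out, port)


# Illustrative sample only (documentation addresses, invented numbers).
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  65160 192.0.2.10:41234        192.0.2.20:9103         ESTABLISHED
tcp        0      0 192.0.2.10:22           192.0.2.30:50522        ESTABLISHED
"""
```

Calling current_send_queues() every few seconds while a job runs would show whether Send-Q keeps growing after the spool size stops changing, which is the pattern described above.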

When I get a "Broken pipe", Bacula puts the job in an error state, but a job that hits "Connection timed out" is always canceled. I think this may be triggering the crash. I'll pull HEAD and see if it runs into the same problem. I'm afraid that you might be right about the SSL bug, and it is definitely out of your hands. I'll see what I can do to submit a bug to OpenSSL about it.
 
Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University

