Re: [Bacula-users] [Bacula-devel] Storage Daemon crash backtrace

2010-07-02 05:04:18
Subject: Re: [Bacula-users] [Bacula-devel] Storage Daemon crash backtrace
From: Kern Sibbald <kern AT sibbald DOT com>
To: Robert LeBlanc <robert AT leblancnet DOT us>
Date: Fri, 2 Jul 2010 10:59:13 +0200
On Friday 02 July 2010 06:02:10 Robert LeBlanc wrote:
> On Wed, Jun 30, 2010 at 8:35 AM, Robert LeBlanc <robert AT leblancnet DOT us> wrote:
> > On Wed, Jun 30, 2010 at 1:06 AM, Kern Sibbald <kern AT sibbald DOT com> wrote:
> >> This seems to be a support issue.  The dump that you posted shows no
> >> indication of a crash, which means that your understanding of a crash
> >> and mine are different.
> >>
> >> This is possibly a deadlock, but I won't spend any more time on it until
> >> the problem is a bit clearer.
> >>
> >> Best regards,
> >>
> >> Kern
> >>
> >> By the way, if this is a production system, you should be running on
> >> Lenny, which is known to be stable, and we support it.
> >
> > I'm not really sure what you need as a good backtrace, since I'm not a
> > programmer. I always thought that a segfault leads to a program crashing. I
> > just don't know enough about gdb to know when there is enough
> > information. All I know is that when it crashes when running as a daemon,
> > I get a traceback that is useless in my e-mail (says no ptrace). When I
> > run it under gdb and get the segfault, when I type 'cont' it says that
> > bacula-sd has exited, and when I run it again, it doesn't complain that a
> > process is already running. In both cases, there is no process called
> > bacula-sd running on the system.
> >
> > I updated/upgraded about 10 clients yesterday to using TLS, and I did not
> > get a crash from the SD. I will keep running it under the debugger in
> > case it crashes again, although, I'm not sure how useful it will be if I
> > can not operate gdb correctly to get you anything helpful. I have a
> > feeling it's some perfect storm of configuration that may be causing the
> > issue. I've been running Bacula for 6 years and never have had a problem
> > like this. I'm just trying to help the project be as robust as possible
> > because we like it and it has treated us so well in the past.
> >
> > As a side note, I get a lot more connection timeouts and broken pipes
> > when using TLS, adding heartbeat interval helps, but it is not a silver
> > bullet. Most of the back-ups are succeeding with only a few here and
> > there having problems. Not using TLS and not having heartbeat interval,
> > the back-ups always succeed. I'll keep working through things and see if I
> > can come up with anything.
> >
> > Thank you for the time and the great project.
> >
> >
> > Robert LeBlanc
> > Life Sciences & Undergraduate Education Computer Support
> > Brigham Young University
> >
> > P.S. We are working on a support contract and will be talking with you in
> > about 24 hours with many others from our group who are also interested in
> > using Bacula.

OK.  Thanks for your interest in Bacula.  I am sorry to see that you are 
having problems. Too bad Bacula isn't working as well as it should for you.

A support contract would help as then we can spend some serious time digging 
into the details of what is going wrong with zlib, and maybe we can get to a 
fix or at least a workaround.  Even with a support contract, I have the 
feeling that this may not be so easy or fast to fix -- read below ...

>
> I know you are probably getting tired of hearing from me, but I had another
> crash today. 

Not really, I am just concerned that we haven't clearly seen the problem.  
However, with what you have inline below and what is in the attachment, I 
think I see what is going on -- see below.

> I'm attaching the backtrace that I got this time. I typed 
> 'cont' after the backtrace and all it said was that all the threads exited
> (this is in the log this time). Here is what was before the back trace:
>
> [Thread 0x7fffebfff710 (LWP 25670) exited]
> [New Thread 0x7fffebfff710 (LWP 25671)]
> [Thread 0x7fffebfff710 (LWP 25671) exited]
> [Thread 0x7ffff0e88710 (LWP 24428) exited]
> [Thread 0x7ffff1e8a710 (LWP 25530) exited]
> [Thread 0x7ffff2e8c710 (LWP 25663) exited]
> [New Thread 0x7ffff2e8c710 (LWP 25785)]
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff2e8c710 (LWP 25785)]
> 0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
> (gdb) set loggin file /home/rleblanc/bacula-sd-seg.log
> (gdb) set logging on
> Copying output to /home/rleblanc/bacula-sd-seg.log.
> (gdb) thread apply all bt
>
> Thread 219 (Thread 0x7ffff2e8c710 (LWP 25785)):
> #0  0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
> #1  0x00007ffff77c6ef7 in ?? () from /usr/lib/libz.so.1
> #2  0x00007ffff77c40eb in ?? () from /usr/lib/libz.so.1
> #3  0x00007ffff77c2251 in deflate () from /usr/lib/libz.so.1
> #4  0x00007ffff5eea6f2 in ?? () from /usr/lib/libcrypto.so.0.9.8
>
> The question I have is: am I missing debug symbols in other packages,
> such as OpenSSL, that would help? I'm not a programmer, so backtraces
> are pretty much a wall of text to me. I want to give helpful info so that
> others may not run into the same problem in the future.
>
> If this is not helpful, I'm not sure what else to do, so I'll give up and
> just create a cron job that will restart bacula-sd if it crashes or modify
> btraceback to restart bacula-sd.
>
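
A minimal sketch of the cron-driven restart mentioned above, assuming a Linux host where `pgrep` is available. The function names and the restart command in the closing comment are hypothetical placeholders and do not come from this thread. A watchdog like this only hides the crash; it does nothing about the cause.

```python
import subprocess


def is_running(name, run=subprocess.run):
    """Return True if pgrep finds a process whose name matches exactly."""
    # pgrep exits with status 0 when at least one process matches.
    result = run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(name, restart_cmd, run=subprocess.run):
    """Start the daemon only if it is absent.

    Returns True if a restart was attempted, False if it was already running.
    """
    if is_running(name, run):
        return False
    # check=True raises CalledProcessError if the restart command fails,
    # so cron can report the failure by mail.
    run(restart_cmd, check=True)
    return True


# Hypothetical usage from a script that cron runs every few minutes
# (the init script path is an assumption; adjust for the local system):
#   ensure_running("bacula-sd", ["/etc/init.d/bacula-sd", "start"])
```

The `run` parameter exists only so the logic can be exercised without touching real processes.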


The dump does not clearly show what is going on.  I suspect this is because 
you are not following the advice in the manual (e.g. you should not use "set 
loggin"...), as the output seems to show only part of the picture.

However, if I am interpreting what you show above and what is in the log file 
as being all the same output, it looks like the problems arise because a 
cancel command has been sent to the SD, either by the operator or by a 
directive.

In Bacula 5.0.2, cancelling jobs is known to occasionally crash the Director 
and the SD.  Perhaps it happens more frequently when TLS is running.  My best 
guess is that the libz routines have a signal bug, or perhaps there is a 
problem in the Bacula code -- I am not sure.  

I do know that we have a number of fixes for the cancel command in Bacula 
5.0.3, which will probably be released near the end of the month.  Most if 
not all of the fixes are in the SourceForge bacula repo under Branch-5.0.

In the meantime, you should try to find out why Bacula is attempting to 
cancel the job and make sure that does not happen.  Perhaps it is a max 
runtime or something that is set too short or a rogue operator :-)
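
One way to start that search is to list the time-limit directives in the Director configuration. The following is only a rough sketch: the function name is hypothetical, the matching rule (any directive whose name starts with "max" and ends with "time" or "delay") is an assumption about which settings are worth reviewing, and it handles neither several directives on one line separated by semicolons nor a "#" inside a quoted value.

```python
def find_time_limits(conf_text):
    """Return (line number, directive, value) for time-limit style directives."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        # Drop trailing comments, then skip lines that are not "name = value".
        line = line.split("#", 1)[0]
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Bacula ignores case and spaces in directive names, so normalise both.
        key = "".join(name.split()).lower()
        if key.startswith("max") and key.endswith(("time", "delay")):
            hits.append((lineno, name.strip(), value.strip()))
    return hits


# Hypothetical usage (the path is an assumption and varies by installation):
#   with open("/etc/bacula/bacula-dir.conf") as f:
#       for lineno, name, value in find_time_limits(f.read()):
#           print(lineno, name, "=", value)
```

Any value that looks short compared with how long the job normally runs is worth a closer look.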

I believe that your bug is a duplicate of bug #1568, which is a bug in zlib 
that causes it to crash when a signal is received. You will notice that the 
tracebacks look very similar to yours.

You might want to talk to Frank Sweetzer about how he is resolving the 
problem.  He is also at a University ...

Best regards,

Kern
