Re: [Bacula-users] Full backups keep failing: Network send error to SD

2008-04-08 05:42:48
Subject: Re: [Bacula-users] Full backups keep failing: Network send error to SD
From: Arno Lehmann <al AT its-lehmann DOT de>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 08 Apr 2008 11:42:22 +0200
Hi,

08.04.2008 11:28, Tore Anderson wrote:
> * Tore Anderson
> 
>> Okay, I'll try to run both the SD and the DIR in debug mode, with a 
>> tcpdump running on the loopback interface.  Hopefully I'll get
>> another aborted job that'll tell me more.  Might not be able to do so
>> before the weekend though.
> 
> I've found out a bit more.  It seems bacula-sd segfaults, which explains
> why all the directors get their connections reset.  The backtrace is
> attached.

Good...

> Thread 2 (Thread 1098918240 (LWP 4945)):
> #0  0x00002aaaaace40ca in waitpid () from /lib/libpthread.so.0
> #1  0x0000000000446f91 in signal_handler (sig=11) at signal.c:167
> #2  <signal handler called>
> #3  0x00000000004076c8 in detach_dcr_from_dev (dcr=0x574d38) at acquire.c:693

Looks like it *could* be related to some changes Kern recently made.

I'd recommend upgrading (at least the SD) to the latest beta version 
and retrying. I assume it's possible to use only the newer SD, so 
you don't have to upgrade your whole setup.

If you compile from source, use 'make' to create all the programs and 
simply run the newer SD instead of the regularly installed one.
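
A rough sketch of what that could look like, not a tested recipe. The 
source directory, the config file location and the configure options 
are assumptions, so adjust them to your system. The -f (foreground), 
-d (debug level) and -c (config file) options are the usual bacula-sd 
ones. The script only prints the commands unless you run it with RUN 
set to an empty value.

```shell
# Assumed, hypothetical locations: change these to match your system.
SRC="${SRC-$HOME/bacula-beta}"               # unpacked beta source tree
CONF="${CONF-/etc/bacula/bacula-sd.conf}"    # your existing SD configuration
CONFIGURE_OPTS="${CONFIGURE_OPTS-}"          # same options as your current build

# Dry run by default: every command is only printed.
# Run with RUN= (empty) to really execute them.
RUN="${RUN-echo}"

build_and_run_sd() {
    $RUN cd "$SRC" || return 1
    $RUN ./configure $CONFIGURE_OPTS
    $RUN make
    # Stop the installed SD first; both cannot listen on the same port.
    # Start the freshly built SD from the build tree, in the foreground.
    $RUN "$SRC/src/stored/bacula-sd" -f -d 100 -c "$CONF"
}

build_and_run_sd
```

Nothing gets installed this way, so going back to the old SD just means 
stopping this one and starting the regular one again.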

> #4  0x00000000004076f2 in free_dcr (dcr=0x44f0f6) at acquire.c:714
> #5  0x000000000042b779 in despool_data (dcr=0x580e28, commit=true) at spool.c:336
> #6  0x000000000042c15a in commit_data_spool (dcr=0x580e28) at spool.c:139
> #7  0x0000000000409554 in do_append_data (jcr=0x57fd68) at append.c:318
> #8  0x000000000041ae34 in append_data_cmd (jcr=0x57fd68) at fd_cmds.c:194
> #9  0x000000000041acd6 in do_fd_commands (jcr=0x57fd68) at fd_cmds.c:165
> #10 0x000000000041b815 in run_job (jcr=0x57fd68) at fd_cmds.c:128
> #11 0x000000000041ba15 in run_cmd (jcr=0x57fd68) at job.c:210
> #12 0x0000000000416bd7 in handle_connection_request (arg=<value optimized out>) at dircmd.c:229
> #13 0x000000000044a04d in workq_server (arg=<value optimized out>) at workq.c:357
> #14 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
> #15 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
> #16 0x0000000000000000 in ?? ()

In any case, this is something worth a bug report. But I'd first check 
with the latest version.


>  This happened with two incremental backups running, dumping
> to file-based storage.
> 
> I got this error while running bacula-sd with debug-level 1000, but that 
> debug log doesn't tell me anything interesting,

Me neither...

...
>  dump-sd: append.c:252-0 Enter bnet_get
>  <<END>>
> 
> So it appears the crash happens after «Enter bnet_get» but before the
> «before writ_rec» message would have been printed.
> 
> I have tcpdumps for all network traffic from when the crash happened,
> too, but they're a bit too large for posting here.

I don't think those will help, but keep them - I'm not the right 
person to analyse this kind of bug :-)

> Any idea of what could cause this?

Coding error, hardware error, compiler error, linker error ;-) If you 
can reproduce it reliably, it's most probably a real bug.

Arno

> Regards

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users