Bacula-users

Re: [Bacula-users] Full backups keep failing: Network send error to SD

2008-04-08 05:28:47
Subject: Re: [Bacula-users] Full backups keep failing: Network send error to SD
From: Tore Anderson <tore AT linpro DOT no>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 08 Apr 2008 11:28:33 +0200
* Tore Anderson

> Okay, I'll try to run both the SD and the DIR in debug mode, with a 
> tcpdump running on the loopback interface.  Hopefully I'll get
> another aborted job that'll tell me more.  Might not be able to do so
> before the weekend though.

I've found out a bit more.  It seems bacula-sd segfaults, which explains
why all the directors gets their connections reset.  The backtrace is
attached.  This happened with two incremental backups running, dumping
to file-based storage.

I got this error while running bacula-sd with debug-level 1000, but that 
debug log doesn't tell me anything interesting, it loops with similar
messages for quite a while (included only the last few for context
here), then silence:

 [....]
 dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=50822
 dump-sd: append.c:252-0 Enter bnet_get
 dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP 
len=52063
 dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1 
Strm=SPARSE-GZIP len=52063
 rem=28066 remainder=0
 dump-sd: append.c:217-0 !write_record_to_block data_len=52063 rem=24009
 dump-sd: spool.c:471-0 Wrote block FI=62 LI=62
 dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1 
Strm=SPARSE-GZIP len=52063
 rem=64488 remainder=24009
 dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=52063
 dump-sd: append.c:252-0 Enter bnet_get
 dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP 
len=49093
 dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1 
Strm=SPARSE-GZIP len=49093
 rem=40467 remainder=0
 dump-sd: append.c:217-0 !write_record_to_block data_len=49093 rem=8638
 dump-sd: spool.c:471-0 Wrote block FI=62 LI=62
 dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1 
Strm=SPARSE-GZIP len=49093
 rem=64488 remainder=8638
 dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=49093
 dump-sd: append.c:252-0 Enter bnet_get
 dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP 
len=54077
 dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1 
Strm=SPARSE-GZIP len=54077
 rem=55838 remainder=0
 dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=54077
 dump-sd: append.c:252-0 Enter bnet_get
 <<END>>

So it appears the crash happens after «Enter bnet_get» but before the
«before writ_rec» message would have been printed.

I have tcpdumps for all network traffic from when the crash happened,
too, but they're a bit too large for posting here.

Any idea of what could cause this?

Regards
-- 
Tore Anderson

Using host libthread_db library "/lib/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread 46912506654064 (LWP 4195)]
[New Thread 1098918240 (LWP 4945)]
[New Thread 1082132832 (LWP 4941)]
[New Thread 1090525536 (LWP 4201)]
0x00002aaaab33dd76 in select () from /lib/libc.so.6
$1 = "dump-sd", '\0' <repeats 22 times>
$2 = 0x56e0a8 "bacula-sd"
$3 = 0x56e0e8 "/usr/local/sbin/bacula-sd"
$4 = 0x0
$5 = 0x45d1c8 "2.2.8 (26 January 2008)"
$6 = 0x4592a4 "x86_64-unknown-linux-gnu"
$7 = 0x4592ce "debian"
$8 = 0x4592bd "testing/unstable"
#0  0x00002aaaab33dd76 in select () from /lib/libc.so.6
#1  0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21, 
client_wq=0x56ca20, 
    handle_client_request=0x416690 <handle_connection_request(void*)>) at 
bnet_server.c:161
#2  0x0000000000404742 in main (argc=<value optimized out>, argv=<value 
optimized out>) at stored.c:265

Thread 4 (Thread 1090525536 (LWP 4201)):
#0  0x00002aaaaace0c8f in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#1  0x000000000044947c in watchdog_thread (arg=<value optimized out>) at 
watchdog.c:307
#2  0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#3  0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 3 (Thread 1082132832 (LWP 4941)):
#0  0x00002aaaaace319f in __read_nocancel () from /lib/libpthread.so.0
#1  0x0000000000431a35 in read_nbytes (bsock=0x57f0e8, 
    ptr=0x59e820 
"îgª,²\002·/\236\036&/@ó\223\as`g±H»©\025ò÷ɹDê\004\024\032ÛËÖl©c÷\036øÜ0ðd\235É\237\231¯HÎñ\224Ýýi\006\027\b\031òOÒ3â\234´ô\0227gM®Ç\206¶÷eù´eìo\016K\005È\037¬'µz\211ðÿ\001(%¾P",
 nbytes=52912) at bnet.c:82
#2  0x0000000000433962 in BSOCK::recv (this=0x57f0e8) at bsock.c:435
#3  0x000000000043114f in bget_msg (sock=0x57f0e8) at bget_msg.c:60
#4  0x000000000040997b in do_append_data (jcr=0x570d28) at append.c:202
#5  0x000000000041ae34 in append_data_cmd (jcr=0x570d28) at fd_cmds.c:194
#6  0x000000000041acd6 in do_fd_commands (jcr=0x570d28) at fd_cmds.c:165
#7  0x000000000041b815 in run_job (jcr=0x570d28) at fd_cmds.c:128
#8  0x000000000041ba15 in run_cmd (jcr=0x570d28) at job.c:210
#9  0x0000000000416bd7 in handle_connection_request (arg=<value optimized out>) 
at dircmd.c:229
#10 0x000000000044a04d in workq_server (arg=<value optimized out>) at 
workq.c:357
#11 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#12 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#13 0x0000000000000000 in ?? ()

Thread 2 (Thread 1098918240 (LWP 4945)):
#0  0x00002aaaaace40ca in waitpid () from /lib/libpthread.so.0
#1  0x0000000000446f91 in signal_handler (sig=11) at signal.c:167
#2  <signal handler called>
#3  0x00000000004076c8 in detach_dcr_from_dev (dcr=0x574d38) at acquire.c:693
#4  0x00000000004076f2 in free_dcr (dcr=0x44f0f6) at acquire.c:714
#5  0x000000000042b779 in despool_data (dcr=0x580e28, commit=true) at 
spool.c:336
#6  0x000000000042c15a in commit_data_spool (dcr=0x580e28) at spool.c:139
#7  0x0000000000409554 in do_append_data (jcr=0x57fd68) at append.c:318
#8  0x000000000041ae34 in append_data_cmd (jcr=0x57fd68) at fd_cmds.c:194
#9  0x000000000041acd6 in do_fd_commands (jcr=0x57fd68) at fd_cmds.c:165
#10 0x000000000041b815 in run_job (jcr=0x57fd68) at fd_cmds.c:128
#11 0x000000000041ba15 in run_cmd (jcr=0x57fd68) at job.c:210
#12 0x0000000000416bd7 in handle_connection_request (arg=<value optimized out>) 
at dircmd.c:229
#13 0x000000000044a04d in workq_server (arg=<value optimized out>) at 
workq.c:357
#14 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#15 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()

Thread 1 (Thread 46912506654064 (LWP 4195)):
#0  0x00002aaaab33dd76 in select () from /lib/libc.so.6
#1  0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21, 
client_wq=0x56ca20, 
    handle_client_request=0x416690 <handle_connection_request(void*)>) at 
bnet_server.c:161
#2  0x0000000000404742 in main (argc=<value optimized out>, argv=<value 
optimized out>) at stored.c:265
#0  0x00002aaaab33dd76 in select () from /lib/libc.so.6
#0  0x00002aaaab33dd76 in select () from /lib/libc.so.6
No symbol table info available.
#1  0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21, 
client_wq=0x56ca20, 
    handle_client_request=0x416690 <handle_connection_request(void*)>) at 
bnet_server.c:161
161           if ((stat = select(maxfd + 1, &sockset, NULL, NULL, NULL)) < 0) {
Current language:  auto; currently c++
maxfd = <value optimized out>
sockset = {fds_bits = {16, 0 <repeats 15 times>}}
#2  0x0000000000404742 in main (argc=<value optimized out>, argv=<value 
optimized out>) at stored.c:265
265                           &dird_workq, handle_connection_request);
ch = <value optimized out>
no_signals = <value optimized out>
test_config = false
thid = 1082132832
uid = 0x0
gid = 0x0
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer ends at 11:59 p.m., 
Monday, April 7! Use priority code J8TLD2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users