* Tore Anderson
> Okay, I'll try to run both the SD and the DIR in debug mode, with a
> tcpdump running on the loopback interface. Hopefully I'll get
> another aborted job that'll tell me more. Might not be able to do so
> before the weekend though.
I've found out a bit more. It seems bacula-sd segfaults, which explains
why all the directors gets their connections reset. The backtrace is
attached. This happened with two incremental backups running, dumping
to file-based storage.
I got this error while running bacula-sd with debug-level 1000, but that
debug log doesn't tell me anything interesting, it loops with similar
messages for quite a while (included only the last few for context
here), then silence:
[....]
dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=50822
dump-sd: append.c:252-0 Enter bnet_get
dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP
len=52063
dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1
Strm=SPARSE-GZIP len=52063
rem=28066 remainder=0
dump-sd: append.c:217-0 !write_record_to_block data_len=52063 rem=24009
dump-sd: spool.c:471-0 Wrote block FI=62 LI=62
dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1
Strm=SPARSE-GZIP len=52063
rem=64488 remainder=24009
dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=52063
dump-sd: append.c:252-0 Enter bnet_get
dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP
len=49093
dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1
Strm=SPARSE-GZIP len=49093
rem=40467 remainder=0
dump-sd: append.c:217-0 !write_record_to_block data_len=49093 rem=8638
dump-sd: spool.c:471-0 Wrote block FI=62 LI=62
dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1
Strm=SPARSE-GZIP len=49093
rem=64488 remainder=8638
dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=49093
dump-sd: append.c:252-0 Enter bnet_get
dump-sd: append.c:213-0 before writ_rec FI=62 SessId=1 Strm=SPARSE-GZIP
len=54077
dump-sd: record.c:260-0 write_record_to_block() FI=62 SessId=1
Strm=SPARSE-GZIP len=54077
rem=55838 remainder=0
dump-sd: append.c:232-0 write_record FI=62 SessId=1 Strm=SPARSE-GZIP len=54077
dump-sd: append.c:252-0 Enter bnet_get
<<END>>
So it appears the crash happens after «Enter bnet_get» but before the
«before writ_rec» message would have been printed.
I have tcpdumps for all network traffic from when the crash happened,
too, but they're a bit too large for posting here.
Any idea of what could cause this?
Regards
--
Tore Anderson
Using host libthread_db library "/lib/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread 46912506654064 (LWP 4195)]
[New Thread 1098918240 (LWP 4945)]
[New Thread 1082132832 (LWP 4941)]
[New Thread 1090525536 (LWP 4201)]
0x00002aaaab33dd76 in select () from /lib/libc.so.6
$1 = "dump-sd", '\0' <repeats 22 times>
$2 = 0x56e0a8 "bacula-sd"
$3 = 0x56e0e8 "/usr/local/sbin/bacula-sd"
$4 = 0x0
$5 = 0x45d1c8 "2.2.8 (26 January 2008)"
$6 = 0x4592a4 "x86_64-unknown-linux-gnu"
$7 = 0x4592ce "debian"
$8 = 0x4592bd "testing/unstable"
#0 0x00002aaaab33dd76 in select () from /lib/libc.so.6
#1 0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21,
client_wq=0x56ca20,
handle_client_request=0x416690 <handle_connection_request(void*)>) at
bnet_server.c:161
#2 0x0000000000404742 in main (argc=<value optimized out>, argv=<value
optimized out>) at stored.c:265
Thread 4 (Thread 1090525536 (LWP 4201)):
#0 0x00002aaaaace0c8f in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1 0x000000000044947c in watchdog_thread (arg=<value optimized out>) at
watchdog.c:307
#2 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#3 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#4 0x0000000000000000 in ?? ()
Thread 3 (Thread 1082132832 (LWP 4941)):
#0 0x00002aaaaace319f in __read_nocancel () from /lib/libpthread.so.0
#1 0x0000000000431a35 in read_nbytes (bsock=0x57f0e8,
ptr=0x59e820
"îgª,²\002·/\236\036&/@ó\223\as`g±H»©\025ò÷ɹDê\004\024\032ÛËÖl©c÷\036øÜ0ðd\235É\237\231¯HÎñ\224Ýýi\006\027\b\031òOÒ3â\234´ô\0227gM®Ç\206¶÷eù´eìo\016K\005È\037¬'µz\211ðÿ\001(%¾P",
nbytes=52912) at bnet.c:82
#2 0x0000000000433962 in BSOCK::recv (this=0x57f0e8) at bsock.c:435
#3 0x000000000043114f in bget_msg (sock=0x57f0e8) at bget_msg.c:60
#4 0x000000000040997b in do_append_data (jcr=0x570d28) at append.c:202
#5 0x000000000041ae34 in append_data_cmd (jcr=0x570d28) at fd_cmds.c:194
#6 0x000000000041acd6 in do_fd_commands (jcr=0x570d28) at fd_cmds.c:165
#7 0x000000000041b815 in run_job (jcr=0x570d28) at fd_cmds.c:128
#8 0x000000000041ba15 in run_cmd (jcr=0x570d28) at job.c:210
#9 0x0000000000416bd7 in handle_connection_request (arg=<value optimized out>)
at dircmd.c:229
#10 0x000000000044a04d in workq_server (arg=<value optimized out>) at
workq.c:357
#11 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#12 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#13 0x0000000000000000 in ?? ()
Thread 2 (Thread 1098918240 (LWP 4945)):
#0 0x00002aaaaace40ca in waitpid () from /lib/libpthread.so.0
#1 0x0000000000446f91 in signal_handler (sig=11) at signal.c:167
#2 <signal handler called>
#3 0x00000000004076c8 in detach_dcr_from_dev (dcr=0x574d38) at acquire.c:693
#4 0x00000000004076f2 in free_dcr (dcr=0x44f0f6) at acquire.c:714
#5 0x000000000042b779 in despool_data (dcr=0x580e28, commit=true) at
spool.c:336
#6 0x000000000042c15a in commit_data_spool (dcr=0x580e28) at spool.c:139
#7 0x0000000000409554 in do_append_data (jcr=0x57fd68) at append.c:318
#8 0x000000000041ae34 in append_data_cmd (jcr=0x57fd68) at fd_cmds.c:194
#9 0x000000000041acd6 in do_fd_commands (jcr=0x57fd68) at fd_cmds.c:165
#10 0x000000000041b815 in run_job (jcr=0x57fd68) at fd_cmds.c:128
#11 0x000000000041ba15 in run_cmd (jcr=0x57fd68) at job.c:210
#12 0x0000000000416bd7 in handle_connection_request (arg=<value optimized out>)
at dircmd.c:229
#13 0x000000000044a04d in workq_server (arg=<value optimized out>) at
workq.c:357
#14 0x00002aaaaacde0fa in start_thread () from /lib/libpthread.so.0
#15 0x00002aaaab344ce2 in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
Thread 1 (Thread 46912506654064 (LWP 4195)):
#0 0x00002aaaab33dd76 in select () from /lib/libc.so.6
#1 0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21,
client_wq=0x56ca20,
handle_client_request=0x416690 <handle_connection_request(void*)>) at
bnet_server.c:161
#2 0x0000000000404742 in main (argc=<value optimized out>, argv=<value
optimized out>) at stored.c:265
#0 0x00002aaaab33dd76 in select () from /lib/libc.so.6
#0 0x00002aaaab33dd76 in select () from /lib/libc.so.6
No symbol table info available.
#1 0x000000000043253c in bnet_thread_server (addrs=0x56ea88, max_clients=21,
client_wq=0x56ca20,
handle_client_request=0x416690 <handle_connection_request(void*)>) at
bnet_server.c:161
161 if ((stat = select(maxfd + 1, &sockset, NULL, NULL, NULL)) < 0) {
Current language: auto; currently c++
maxfd = <value optimized out>
sockset = {fds_bits = {16, 0 <repeats 15 times>}}
#2 0x0000000000404742 in main (argc=<value optimized out>, argv=<value
optimized out>) at stored.c:265
265 &dird_workq, handle_connection_request);
ch = <value optimized out>
no_signals = <value optimized out>
test_config = false
thid = 1082132832
uid = 0x0
gid = 0x0
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Register now and save $200. Hurry, offer ends at 11:59 p.m.,
Monday, April 7! Use priority code J8TLD2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone _______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
|