Bacula-users

Re: [Bacula-users] Client backups crash director until full backup is run -- UPDATE

2009-08-25 04:34:39
Subject: Re: [Bacula-users] Client backups crash director until full backup is run -- UPDATE
From: Radovan Mzik <radovan.mzik AT linuxbox DOT cz>
To: Corey Shaw <cshaw AT q90 DOT com>
Date: Tue, 25 Aug 2009 10:09:36 +0200 (CEST)
Hello,

It seems that I've encountered a similar problem. For the last two days
the director has hung when starting a backup of a random client. The
director process is running in two instances, but it is not possible to
connect from bconsole. Fortunately I had one bconsole already running,
so please find the last few error messages below.

The backup for the same client that was running when the director hung
finishes successfully after a director restart.

I'm running:

CentOS 5 x86_64
Bacula 3.0.2 (fd, sd and dir are all the same version)
Accurate backups enabled for all clients

Last message from bconsole:

25-Aug 00:23 databox-dir JobId 1422: Start Backup JobId 1422, Job=Dev1Daily.2009-08-25_00.05.01_04
25-Aug 00:23 databox-dir JobId 1422: Using Device "FileStorage"
25-Aug 00:23 databox-dir: ABORTING due to ERROR in smartall.c:196
double free from smartall.c:330

Backtrace from parent bacula-dir process:

#0  0x00007f96b84d6e74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f96b84d2874 in _L_lock_106 () from /lib64/libpthread.so.0
#2  0x00007f96b84d22e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96b92caf32 in lmgr_p () from /usr/lib64/libbac.so.1
#4  0x00007f96b92af792 in sm_get_pool_memory () from /usr/lib64/libbac.so.1
#5  0x00007f96b92a4f83 in new_jcr () from /usr/lib64/libbac.so.1
#6  0x000000000043786d in wait_for_next_job ()
#7  0x000000000040e9d6 in main ()

Backtrace from child bacula-dir process:

#0  0x00007f96b84d6e74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f96b84d2874 in _L_lock_106 () from /lib64/libpthread.so.0
#2  0x00007f96b84d22e0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f96b92caf32 in lmgr_p () from /usr/lib64/libbac.so.1
#4  0x00007f96b92af792 in sm_get_pool_memory () from /usr/lib64/libbac.so.1
#5  0x000000000041303c in berrno::berrno ()
#6  0x00007f96b92b9a68 in signal_handler () from /usr/lib64/libbac.so.1
#7  <signal handler called>
#8  0x00007f96b92acbe7 in e_msg () from /usr/lib64/libbac.so.1
#9  0x00007f96b92ba79e in sm_free () from /usr/lib64/libbac.so.1
#10 0x00007f96b92baa7c in sm_realloc () from /usr/lib64/libbac.so.1
#11 0x00007f96b92ae39d in sm_realloc_pool_memory () from /usr/lib64/libbac.so.1
#12 0x00007f96b92ae563 in sm_check_pool_memory_size () from /usr/lib64/libbac.so.1
#13 0x00007f96b92aec7e in pm_strcat () from /usr/lib64/libbac.so.1
#14 0x00007f96b9a20c4c in db_get_int_handler () from /usr/lib64/libbacsql.so.1
#15 0x00007f96b9a28f2d in db_sql_query () from /usr/lib64/libbacsql.so.1
#16 0x00007f96b9a20f02 in db_accurate_get_jobids () from /usr/lib64/libbacsql.so.1
#17 0x0000000000412635 in send_accurate_current_files ()
#18 0x0000000000412cc7 in do_backup ()
#19 0x000000000042835c in job_thread ()
#20 0x0000000000429eac in jobq_server ()
#21 0x00007f96b84d0367 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f96b710e09d in clone () from /lib64/libc.so.6
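
A possible reading of the child backtrace, offered as an assumption
rather than a confirmed diagnosis (it has not been checked against the
Bacula source): frame #11 (sm_realloc_pool_memory) appears to hold a
pool-memory mutex when sm_free reports the double free. The signal
handler then runs in the same thread and calls sm_get_pool_memory
(frame #4), which waits for a mutex that its own thread may already
hold. The parent is blocked in sm_get_pool_memory as well, which might
be why bconsole cannot connect. The Python sketch below only
illustrates that pattern. Every name in it is a hypothetical stand-in,
not Bacula code, and it uses a non-blocking acquire so that it reports
the conflict instead of hanging.

```python
import threading

# Hypothetical stand-in for the pool-memory mutex.
# threading.Lock is non-recursive: the owning thread cannot take it twice.
pool_mutex = threading.Lock()


def get_pool_memory():
    """Stand-in for sm_get_pool_memory(): needs the pool mutex."""
    # Non-blocking attempt, so the sketch reports the conflict
    # instead of waiting forever as a blocking lock call would.
    if not pool_mutex.acquire(blocking=False):
        return "would block"
    try:
        return "allocated"
    finally:
        pool_mutex.release()


def error_path():
    """Stand-in for signal_handler() -> berrno::berrno()."""
    return get_pool_memory()


def realloc_pool_memory(corrupted):
    """Stand-in for sm_realloc_pool_memory(): holds the mutex while it works."""
    with pool_mutex:
        if corrupted:
            # The error is handled while the mutex is still held.
            return error_path()
        return "resized"


print(realloc_pool_memory(corrupted=False))  # normal resize
print(realloc_pool_memory(corrupted=True))   # error path while lock is held
print(get_pool_memory())                     # lock is free again afterwards
```

The second call reports "would block" because the error path runs while
the caller still holds the lock. With a blocking acquire it would wait
forever, which is consistent with frames #0 to #4 above.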


I hope this helps to identify the cause of the problem.

Many thanks in advance

Radovan

On Fri, 21 Aug 2009, Corey Shaw wrote:

> I think that I finally found what was causing the problem.  As soon as I
> turned off accurate backups, everything started working fine.  I don't
> know what changed to make those be an issue, but I've got around the
> problem for now.  Thanks for the idea though.
>
> _____________________
> Corey Shaw
> Technology Specialist
> O. 801.491.0705 (x. 157)
> F. 801.491.8774
>
> Winner of the 2009 Utah Work/Life Award
>
> ----- Original Message -----
> From: "Alan Brown" <ajb2 AT mssl.ucl.ac DOT uk>
> To: "Corey Shaw" <cshaw AT q90 DOT com>
> Cc: bacula-users AT lists.sourceforge DOT net
> Sent: Friday, August 21, 2009 4:07:04 AM GMT -07:00 US/Canada Mountain
> Subject: Re: [Bacula-users] Client backups crash director until full backup is run -- UPDATE
>
> On Wed, 19 Aug 2009, Corey Shaw wrote:
>
>> I ran the memtest on our bacula server last night. After 14 hours and 8
>> passes it didn't find any problems. I'm at the end of my rope here. I'm
>> trying a new virtual server to see if that fixes the issue.
>
> I've found I can reliably trigger faults on a few mainboards by asking
> memtest to use "bios all" or "probe memory" options instead of just using
> the default settings.
>
> Intel i865 boards are particularly prone to this kind of issue and it
> manifests in a running system as lockups under extremely heavy IO load.
>
> AB

-- 
Radovan MZIK
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 737 238 588
jabber: mzik AT gw.lbox DOT cz
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis AT linuxbox DOT cz


_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users