Bacula-users

Re: [Bacula-users] Accurate backup and memory usage

2011-03-19 13:32:51
From: Christian Manal <moenoel AT informatik.uni-bremen DOT de>
To: bacula-users AT lists.sourceforge DOT net
Date: Sat, 19 Mar 2011 18:29:15 +0100
On 18.03.2011 21:37, Martin Simmons wrote:
>>>>>> On Fri, 18 Mar 2011 20:47:03 +0100, Christian Manal said:
>>
>> On 18.03.2011 19:26, Martin Simmons wrote:
>>>>>>>> On Fri, 18 Mar 2011 13:36:36 +0100, Christian Manal said:
>>>>
>>>> On 18.03.2011 13:03, Martin Simmons wrote:
>>>>>>>>>> On Fri, 18 Mar 2011 11:37:33 +0100, Christian Manal said:
>>>>>>
>>>>>> On 18.03.2011 10:40, Christian Manal wrote:
>>>>>> On 16.03.2011 09:14, Christian Manal wrote:
>>>>>>>> On 15.03.2011 19:12, Christian Manal wrote:
>>>>>>>> On 15.03.2011 17:49, Kjetil Torgrim Homme wrote:
>>>>>>>>>> Christian Manal <moenoel AT informatik.uni-bremen DOT de> writes:
>>>>>>>>>>
>>>>>>>>>> Also, after several accurate jobs running without restarting Bacula,
>>>>>>>>>> the total memory usage of the director and fd didn't go up anymore, so
>>>>>>>>>> I presume it comes down to the behavior of Solaris' free(), as
>>>>>>>>>> described in the above quoted manpage.
>>>>>>>>>>
>>>>>>>>>> libumem may work better -- just set LD_PRELOAD, you don't have to
>>>>>>>>>> recompile.  I'd appreciate it if you report back if you try it.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Actually, I already did that. Modified the startup script for the
>>>>>>>> affected fd (don't want the director crashing if things go wrong) and
>>>>>>>> restarted. I will report the results tomorrow.
>>>>>>>>
>>>>>>>> Looks good. 
>>>>>>>
>>>>>> Maybe I spoke too soon. Last night my director crashed with a segfault
>>>>>> after switching to libumem. It was preceded by an unusually long-running
>>>>>> job (the accurate one) which, going by its size, looked like it was doing
>>>>>> a full instead of an incremental for some reason.
>>>>>>>
>>>>>> I have some output from mdb and pstack attached.
>>>>>>
>>>>>> And going by dbx, the dir went kaboom in Jmsg().
>>>>>> ...
>>>>>> =>[1] Jmsg(0xbefe5be0, 0x1, 0x0, 0x0, 0xfee8e25e, 0xf6caddb0), at 0xfee6a580
>>>>>>   [2] j_msg(0x80c360e, 0x154, 0xbefe5be0, 0x1, 0x0, 0x0), at 0xfee6a7ad
>>>>>>   [3] start_storage_daemon_message_thread(0xbefe5be0, 0x80bc7f5, 0xfdc7f960, 0x0, 0x80bc798, 0xfde8fe6c), at 0x80834bc
>>>>>>   [4] do_backup(0xbefe5be0, 0x4, 0x0, 0xfdf91200, 0xfeea26e4, 0xfdf91200), at 0x80658b0
>>>>>>   [5] _ZL10job_threadPv(0xbefe5be0, 0x1, 0xfe7c0dc7, 0xfe8422cc, 0xfe8422c0, 0xfdf91200), at 0x807a96e
>>>>>>   [6] jobq_server(0x80e5080), at 0x807d127
>>>>>>   [7] _thr_setup(0xfdf91200), at 0xfe7c7e66
>>>>>>   [8] _lwp_start(0xfee8e708, 0x0, 0x0, 0xfde8ea00, 0x7, 0x0), at 0xfe7c8150
>>>>>
>>>>> It looks like it ran out of memory (the segfault is deliberate, due to
>>>>> failure to create a thread in start_storage_daemon_message_thread).
>>>>
>>>> That's strange. I'm monitoring that box with Nagios + pnp4nagios.
>>>> Nagios didn't report unusually high memory usage, nor do I see a
>>>> spike on the pnp4nagios graphs for memory and swap.
>>>>
>>>>
>>>>> Did it write any info to the Bacula log?  It should say "Cannot create
>>>>> message thread:" followed by the error message.
>>>>
>>>> The logfile just cleanly ends after the last finished job. But it seems
>>>> to be in the coredump:
>>>>
>>>> core:msgchan.c:340 Cannot create message thread: Resource temporarily unavailable
>>>
>>> "Resource temporarily unavailable" occurs when Solaris can't allocate the
>>> stack for a new thread, so memory pressure is a likely reason.  It may be
>>> invisible to Nagios if the memory is just reserved rather than being in use
>>> (something that malloc implementations will do differently).
>>>
>>
>> Hm.. but this didn't happen until I switched the director to libumem, and
>> the server runs several other services which didn't blow up from running
>> out of memory. So it looks like it has something to do with dir+umem,
>> doesn't it?
> 
> Yes, but changing the memory allocator can have far-reaching consequences.
> How large was the core dump?
> 

1.8G
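
For scale: the director is presumably a 32-bit build (hence the 64-bit
suggestion below), so a 1.8G core would be a large share of what the process
can address at all. A rough back-of-the-envelope sketch in Python, not taken
from Bacula. It assumes the core size roughly tracks the mapped address
space, which is only an approximation, and the function name is made up for
illustration:

```python
import errno
import os

# Upper bound on the address space of a 32-bit process (4 GiB).
# The usable part is smaller in practice; this is the optimistic limit.
ADDRESS_SPACE_32BIT = 2 ** 32


def address_space_share(size_bytes, limit=ADDRESS_SPACE_32BIT):
    """Hypothetical helper: fraction of the address space taken by size_bytes."""
    return size_bytes / limit


if __name__ == "__main__":
    core_size = int(1.8 * 2 ** 30)  # the 1.8G core dump, read as GiB
    print(f"{address_space_share(core_size):.0%}")  # prints 45%

    # The text the OS attaches to EAGAIN, the error behind a failed
    # thread creation; the exact wording depends on the platform.
    print(os.strerror(errno.EAGAIN))
```

With nearly half of the optimistic limit already mapped, fragmentation could
plausibly leave no contiguous region for another thread stack, even though
the box as a whole still has free memory.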


>> I think I may set up a test environment, when I have time, to take a
>> closer look at this issue.
> 
> You could try running pmap to see how the memory layout changes while it is
> doing the backup.
> 
> Also, building Bacula as a 64-bit program might solve it (if you can get all
> of the dependent libraries in 64-bit format).
> 

That's a good pointer. I will try that.
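
A rough sketch of how the pmap sampling could be scripted, in Python. It is
only an illustration under a few assumptions: that `pmap -x <pid>` ends with
a summary line starting with "total" whose first numeric column is the mapped
size in kilobytes, and that the pid of the director is passed on the command
line. The function names and the sampling interval are made up for the
example:

```python
import subprocess
import sys
import time


def parse_pmap_total(output):
    """Return the first numeric column of the 'total' line, or None if absent."""
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0].lower() == "total":
            # Skip the unit column ("Kb"/"kB") and any "-" placeholders.
            numbers = [int(f) for f in fields[1:] if f.isdigit()]
            return numbers[0] if numbers else None
    return None


def sample(pid, interval=60, count=10):
    """Print a timestamped total mapping size every `interval` seconds."""
    for _ in range(count):
        result = subprocess.run(
            ["pmap", "-x", str(pid)],
            capture_output=True, text=True, check=False,
        )
        total_kb = parse_pmap_total(result.stdout)
        print(time.strftime("%H:%M:%S"), total_kb, "KB mapped")
        time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 1:
    sample(int(sys.argv[1]))
```

Running it against the director's pid while the accurate job is active
should show whether the mapped size keeps growing up to the crash.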


Regards,
Christian Manal
