Bacula-users

Re: [Bacula-users] Memory and swap usage, plus duplicated processes

2008-05-28 05:42:29
From: Andy Shellam <andy.shellam-lists AT mailnetwork.co DOT uk>
To: bacula-users AT lists.sourceforge DOT net
Date: Wed, 28 May 2008 10:47:38 +0100
Hi Arno,

Thanks for your reply - my comments are inline.


Quoting Arno Lehmann <al AT its-lehmann DOT de>:

> Hi,
>
> 28.05.2008 00:52, Andy Shellam wrote:
>
> Ah, another Nagios-user mailing list person :-)

Yes, I thought I'd seen your name somewhere else!  I use Nagios to
monitor the Bacula processes, which is how I noticed that this one
machine shows 3 processes instead of 1 for bacula-fd :-)


>> While monitoring the server using top, when the backup starts the free
>> memory drops steadily from 200MB free down to 2MB, then hovers around
>> 2-5MB.  The swap space isn't touched.  5 minutes later the machine dies
>> with a memory allocation error, still with all 512MB swap space free.
>>
>> My server provider tells me this is expected, that because it's a single
>> process eating into all the RAM, the server can't swap it.  Is this
>> true?
>
> Hmm... I'm not sure. Might be. But that shouldn't crash the whole
> machine. If the FD gets killed - ok. But nothing worse should happen IMO.

That's what I thought; I can't believe the kernel would allow this to  
happen.  The start of the kernel stack-trace at the time of the crash  
is:

[2362406.679548] BUG: unable to handle kernel paging request at virtual address 00100104
[2362406.679561]  printing eip:
[2362406.679564] c0175fce
[2362406.679568] 086e9000 -> *pde = 00000000:607e5001
[2362406.679571] 08f38000 -> *pme = 00000000:00000000
[2362406.679575] Oops: 0002 [#1]
[2362406.679577] SMP

etc etc etc!

The virtual machine runs under Xen, and I have access to its virtual  
serial console, which is how I could pick up the stack-trace.
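
For logging the free memory and swap figures in the run-up to a crash like this, rather than watching top by hand, the numbers can be read from /proc/meminfo. A minimal sketch in Python, assuming the usual "Name:   value kB" layout of that file; parse_meminfo and sample are hypothetical helper names, not part of Bacula or any existing tool:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; lines without a numeric value are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample(path="/proc/meminfo"):
    """Return (MemFree, SwapFree) from the given meminfo file, None if absent."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info.get("MemFree"), info.get("SwapFree")
```

Calling sample() every few seconds from cron or a small loop and appending the result to a file on disk would leave a record that survives the crash.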

>
> If that's an old Linux kernel, the behaviour you see might be
> expected; newer ones tend to kill some processes before things get
> really dramatic (the OOM killer, OOM as in "Out Of Memory").
>
>>  I have 2 other servers with the same provider, identical in every
>> way except they have 512MB RAM and 1GB swap, and they both back up just
>> fine.
>>
>> Another thing that is different is that on this troublesome machine, the
>> bacula startup script (/usr/local/bacula/etc/bacula start) starts 3
>> bacula-fd processes, but on all my other machines it only starts 1.  Is
>> there any reason for this?
>
> Are these the same OS versions?

Yep, they're all Debian Linux 4.0.  It is a newer machine, so I guess  
the kernel versions could be different; I'll have a check.  I wouldn't  
mind betting that it's not fully patched come to think of it, so I'll  
try that also.

>
> I suspect that on the small machine an older version of the (linux) OS
> or the ps program is running.
>
> The thing is that threads (aka Light-weight Processes) are sometimes
> displayed as separate processes, and sometimes not, depending on the
> software you use.
>
> For example, on a reasonably new OS:
>
> arno@elf:~> ps -lfLC bacula-fd
> F S UID        PID  PPID   LWP  C NLWP PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
> 1 S root      4718     1  4718  0    2  76   0 - 11968 -       2007 ?        00:00:21 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
> 1 S root      4718     1  4719  0    2  76   0 - 11968 322561  2007 ?        00:00:04 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
>
> This displays the threads separately.
>
> arno@elf:~> ps -lfC bacula-fd
> F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
> 1 S root      4718     1  0  76   0 - 11968 -       2007 ?        00:00:21 /usr/
>
> This does not, but on older software it would display two processes.

I'll try these additional commands on the machine when I get the chance.
I didn't notice that the flags were different (Ss, Ssl, S); I didn't pay
much attention to them, not realising what they were!
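
One way to cross-check this without depending on how a particular ps version displays threads is to count the entries under /proc/<pid>/task, which on Linux 2.6 kernels holds one subdirectory per thread of the process. A minimal sketch in Python, assuming a Linux /proc layout; the name count_threads and its proc_root parameter are made up for illustration:

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Return the number of threads (LWPs) belonging to process `pid`.

    Counts the numeric entries in <proc_root>/<pid>/task. Raises OSError
    if the process does not exist or the kernel does not provide task/.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())
```

For the process in the "ps -lfLC" output quoted above, count_threads(4718) would be expected to match the NLWP column. If each of the three PIDs on the troublesome server reports a single thread, that would suggest they really are separate processes from the kernel's point of view.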

>
>
>> On troublesome server (with backup job disabled):
>>
>> root ~ # ps aux|grep bacula-fd
>> root      3160  0.0  0.5  13200  1392 ?        Ss   21:55   0:00 /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c /usr/local/bacula/etc/bacula-fd.conf
>> root      3162  0.0  0.5  13200  1392 ?        S    21:55   0:00 /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c /usr/local/bacula/etc/bacula-fd.conf
>> root      3163  0.0  0.5  13200  1392 ?        S    21:55   0:00 /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c /usr/local/bacula/etc/bacula-fd.conf
>>
>> On other servers (with backup job live as normal):
>> root ~ # ps aux|grep bacula-fd
>> root     11114  0.0  0.3  27208  1628 ?        Ssl  19:26   0:00 /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c /usr/local/bacula/etc/bacula-fd.conf
>
> See the "l" in the process state? From my ps(1) man page:
>
>> l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
>

Could this be an indication that the threading library/model used on  
the smaller machine is different to my other 2?  If this is the case,  
could this be a contributing factor to the crash?
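
The STAT values seen in the "ps aux" output (Ss, S, Ssl) can also be decoded mechanically. A small Python sketch, assuming procps-style state strings where the first character is the process state and the rest are modifier flags; describe_stat, is_multithreaded and the FLAGS table are illustrative names, with the flag meanings taken from the ps(1) man page:

```python
# Modifier flags that can follow the state letter in the ps STAT column.
FLAGS = {
    "s": "session leader",
    "l": "multi-threaded (CLONE_THREAD)",
    "+": "foreground process group",
}


def describe_stat(stat):
    """Split a STAT string into (state letter, list of known flag descriptions)."""
    state, modifiers = stat[0], stat[1:]
    return state, [FLAGS[c] for c in modifiers if c in FLAGS]


def is_multithreaded(stat):
    """True if the 'l' modifier is present after the state letter."""
    return "l" in stat[1:]
```

With the values from this thread, is_multithreaded("Ssl") is true for the single bacula-fd on the other servers, and false for the "Ss" and "S" entries on the troublesome one.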

>
> ...
>
> Arno
>
> --
> Arno Lehmann
> IT-Service Lehmann
> www.its-lehmann.de
>

Cheers,

Andy
