Hi Arno,
Thanks for your reply - my comments are inline.
Quoting Arno Lehmann <al AT its-lehmann DOT de>:
> Hi,
>
> 28.05.2008 00:52, Andy Shellam wrote:
>
> Ah, another Nagios-user mailing list person :-)
Yes, I thought I'd seen your name somewhere else! I use Nagios to
monitor the Bacula processes, hence how my attention was drawn to the
fact that this one machine displays 3 processes instead of 1 for
bacula-fd :-)
>> While monitoring the server using top, when the backup starts the free
>> memory drops steadily from 200MB free down to 2MB, then hovers around
>> 2-5MB. The swap space isn't touched. 5 minutes later the machine dies
>> with a memory allocation error, still with all 512MB swap space free.
>>
>> My server provider tells me this is expected, that because it's a single
>> process eating into all the RAM, the server can't swap it. Is this
>> true?
>
> Hmm... I'm not sure. Might be. But that shouldn't crash the whole
> machine. If the FD gets killed - ok. But nothing worse should happen IMO.
That's what I thought; I can't believe the kernel would allow this to
happen. The start of the kernel stack-trace at the time of the crash
is:
[2362406.679548] BUG: unable to handle kernel paging request at virtual address 00100104
[2362406.679561] printing eip:
[2362406.679564] c0175fce
[2362406.679568] 086e9000 -> *pde = 00000000:607e5001
[2362406.679571] 08f38000 -> *pme = 00000000:00000000
[2362406.679575] Oops: 0002 [#1]
[2362406.679577] SMP
etc etc etc!
The virtual machine runs under Xen, and I have access to its virtual
serial console, which is how I could pick up the stack-trace.
>
> If that's an old Linux kernel, the behaviour you see might be
> expected; newer ones tend to kill some processes before things get
> really dramatic (OOM killer as "Out Of Memory").
>
>> I have 2 other servers with the same provider, identical in every
>> way except they have 512MB RAM and 1GB swap, and they both back up just
>> fine.
>>
>> Another thing that is different is that on this troublesome machine, the
>> bacula startup script (/usr/local/bacula/etc/bacula start) starts 3
>> bacula-fd processes, but on all my other machines it only starts 1. Is
>> there any reason for this?
>
> Are these the same OS versions?
Yep, they're all Debian Linux 4.0. It is a newer machine, so I guess
the kernel versions could be different; I'll have a check. I wouldn't
mind betting that it's not fully patched come to think of it, so I'll
try that also.
>
> I suspect that on the small machine an older version of the (linux) OS
> or the ps program is running.
>
> The thing is that threads (aka Light-weight Processes) are sometimes
> displayed as separate processes, and sometimes not, depending on the
> software you use.
>
> For example, on a reasonably new OS:
>
> arno@elf:~> ps -lfLC bacula-fd
> F S UID PID PPID LWP C NLWP PRI NI ADDR SZ WCHAN STIME
> TTY TIME CMD
> 1 S root 4718 1 4718 0 2 76 0 - 11968 - 2007 ?
> 00:00:21 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
> 1 S root 4718 1 4719 0 2 76 0 - 11968 322561 2007 ?
> 00:00:04 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf
>
> This displays the threads separately.
>
> arno@elf:~> ps -lfC bacula-fd
> F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY
> TIME CMD
> 1 S root 4718 1 0 76 0 - 11968 - 2007 ?
> 00:00:21 /usr/
>
> This does not, but on older software it would display two processes.
I'll try these additional commands on the machine when I get a chance.
I didn't notice the flags were different (Ss, Ssl, S), but I didn't pay
much attention to them, not realising what they were!
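As a cross-check that doesn't depend on which ps version is installed,
here is a rough sketch of how I might count threads per PID straight
from /proc and test a STAT string for the "l" flag. It assumes a Linux
/proc filesystem with per-process task directories (2.6 kernels); the
helper names thread_count and is_multithreaded are my own, not anything
from Bacula or ps.

```python
import os


def thread_count(pid):
    """Return the number of threads (LWPs) the kernel reports for a PID."""
    # Each entry under /proc/<pid>/task is one thread of that process.
    return len(os.listdir("/proc/%d/task" % pid))


def is_multithreaded(stat):
    """True if a ps STAT string such as 'Ssl' carries the 'l' flag."""
    # Lowercase 'l' marks a multi-threaded process; uppercase 'L' means
    # locked pages, so the comparison is deliberately case-sensitive.
    return "l" in stat


if __name__ == "__main__":
    # Example: report the thread count of this script's own process.
    print(thread_count(os.getpid()))
```

If each of the three bacula-fd PIDs on the small machine reports a
single thread while the one PID on the other servers reports several,
that would suggest the same threads are simply being presented as
separate processes.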
>
>
>> On troublesome server (with backup job disabled):
>>
>> root ~ # ps aux|grep bacula-fd
>> root 3160 0.0 0.5 13200 1392 ? Ss 21:55 0:00
>> /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c
>> /usr/local/bacula/etc/bacula-fd.conf
>> root 3162 0.0 0.5 13200 1392 ? S 21:55 0:00
>> /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c
>> /usr/local/bacula/etc/bacula-fd.conf
>> root 3163 0.0 0.5 13200 1392 ? S 21:55 0:00
>> /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c
>> /usr/local/bacula/etc/bacula-fd.conf
>>
>> On other servers (with backup job live as normal):
>> root ~ # ps aux|grep bacula-fd
>> root 11114 0.0 0.3 27208 1628 ? Ssl 19:26 0:00
>> /usr/local/bacula/sbin/bacula-fd -u root -g root -v -c
>> /usr/local/bacula/etc/bacula-fd.conf
>
> See the "l" in the process state? From my ps(1) man page:
>
>> l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
>
Could this be an indication that the threading library/model used on
the smaller machine is different to my other 2? If this is the case,
could this be a contributing factor to the crash?
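To compare the machines, I could ask glibc which threading library it
is using. The sketch below is only that, a sketch: it assumes a glibc
system where Python's os.confstr knows the CS_GNU_LIBPTHREAD_VERSION
name (the same value "getconf GNU_LIBPTHREAD_VERSION" prints), and the
function name pthread_implementation is made up for illustration.

```python
import os


def pthread_implementation():
    """Return glibc's threading library string, or None if unavailable."""
    try:
        # On glibc this names the threading implementation in use.
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError):
        # The name is unknown on this platform, or the lookup failed.
        return None


if __name__ == "__main__":
    print(pthread_implementation())
```

I would expect the result to start with either NPTL or linuxthreads. As
far as I understand it, the older LinuxThreads model runs each thread as
its own process, so a different answer on the small machine would at
least explain the three bacula-fd entries, though not necessarily the
crash.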
>
> ...
>
> Arno
>
> --
> Arno Lehmann
> IT-Service Lehmann
> www.its-lehmann.de
>
Cheers,
Andy
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users