Bacula-users

Re: [Bacula-users] Need help debugging SD crash

2010-04-06 05:26:20
From: Matija Nalis <mnalis+bacula AT CARNet DOT hr>
To: Robert LeBlanc <robert AT leblancnet DOT us>
Date: Tue, 6 Apr 2010 11:23:44 +0200
On Sun, Apr 04, 2010 at 01:20:49PM -0600, Robert LeBlanc wrote:
> I'm having problems with our SD and tapes being locked in the
> drive occasionally.

How exactly does it manifest? Does the bconsole umount command return
an error, or does the drive remain in some state (check with "status
storage")? Which state and/or error?
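
To capture that state so it can be posted to the list, something like
the following might do. It is only a sketch: the storage name
"LTO-Library" is a made-up placeholder (use the name from your own
configuration), and sd_status is a helper name invented here.

```shell
#!/bin/sh
# Sketch only: "LTO-Library" is a placeholder storage name and
# sd_status is a hypothetical helper; neither comes from the mail.

# Ask the Director for the state of one storage daemon.
# bconsole reads its commands from standard input.
sd_status() {
    printf 'status storage=%s\n' "$1" | bconsole
}

# Example (needs a working bconsole configuration):
#   sd_status LTO-Library > /tmp/sd-status.txt
```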

Have you tried shutting down bacula-sd and ejecting the tape with "mt
eject" and/or "mt offline"? Do they succeed (and the drive ejects)
or do they return an error (and which one)? Double-check that bacula-sd
is down before you try those (they won't work while bacula-sd still
has the drive open).

And if mt(1) also fails, can you eject the tape manually by using the
tape library's eject function and/or pressing the hardware eject button
on the drive itself (depending on the library type...)?

If mt works but bacula-sd doesn't, then you can rule out hardware and
kernel -- it is a Bacula problem (and usually "status storage" will
show it -- if you have more than one drive, it can sometimes deadlock
waiting for a tape that is in the other drive).
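
The mt test above could be sketched roughly like this. The device name
/dev/nst0 and the helper name try_eject are assumptions, not something
from the original report; run it only after bacula-sd has been stopped.

```shell
#!/bin/sh
# Sketch only: /dev/nst0 and try_eject are assumptions; adjust the
# device to whatever your drive actually is.
TAPE=${TAPE:-/dev/nst0}

# Try to eject the tape outside of Bacula and say what happened.
# Prints "busy", "ok: mt <op>" or "failed".
try_eject() {
    dev=$1
    # fuser exits 0 when some process still has the device open,
    # e.g. a bacula-sd that was not really stopped.
    if fuser "$dev" >/dev/null 2>&1; then
        echo "busy"
        return 2
    fi
    # Try both commands mentioned in the mail, in turn.
    for op in offline eject; do
        if mt -f "$dev" "$op"; then
            echo "ok: mt $op"
            return 0
        fi
    done
    echo "failed"
    return 1
}

# Example:
#   try_eject "$TAPE"
```

If this prints "failed", the error text that mt itself wrote to stderr
is the interesting part to report back.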

> At first I thought this might be a problem with our tape
> library.

That still looks like the most probable cause to me: a drive in
the library is having problems. We've had a similar issue with one of
several LTO2 drives in our library; it would (sometimes) take the
tape and refuse to give it back (on "mt eject" and even on pressing
the physical button). It needed power cycling and a long (half a
minute?) button press to make it give the tape back.

After it happened the third time (always the same drive) we kicked it
out of the library. The other drives worked OK all the time.

If the hardware button always works but software commands don't, it
could be the fiber cables and/or the GBIC/SFP (which we refused to
believe at one time because the drives were always detected OK and
worked, albeit sometimes much slower than normal, without any errors
in kernel logs, and would also lock up). You can also try a cleaning
tape.
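
Since such link problems may not show up in the kernel logs, one place
to look is the Fibre Channel error counters in sysfs. This is a sketch
that assumes your HBA driver exposes the standard fc_host statistics
directory; fc_errors is a helper name made up here.

```shell
#!/bin/sh
# Sketch only: assumes the HBA driver provides
# /sys/class/fc_host/host*/statistics; fc_errors is a hypothetical helper.

# Print every link error counter that is not zero.
# Note: some drivers report an all-ones value (0xffff...) for
# counters they do not support; that is not a real error count.
fc_errors() {
    base=${1:-/sys/class/fc_host}
    for f in "$base"/host*/statistics/link_failure_count \
             "$base"/host*/statistics/loss_of_sync_count \
             "$base"/host*/statistics/loss_of_signal_count \
             "$base"/host*/statistics/invalid_crc_count; do
        # Skip counters that do not exist on this system.
        [ -r "$f" ] || continue
        val=$(cat "$f")
        case $val in
            0|0x0) ;;                 # clean counter, nothing to report
            *) echo "$f $val" ;;
        esac
    done
}

# Example:
#   fc_errors
```

Counters that keep growing between two runs would point at the cable
or the SFP rather than at the drive.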

> Then I saw these errors in the syslog. I switched out the Qlogic FC
> adapter thinking that maybe it was just losing all the paths to the drive.

AFAIR you would get different errors if it lost the path completely
(but it is possible for a drive to behave erratically even if it
doesn't lose the path).

> I'm still getting the errors, so I'm not sure where the hangup is. I can't
> tell if it's a bug in the kernel module, mt or bacula. Can someone give me
> some pointers to narrowing this down? This has been happening for over a
> year and through several kernel and bacula versions.
> 
> This is Debian Squeeze
> 
> Linux lsddomainsd 2.6.32-trunk-686 #1 SMP Sun Jan 10 06:32:16 UTC 2010 i686
> GNU/Linux

The "INFO:" messages themselves are just a "normal" feature of newer
2.6.x kernels. They are informational messages only (hence the "INFO:"
prefix) that tell you some system call (like open(2) or write(2) or
read(2)) is taking longer than 120 seconds to complete. They didn't
exist in older kernels.
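
To see which processes those messages are about, a small filter over
the log can help. This is a sketch that assumes the kernel messages
end up in /var/log/syslog; hung_tasks is a helper name invented here.

```shell
#!/bin/sh
# Sketch only: the log path is an assumption and hung_tasks is a
# hypothetical helper.

# Count hung-task reports per task (name:pid) found in a log file.
hung_tasks() {
    log=${1:-/var/log/syslog}
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than .*/\1/p' "$log" |
        sort | uniq -c
}

# Example:
#   hung_tasks /var/log/syslog
```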

The check is there to catch problems with I/O schedulers and
problematic hardware -- but sometimes the timeout needs to be increased
for tape drives (it is quite possible for open(2) or lseek(2) on a tape
to have to rewind it, and that can sometimes take more than two
minutes).

You can raise the current kernel limit (as root) with:
    echo 300 > /proc/sys/kernel/hung_task_timeout_secs

or (to survive a reboot) by putting:
    kernel.hung_task_timeout_secs=300
in /etc/sysctl.conf (or in a file in the /etc/sysctl.d directory).

But note that those will not help with your lockup problems; they just
make the spurious messages go away in cases where long waits are to be
expected.
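
A slightly more careful variant of the echo above, as a sketch: it only
raises the limit, never lowers it, and leaves a value of 0 alone
because 0 means the check is switched off. The function name
raise_hung_timeout is made up here; it has to run as root on a real
system.

```shell
#!/bin/sh
# Sketch only: raise_hung_timeout is a hypothetical helper.

# Raise the hung-task timeout to at least $1 seconds.
# The optional second argument is the file to change.
raise_hung_timeout() {
    want=$1
    file=${2:-/proc/sys/kernel/hung_task_timeout_secs}
    cur=$(cat "$file") || return 1
    # 0 disables the check entirely, so do not touch it;
    # otherwise only change the value if it is lower than wanted.
    if [ "$cur" -ne 0 ] && [ "$cur" -lt "$want" ]; then
        echo "$want" > "$file"
    fi
    # Show the value now in effect.
    cat "$file"
}

# Example:
#   raise_hung_timeout 300
```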


Try the other things in this mail to narrow the problem down to
Bacula, the kernel or the hardware.
