Veritas-bu

[Veritas-bu] DLT drives going down

2006-01-04 09:09:48
Subject: [Veritas-bu] DLT drives going down
From: layne.barber.ctr AT csd.disa DOT mil (Barber, Layne (Contractor))
Date: Wed, 4 Jan 2006 08:09:48 -0600

We have an issue with drives randomly going down every night. The
environment is NBU 5.0 MP5 on HP-UX 11.11, with an STK L180 and an
STK 3400 SCSI bridge.

For some reason, one or more drives go down at random every night while
backups run. It is different tapes and different drives each time.
Backups will be running fine and then drives begin to go down. These are
SDLT320 drives. Once they go down, you can't use robtest to move the
tapes (medium not present error) or use the robtest unload command
(device not present).

If we power cycle the SCSI bridge, we can talk to the drives and do
whatever we want. STK is claiming that something coming from the host is
"polling" the library from the physical layer (we assume the HBA). We
have had the SA for the master/media server disable any polling and load
the latest patches from HP, to no avail. We have also changed from auto
index to a manual map index.

This was working from the end of June until the second week of October.

Thoughts/suggestions?

Log snippets from last night:

syslog entries
Jan  4 05:37:42 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0801c0 I/O error during close
Jan  4 05:50:10 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0801c0 I/O error during close
Jan  4 11:27:52 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0800c0 I/O error during close
Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) key = 0x5, asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT
Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) Move_medium error
Jan  4 11:34:36 ujachr01 tldd[4233]: TLD(1) drive 5 (device 4) is being DOWNED, status: Robotic dismount failure
Jan  4 11:34:36 ujachr01 tldd[4233]: Check integrity of the drive, drive path, and media

drive 5 (addr 504) access = 0 Contains Cartridge = yes
Source address = 1119 (slot 120)
Barcode = JA1156
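
For anyone reading the tldcd lines above, the key/asc/ascq triple is standard SCSI sense data. The sketch below is not part of NetBackup; it is a small illustrative Python helper (the names SENSE_KEYS, ASC_ASCQ and describe_sense are made up here) that pulls the three values out of an unwrapped log line and looks them up in a deliberately partial table of well-known codes. It assumes the log line has the same "key = ..., asc = ..., ascq = ..." layout as shown in this post.

```python
import re

# Partial tables of standard SCSI sense codes; only a few common entries.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE",
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x3A, 0x00): "MEDIUM NOT PRESENT",
    (0x3B, 0x0D): "MEDIUM DESTINATION ELEMENT FULL",
    (0x3B, 0x0E): "MEDIUM SOURCE ELEMENT EMPTY",
}

# Matches the "key = 0x5, asc = 0x3a, ascq = 0x0" layout seen in this post.
SENSE_RE = re.compile(
    r"key = (0x[0-9a-f]+), asc = (0x[0-9a-f]+), ascq = (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def describe_sense(line):
    """Return (sense key text, ASC/ASCQ text) for a log line, or None."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    key, asc, ascq = (int(g, 16) for g in m.groups())
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ"),
    )


line = ("Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) key = 0x5, "
        "asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT")
print(describe_sense(line))
```

Worth noting when comparing the two outputs: the robot answers the move with MEDIUM NOT PRESENT while the robtest status for the same drive reports Contains Cartridge = yes.
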

Jan  4 11:55:12 ujachr01 tldcd[19684]: TLD(1) key = 0x5, asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT
Jan  4 11:55:12 ujachr01 tldcd[19684]: TLD(1) Move_medium error
Jan  4 11:55:12 ujachr01 tldd[4233]: TLD(1) drive 1 (device 0) is being DOWNED, status: Robotic dismount failure
Jan  4 11:55:12 ujachr01 tldd[4233]: Check integrity of the drive, drive path, and media

drive 1 (addr 500) access = 0 Contains Cartridge = yes
Source address = 1106 (slot 107)
Barcode = JA1064
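
Since the failures hit different drives each night, it may help to tally the "is being DOWNED" events per drive over a longer stretch of syslog. The following is only a rough Python sketch under stated assumptions: the regular expression is modelled on the two tldd lines in this post (other NetBackup versions may word them differently), the lines must be unwrapped first, and the names DOWNED_RE and downed_events are illustrative.

```python
import re
from collections import Counter

# Pattern modelled on the tldd "is being DOWNED" lines quoted in this post.
DOWNED_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ tldd\[\d+\]: "
    r"TLD\((?P<robot>\d+)\) drive (?P<drive>\d+) \(device (?P<device>\d+)\) "
    r"is being DOWNED, status: (?P<status>.+)$"
)


def downed_events(lines):
    """Yield one dict per 'is being DOWNED' line; skip everything else."""
    for line in lines:
        m = DOWNED_RE.match(line.strip())
        if m:
            yield m.groupdict()


# Sample input: the two DOWNED lines from the post, plus one unrelated line.
sample = [
    "Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) Move_medium error",
    "Jan  4 11:34:36 ujachr01 tldd[4233]: TLD(1) drive 5 (device 4) "
    "is being DOWNED, status: Robotic dismount failure",
    "Jan  4 11:55:12 ujachr01 tldd[4233]: TLD(1) drive 1 (device 0) "
    "is being DOWNED, status: Robotic dismount failure",
]

events = list(downed_events(sample))
per_drive = Counter((e["robot"], e["drive"]) for e in events)

for e in events:
    print(e["ts"], "robot", e["robot"], "drive", e["drive"], "-", e["status"])
print(dict(per_drive))
```

To run it against a real log, replace `sample` with the lines of the syslog file (on HP-UX this is normally /var/adm/syslog/syslog.log). A count spread evenly across drives would point away from any single drive or tape and toward something shared, such as the bridge or the host path.
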


