[Veritas-bu] DLT drives going down

2006-01-04 09:52:25
Subject: [Veritas-bu] DLT drives going down
From: M.W.Ellwood AT rl.ac DOT uk (Ellwood, MW (Mike))
Date: Wed, 4 Jan 2006 14:52:25 -0000

Different hardware and software, but when we had a problem with drives
going down, it was caused by problems on the robot. Power cycling the
robot would usually get the robot going again, but it would often leave
the drive(s) in a DOWN condition, and you had to manually UP them with
vmoprcmd. (This was on a Sun L20 robot, Solaris 10.) I developed a little
script to UP any drives it found in a DOWN condition.
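
A minimal sketch of what such a script might look like, not the actual script referred to above. It assumes that `vmoprcmd -d ds` prints one row per drive with the drive index in the first column and the control mode (for example `DOWN` or `DOWN-TLD`) in the third, and that `vmoprcmd -up <index>` brings a drive back up. The install path, the column layout, and the function names are all assumptions to check against your own NetBackup release.

```python
import subprocess

# Assumed default install path; adjust for your system.
VMOPRCMD = "/usr/openv/volmgr/bin/vmoprcmd"


def find_down_drives(status_text):
    """Return the indexes of drives whose control column starts with DOWN.

    Assumes drive rows look like "<index> <type> <control> ...";
    header and blank lines are skipped because they do not start with a number.
    """
    down = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[2].startswith("DOWN"):
            down.append(int(fields[0]))
    return down


def up_down_drives(dry_run=True):
    """List drive status, then UP every drive reported as DOWN."""
    status = subprocess.run(
        [VMOPRCMD, "-d", "ds"], capture_output=True, text=True, check=True
    ).stdout
    for index in find_down_drives(status):
        cmd = [VMOPRCMD, "-up", str(index)]
        if dry_run:
            # Only report what would be done.
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `up_down_drives()` with the default `dry_run=True` only prints the commands, which is a safer first step than blindly bringing drives up, since a drive that is DOWN because of a real hardware fault will most likely just go down again.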

(I think the problem in the robot was with the picker arm, and it went
away when that was replaced.)

Sorry, may not be too relevant to your problem, but I throw it into the
mix in case it triggers any thoughts.

Regards,
Mike

        -----Original Message-----
        From: veritas-bu-admin AT mailman.eng.auburn DOT edu
        [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
        Barber, Layne (Contractor)
        Sent: 04 January 2006 14:10
        To: veritas-bu AT mailman.eng.auburn DOT edu
        Subject: [Veritas-bu] DLT drives going down

        We have an issue with drives randomly going down every night. Setup:
        NBU 5.0 MP5, HP-UX 11.11, STK L180 with an STK 3400 SCSI bridge.

        For some reason, one or more drives go down at random every night
        when backups run. Different tapes and different drives. Backups will
        be running fine and then drives begin to go down. These are SDLT320
        drives. Once they go down, you can't use robtest to move the tapes
        (medium not present error) or use the robtest unload command (device
        not present).

        If we power cycle the SCSI bridge, we can talk to the drives and do
        whatever we want. STK is claiming that something coming from the
        host is "polling" the library from the physical layer (we assume the
        HBA). We have had the SA for the master/media server disable any
        polling and load the latest patches from HP, to no avail. We have
        also changed from auto index to a manual map index.

        This was working from the end of June up until the second week in
        October.

        Thoughts/suggestions?

        Log snippets from last night:

        syslog entries
        Jan  4 05:37:42 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0801c0 I/O error during close
        Jan  4 05:50:10 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0801c0 I/O error during close
        Jan  4 11:27:52 ujachr01 vmunix: SCSI TAPE: dev = 0xcd0800c0 I/O error during close
        Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) key = 0x5, asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT
        Jan  4 11:34:36 ujachr01 tldcd[18968]: TLD(1) Move_medium error
        Jan  4 11:34:36 ujachr01 tldd[4233]: TLD(1) drive 5 (device 4) is being DOWNED, status: Robotic dismount failure
        Jan  4 11:34:36 ujachr01 tldd[4233]: Check integrity of the drive, drive path, and media

        drive 5 (addr 504) access = 0 Contains Cartridge = yes
        Source address = 1119 (slot 120)
        Barcode = JA1156

        Jan  4 11:55:12 ujachr01 tldcd[19684]: TLD(1) key = 0x5, asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT
        Jan  4 11:55:12 ujachr01 tldcd[19684]: TLD(1) Move_medium error
        Jan  4 11:55:12 ujachr01 tldd[4233]: TLD(1) drive 1 (device 0) is being DOWNED, status: Robotic dismount failure
        Jan  4 11:55:12 ujachr01 tldd[4233]: Check integrity of the drive, drive path, and media

        drive 1 (addr 500) access = 0 Contains Cartridge = yes
        Source address = 1106 (slot 107)
        Barcode = JA1064
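
Since the drives are described as going down at random, tallying the tldd "is being DOWNED" messages per drive over several nights could show whether the failures really are evenly spread or cluster on particular drives or times. The following is only a sketch: it assumes each syslog entry is on a single line in the format of the snippets above, and the regular expression and the function name `downed_events` are illustrative, not part of NetBackup.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Jan  4 11:34:36 ujachr01 tldd[4233]: TLD(1) drive 5 (device 4) is being DOWNED, status: Robotic dismount failure
DOWNED = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+tldd\[\d+\]:\s+"
    r"TLD\(\d+\)\s+drive\s+(?P<drive>\d+)\s+\(device\s+\d+\)\s+"
    r"is being DOWNED, status:\s+(?P<status>.+)$"
)


def downed_events(lines):
    """Group DOWNED events by drive number as (timestamp, status) tuples."""
    events = defaultdict(list)
    for line in lines:
        match = DOWNED.match(line)
        if match:
            drive = int(match.group("drive"))
            events[drive].append((match.group("ts"), match.group("status").strip()))
    return dict(events)
```

Feeding it the lines of the syslog file (for example `downed_events(open(path))`, where `path` is wherever your system keeps syslog) gives a dictionary keyed by drive number that can be compared from night to night.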

