Veritas-bu

[Veritas-bu] SuperDLT's and L180 - bad tape, drives, cables, host cards or just a curse from small leprechauns

2002-09-23 09:39:10
Subject: [Veritas-bu] SuperDLT's and L180 - bad tape, drives, cables, host cards or just a curse from small leprechauns
From: jason.cordes AT letigre DOT com (jason.cordes AT letigre DOT com)
Date: Mon, 23 Sep 2002 08:39:10 -0500
I have not done any installs on Solaris, only Tru64 and Windows, but all of my
installs have used SDLT libraries. Since I started working with NetBackup
last year I've had nothing but frustration with downed drives. After this
latest revision of the SDLT drive firmware, though, I believe it was the
drives, not NetBackup, that were causing me the trouble.

Since I upgraded to firmware version 45 I haven't seen any problems
whatsoever with downed drives, or any errors at all other than a bad tape now
and then.

Jason Cordes
LeTigre Computing
713.681.8844
jason.cordes AT letigre DOT com

-----Original Message-----
From: Jason Alexander [mailto:Jason.Alexander AT ncfe DOT com]
Sent: Wednesday, September 18, 2002 2:56 PM
To: 'veritas-bu AT mailman.eng.auburn DOT edu'
Subject: [Veritas-bu] SuperDLT's and L180 - bad tape, drives, cables, host
cards or just a curse from small leprechauns

Greetings,

  We have an L180 with seven SDLT drives and 84 slots, running NetBackup DC
3.4GA on Solaris 8.  The SDLT drives are at code revision 38.  Mount/dismount
loops run from the robot do not fail.  There are 3 dual-channel HVD SCSI host
cards in a Sun E420R.  Three drives have dedicated channels, and two of the
channels have two drives each.  The drives were purchased in March 2002.  I am
trying to find any information that could help me pinpoint a solution to our
L180 failures.  I have tried switching SCSI cables (no change), power
cycling the drives/host/L180 (no change), and replacing two of the 7 drives
(the first of which is now dead after we had to disassemble it to get a tape
out).  The Solaris host is on a different power circuit than the L180, if
that matters.  Any pointers would be appreciated.  Thanks!

  We have been getting a lot of SCSI errors, and drives are being DOWN'ed by
the tldd process.  There aren't any errors in the /usr/openv/netbackup/logs
subdirectories to help with troubleshooting; all of the errors end up in
/var/adm/messages through syslog.  On average, 2 of the 7 drives are
DOWN'ed every night, and this past weekend 4 drives failed, with 1
drive fatally wounded by a jammed tape.  The leader was off of the drive and
the tape was a little blemished and twisted.  STK is replacing that
particular drive.  However, other problems persist.  In the past week, drive
0 (st21) and drive 6 (st53) have been down almost every day; last night was
the exception, when only drive 6 went down.
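
  For anyone who wants to tally these DOWN events rather than eyeball them,
here is a minimal sketch in Python 3.  It assumes the tldd message is worded
as in the log sample in this message, and that each syslog message sits on
one unwrapped line, as it does in /var/adm/messages itself.  The names
count_downed, DOWNED_RE and SAMPLE are made up for illustration; none of
this is part of NetBackup.

```python
import re
from collections import Counter

# Pattern for the tldd "is being DOWNED" message as worded in this thread's
# syslog sample (an assumption; other releases may word it differently).
DOWNED_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d+)\s+\S+\s+\S+\s+tldd\[\d+\]:.*"
    r"TLD\((?P<robot>\d+)\) drive (?P<drive>\d+) \(device \d+\) "
    r"is being DOWNED, status: (?P<status>.+)$"
)


def count_downed(lines):
    """Count DOWNED events keyed by (date, drive number, status text)."""
    counts = Counter()
    for line in lines:
        m = DOWNED_RE.match(line.rstrip("\n"))
        if m:
            date = " ".join(m["date"].split())  # normalise "Sep  6" to "Sep 6"
            counts[(date, int(m["drive"]), m["status"].strip())] += 1
    return counts


# Two lines taken from the sample in this message, one message per line.
SAMPLE = [
    "Sep 16 15:19:17 backserv tldd[240]: [ID 821045 daemon.error] TLD(0) "
    "drive 1 (device 0) is being DOWNED, status: Unable to SCSI unload drive",
    "Sep 16 18:24:20 backserv tldd[240]: [ID 988737 daemon.error] TLD(0) "
    "drive 3 (device 2) is being DOWNED, status: Robotic dismount failure",
]

for (date, drive, status), n in sorted(count_downed(SAMPLE).items()):
    print(f"{date}  drive {drive}  x{n}  {status}")
```

  To run it against the real log, pass an open file handle for
/var/adm/messages to count_downed() in place of SAMPLE.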

  I only have background information going back to the beginning of August,
when I started working on this issue.  SCSI timeout errors are rare
in the mix of messages, but there are many load/write/read/unload
errors.  Here is a recent sample from this weekend:

// VERY common
Sep 16 15:17:53 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@2,1/st@0,0 (st21):
Sep 16 15:17:53 backserv        Error for Command: write file mark    Error Level: Fatal
Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  Requested Block: 1    Error Block: 1
Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  Vendor: QUANTUM    Serial Number:  $  < i
Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  Sense Key: Media Error
Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  ASC: 0x81 (<vendor unique code 0x81>), ASCQ: 0x0, FRU: 0x0
Sep 16 15:17:58 backserv ltid[233]: [ID 560357 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=0
Sep 16 15:17:58 backserv tldd[240]: [ID 785360 daemon.notice] TLD(0) DismountTape 000725 from drive 1

// VERY common
Sep 17 08:53:09 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@2,1/st@0,0 (st21):
Sep 17 08:53:09 backserv        Error for Command: write    Error Level: Fatal
Sep 17 08:53:09 backserv scsi: [ID 107833 kern.notice]  Requested Block: 17031    Error Block: 17031
Sep 17 08:53:09 backserv scsi: [ID 107833 kern.notice]  Vendor: QUANTUM    Serial Number:  $  < i
Sep 17 08:53:09 backserv scsi: [ID 107833 kern.notice]  Sense Key: Media Error
Sep 17 08:53:09 backserv scsi: [ID 107833 kern.notice]  ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0
Sep 17 08:56:10 backserv ltid[233]: [ID 560357 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=0
Sep 17 08:56:10 backserv tldd[240]: [ID 729286 daemon.notice] TLD(0) DismountTape 000661 from drive 1
Sep 17 08:56:37 backserv ltid[233]: [ID 560360 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=3


// fairly common
Sep 16 15:18:00 backserv tldcd[9497]: [ID 355708 daemon.notice] TLD(0) opening robotic path /dev/sg/c2t0l0
Sep 16 15:18:02 backserv tldcd[9497]: [ID 559680 daemon.notice] TLD(0) closing/unlocking robotic path
Sep 16 15:19:17 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@2,1/st@0,0 (st21):
Sep 16 15:19:17 backserv        Error for Command: load/start/stop    Error Level: Fatal
Sep 16 15:19:17 backserv scsi: [ID 107833 kern.notice]  Requested Block: 0    Error Block: 0
Sep 16 15:19:17 backserv scsi: [ID 107833 kern.notice]  Vendor: QUANTUM    Serial Number:  $  < i
Sep 16 15:19:17 backserv scsi: [ID 107833 kern.notice]  Sense Key: Media Error
Sep 16 15:19:17 backserv scsi: [ID 107833 kern.notice]  ASC: 0x81 (<vendor unique code 0x81>), ASCQ: 0x0, FRU: 0x0
Sep 16 15:19:17 backserv tldd[9438]: [ID 861947 daemon.error] TLD(0) unload failed in io_open, I/O error[5]
Sep 16 15:19:17 backserv tldd[240]: [ID 976563 daemon.notice] DecodeDismount(): TLD(0) drive 1, Actual status: Unable to SCSI unload drive
Sep 16 15:19:17 backserv tldd[240]: [ID 821045 daemon.error] TLD(0) drive 1 (device 0) is being DOWNED, status: Unable to SCSI unload drive
Sep 16 15:19:17 backserv tldd[240]: [ID 229259 daemon.error] Check integrity of the drive, drive path, and media

// uncommon, but present:
Sep 16 18:23:21 backserv        Disconnected command timeout for Target 2.0
Sep 16 18:23:21 backserv genunix: [ID 408822 kern.info] NOTICE: glm5: fault detected in device; service still available
Sep 16 18:23:21 backserv genunix: [ID 611667 kern.info] NOTICE: glm5: Disconnected command timeout for Target 2.0
Sep 16 18:23:21 backserv glm: [ID 160360 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6016]
Sep 16 18:23:21 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5,1/st@2,0 (st37):
Sep 16 18:23:21 backserv        SCSI transport failed: reason 'timeout': giving up
Sep 16 18:23:24 backserv ltid[233]: [ID 560359 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=2
Sep 16 18:23:24 backserv tldd[240]: [ID 275090 daemon.notice] TLD(0) DismountTape 000711 from drive 3
Sep 16 18:23:25 backserv ltid[233]: [ID 527591 daemon.notice] LTID - Sent ROBOTIC request, Type=1, Param2=2
Sep 16 18:24:19 backserv tldcd[251]: [ID 858764 daemon.notice] Processing UNMOUNT, TLD(0) drive 3, slot 22, barcode 000711  , vsn 000711
Sep 16 18:24:19 backserv tldcd[14750]: [ID 355708 daemon.notice] TLD(0) opening robotic path /dev/sg/c2t0l0
Sep 16 18:24:19 backserv tldcd[14750]: [ID 559120 daemon.notice] TLD(0) initiating MOVE_MEDIUM from addr 502 to addr 1021
Sep 16 18:24:19 backserv tldcd[14750]: [ID 183166 daemon.error] TLD(0) key = 0x5, asc = 0x3a, ascq = 0x0, MEDIUM NOT PRESENT
Sep 16 18:24:19 backserv tldcd[14750]: [ID 841094 daemon.error] TLD(0) Move_medium error: CHECK CONDITION
Sep 16 18:24:19 backserv tldcd[14750]: [ID 559680 daemon.notice] TLD(0) closing/unlocking robotic path
Sep 16 18:24:20 backserv tldd[240]: [ID 855042 daemon.notice] DecodeDismount(): TLD(0) drive 3, Actual status: Robotic dismount failure
Sep 16 18:24:20 backserv tldd[240]: [ID 988737 daemon.error] TLD(0) drive 3 (device 2) is being DOWNED, status: Robotic dismount failure
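
  To see which st device, command and ASC value dominate, the kernel
messages above can be grouped with a short script.  This is only a sketch in
Python 3 under stated assumptions: each syslog message is on a single
unwrapped line (as in /var/adm/messages itself, not as wrapped by a mail
client), and the st driver words its messages as in this sample.  The names
tally_st_errors and SAMPLE and the regular expressions are illustrative only.

```python
import re
from collections import Counter

# Patterns modelled on the st driver messages quoted in this thread.
WARN_RE = re.compile(r"scsi: \[ID \d+ kern\.warning\] WARNING: \S+ \((?P<dev>st\d+)\):")
CMD_RE = re.compile(r"Error for Command: (?P<cmd>.+?)\s+Error Level: \w+")
ASC_RE = re.compile(r"ASC: (?P<asc>0x[0-9a-fA-F]+) .*ASCQ: 0x[0-9a-fA-F]+")
TRANSPORT_RE = re.compile(r"SCSI transport failed: reason '(?P<reason>\w+)'")


def tally_st_errors(lines):
    """Count errors keyed by (device, command, detail)."""
    counts = Counter()
    dev = cmd = None
    for line in lines:
        m = WARN_RE.search(line)
        if m:  # start of a new st error report
            dev, cmd = m["dev"], None
            continue
        if dev is None:
            continue  # not inside an st error report
        m = CMD_RE.search(line)
        if m:
            cmd = m["cmd"]
            continue
        m = TRANSPORT_RE.search(line)
        if m:  # transport failures carry no command or ASC
            counts[(dev, "transport failed", "reason " + m["reason"])] += 1
            dev = None
            continue
        m = ASC_RE.search(line)
        if m and cmd is not None:
            counts[(dev, cmd, f"ASC {int(m['asc'], 16):#04x}")] += 1
            dev = cmd = None
    return counts


# A few lines from the sample above, one message per line.
SAMPLE = [
    "Sep 16 15:17:53 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@2,1/st@0,0 (st21):",
    "Sep 16 15:17:53 backserv        Error for Command: write file mark    Error Level: Fatal",
    "Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  Sense Key: Media Error",
    "Sep 16 15:17:53 backserv scsi: [ID 107833 kern.notice]  ASC: 0x81 (<vendor unique code 0x81>), ASCQ: 0x0, FRU: 0x0",
    "Sep 17 08:53:09 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@2,1/st@0,0 (st21):",
    "Sep 17 08:53:09 backserv        Error for Command: write    Error Level: Fatal",
    "Sep 17 08:53:09 backserv scsi: [ID 107833 kern.notice]  ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x0",
    "Sep 16 18:23:21 backserv scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5,1/st@2,0 (st37):",
    "Sep 16 18:23:21 backserv        SCSI transport failed: reason 'timeout': giving up",
]

for (dev, cmd, detail), n in sorted(tally_st_errors(SAMPLE).items()):
    print(f"{dev}  {cmd}  {detail}  x{n}")
```

  Grouping by device separates a single bad drive or channel from errors
spread across all drives; grouping by ASC separates the vendor-unique 0x81
cases from plain write errors (0x0c).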


-Jason
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu