ADSM-L

I/O Errors and Tape Dismount issues

2005-02-16 07:38:33
Subject: I/O Errors and Tape Dismount issues
From: Joni Moyer <joni.moyer AT HIGHMARK DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 16 Feb 2005 07:38:08 -0500
Hi All!

I posted this issue to the group the other day, but it appears that
no one wants to take credit for these wonderful errors.  (I have a lot
more, but you get the idea.)  These errors have been occurring for some
time now, and I still have not been able to find what is causing them.
I have included information from all of the vendors below.  We have a
TSM AIX 5.2.2.5 server connected to an STK SL8500 library with LTO2
drives.  We are using Gresham EDT 6.4.6 to control drive sharing.  Any
suggestions would be appreciated.  I am at a loss...  Thanks!



Date/Time             Message
--------------------  ----------------------------------------------------------
02/14/05 09:31:16     ANR8302E I/O error on drive SL8500 (/dev/rmt18) (OP=REW,
                      Error Number=46, CC=0, KEY=02, ASC=3A, ASCQ=00,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.3A.00.30.00.10.13.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.13.00,
                      Description=An undetermined error has occurred).  Refer
                      to Appendix D in the 'Messages' manual for recommended
                      action. (SESSION: 30844)

02/14/05 11:20:08     ANR8302E I/O error on drive SL8500 (/dev/rmt16) (OP=REW,
                      Error Number=46, CC=0, KEY=02, ASC=3A, ASCQ=00,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.3A.00.30.00.10.13.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.13.00,
                      Description=An undetermined error has occurred).  Refer
                      to Appendix D in the 'Messages' manual for recommended
                      action. (SESSION: 30844)

02/14/05 12:26:36     ANR8302E I/O error on drive SL8500 (/dev/rmt12) (OP=WRITE,
                      Error Number=46, CC=0, KEY=02, ASC=04, ASCQ=02,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.04.02.30.00.10.12.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.13.00,
                      Description=An undetermined error has occurred).  Refer
                      to Appendix D in the 'Messages' manual for recommended
                      action. (SESSION: 31877)




Ideas from IBM Support


This is extremely odd. As a matter of fact, this probably flat out
shouldn't happen.
-
Normally, I would think this is the customer having some sort of
pathing problem or library sharing protocol problem. But in this
case, TSM is more at the mercy of your library manager since we
don't know about drives and paths when using Gresham.
-
Here's why it shouldn't happen.
-
The OP code is LOCATE. Locate is not the first thing we do with a
drive. We have to open it, read the label, then maybe read some
more data, then issue locate. Meaning, we have had this drive
open, confirmed that it has the right tape in it, and then at some
point we do a locate to a block somewhere out in the middle of the
tape.
-
However, this error implies that there is no tape in the drive at
the time we issued the locate request. This comes down to one of
two things:
1. The drive had some problem and responded very much in error
to the situation.
2. Somewhere along the chain, some device sent a SCSI command to
the wrong drive. This is probably much more likely.
-
I may be able to shed more light on this if you can send me the TSM
activity log from the time that process started, but I also may not.
If you suspect that you are having problems with some fancy device
you have between your TSM server and your tape drives, this is a
very likely explanation for the problem.
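
The "no tape in the drive" reading can be checked against the KEY, ASC and ASCQ fields of each ANR8302E message. The following is a minimal Python sketch, not anything supplied by IBM: the lookup tables hold only the standard SCSI codes that appear in the messages posted here (sense key 02 is NOT READY, ASC/ASCQ 3A/00 is "medium not present", 04/02 is "logical unit not ready, initializing command required"). The function name `decode_anr8302e` is made up for illustration, and the regular expression assumes each message has first been unwrapped onto a single line.

```python
import re

# Standard SCSI sense keys and ASC/ASCQ pairs; only the codes seen in this thread.
SENSE_KEYS = {0x02: "NOT READY"}
ASC_ASCQ = {
    (0x3A, 0x00): "medium not present",
    (0x04, 0x02): "logical unit not ready, initializing command required",
}

# Assumption: the message text has been unwrapped onto one line.
PATTERN = re.compile(
    r"\((/dev/\w+)\)\s*\(OP=(\w+),.*?KEY=([0-9A-F]{2}),\s*"
    r"ASC=([0-9A-F]{2}),\s*ASCQ=([0-9A-F]{2})"
)

def decode_anr8302e(message):
    """Return drive, operation and decoded sense fields, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    drive, op, key, asc, ascq = m.groups()
    key, asc, ascq = int(key, 16), int(asc, 16), int(ascq, 16)
    return {
        "drive": drive,
        "op": op,
        "sense_key": SENSE_KEYS.get(key, f"unknown key {key:02X}"),
        "condition": ASC_ASCQ.get(
            (asc, ascq), f"unknown ASC/ASCQ {asc:02X}/{ascq:02X}"
        ),
    }

# First message from the activity log excerpt, cut off after the ASCQ field.
msg = ("ANR8302E I/O error on drive SL8500 (/dev/rmt18) (OP=REW, "
       "Error Number=46, CC=0, KEY=02, ASC=3A, ASCQ=00")
print(decode_anr8302e(msg))
```

Codes that are not in the two tables are reported as unknown rather than guessed at, so the tables would need extending for other errors.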


Ideas from STK Support


TSM activity logs reveal that backups are succeeding although I do see some
write errors. However, there are a large number of LOCATE errors during
reclamation. This may indicate a problem in:


1. TSM configuration
2. Gresham component
3. Media
4. Drives (since the backups are successful, it is unlikely the problem is
on the drive side)
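
One way to see whether errors cluster on a particular operation (such as LOCATE during reclamation) or on a particular drive is to tally the ANR8302E messages by operation and device. The Python sketch below is only an illustration under assumptions: `tally_errors` is a made-up helper name, the messages are assumed to be unwrapped onto one line each, and the sample input is just the three messages quoted in this post, cut off after the OP field.

```python
import re
from collections import Counter

# Captures the device special file and the failing operation.
PATTERN = re.compile(
    r"ANR8302E I/O error on drive \S+ \((/dev/\w+)\)\s*\(OP=(\w+),"
)

def tally_errors(messages):
    """Count ANR8302E errors per operation and per (drive, operation) pair."""
    by_op = Counter()
    by_drive_op = Counter()
    for message in messages:
        m = PATTERN.search(message)
        if m:
            drive, op = m.groups()
            by_op[op] += 1
            by_drive_op[(drive, op)] += 1
    return by_op, by_drive_op

# The three messages from the activity log excerpt, cut off after OP.
messages = [
    "ANR8302E I/O error on drive SL8500 (/dev/rmt18) (OP=REW,",
    "ANR8302E I/O error on drive SL8500 (/dev/rmt16) (OP=REW,",
    "ANR8302E I/O error on drive SL8500 (/dev/rmt12) (OP=WRITE,",
]
by_op, by_drive_op = tally_errors(messages)
for op, count in by_op.most_common():
    print(op, count)
```

Errors spread evenly across many drives would point away from a single bad drive, while a pile-up on one device would point toward it.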


Ideas from Gresham Support


I searched through the TSM log for I/O errors, and then
cross-checked those against the EDT log.
I have attached a file which merges the I/O errors from
the TSM log and the EDT diagnostics, so that we can see
the order of the messages.

It looks to me like the problem starts with errors communicating
with the drive - through the data stream - with TSM.  Then,
apparently TSM tries to dismount the drive through EDT.  When
EDT tries to comply, it gets a DISMOUNT FAILURE and a LIBRARY
ERROR.

So it looks like both the control path (through EDT) and the data
path (through the device driver, directly to the drive) fail
at the same time.  The ACSLS server may be the problem, but
I suggest that there may be a single point of failure in the
communications path which affects both the control path and
the data path.  Perhaps you have a network component
which is intermittently failing?

I also suggest you look at the ACSLS logs at the
same time for clues.
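
The kind of merge Gresham describes (interleaving entries from two logs by timestamp so the order of events is visible) can be sketched in Python as follows. This is not Gresham's tool. The EDT and ACSLS log formats are not shown in this thread, so the sketch assumes every source has already been reduced to (timestamp, text) pairs using the same timestamp layout as the TSM activity log. The EDT entry in the sample is a placeholder, not real EDT output, and `tag` and `merge_logs` are made-up names.

```python
from datetime import datetime
from heapq import merge

# Layout of the TSM activity log timestamps; assumed for the other logs too.
TS_FORMAT = "%m/%d/%y %H:%M:%S"

def tag(entries, source):
    """Turn (timestamp_string, text) pairs into sorted (datetime, source, text)."""
    return sorted(
        (datetime.strptime(ts, TS_FORMAT), source, text) for ts, text in entries
    )

def merge_logs(tsm_entries, edt_entries):
    """Interleave both logs in time order."""
    return list(merge(tag(tsm_entries, "TSM"), tag(edt_entries, "EDT")))

tsm = [
    ("02/14/05 09:31:16", "ANR8302E I/O error on drive SL8500 (/dev/rmt18)"),
    ("02/14/05 11:20:08", "ANR8302E I/O error on drive SL8500 (/dev/rmt16)"),
]
edt = [
    # Placeholder entry: the timestamp and text are invented for illustration.
    ("02/14/05 09:31:20", "<EDT diagnostic line would go here>"),
]

for when, source, text in merge_logs(tsm, edt):
    print(when.strftime(TS_FORMAT), source, text)
```

ACSLS entries could be passed through `tag` in the same way and added as a third argument to `merge`, which would put all three views of the same failure on one timeline.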

********************************
Joni Moyer
Highmark
Storage Systems
Work:(717)302-6603
Fax:(717)302-5974
joni.moyer AT highmark DOT com
********************************
