ADSM-L

Re: I/O Errors and Tape Dismount issues

2005-02-16 07:42:05
From: Iain Barnetson <Iain.Barnetson AT HALLIBURTON DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 16 Feb 2005 12:41:23 -0000
Don't you just love being stuck in the middle of multiple supplier
support teams ;) 


Regards,

Iain Barnetson
IT Systems Administrator
UKN Infrastructure Operations

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Joni Moyer
Sent: 16 February 2005 12:38
To: ADSM-L AT VM.MARIST DOT EDU
Subject: [ADSM-L] I/O Errors and Tape Dismount issues

Hi All!





I put this issue out to the group the other day, but it appears that
no one wants to take credit for these wonderful errors.  (I have a
lot more, but you get the idea.)  These errors have been occurring for
some time now and I still have not been able to find what is causing
them.  I have included information from all vendors.  We have a TSM AIX
5.2.2.5 server connected to an STK SL8500 library with LTO2 drives.  We
are using Gresham EDT 6.4.6 to control drive sharing.  Any suggestions
would be appreciated.  I am at a loss...  Thanks!



Date/Time             Message
--------------------  ----------------------------------------------------------
02/14/05 09:31:16     ANR8302E I/O error on drive SL8500 (/dev/rmt18) (OP=REW,
                      Error Number=46, CC=0, KEY=02, ASC=3A, ASCQ=00,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.3A.00.30.00.
                      10.13.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.
                      13.00, Description=An undetermined error has occurred).
                      Refer to Appendix D in the 'Messages' manual for
                      recommended action. (SESSION: 30844)

02/14/05 11:20:08     ANR8302E I/O error on drive SL8500 (/dev/rmt16) (OP=REW,
                      Error Number=46, CC=0, KEY=02, ASC=3A, ASCQ=00,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.3A.00.30.00.
                      10.13.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.
                      13.00, Description=An undetermined error has occurred).
                      Refer to Appendix D in the 'Messages' manual for
                      recommended action. (SESSION: 30844)

02/14/05 12:26:36     ANR8302E I/O error on drive SL8500 (/dev/rmt12) (OP=WRITE,
                      Error Number=46, CC=0, KEY=02, ASC=04, ASCQ=02,
                      SENSE=70.00.02.00.00.00.00.1C.00.00.00.00.04.02.30.00.
                      10.12.00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.
                      13.00, Description=An undetermined error has occurred).
                      Refer to Appendix D in the 'Messages' manual for
                      recommended action. (SESSION: 31877)
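
For anyone wanting to decode the sense data in these messages by hand, here is a minimal sketch in Python. It assumes the SENSE string is standard SCSI fixed-format sense data (response code 70h), where the sense key is the low nibble of byte 2 and the ASC/ASCQ pair sits in bytes 12 and 13. The function name decode_sense and the two small lookup tables are made up for illustration and cover only the codes seen in the entries above; they are not part of TSM or any vendor tool.

```python
# Sketch only: decodes SCSI fixed-format sense data as printed in ANR8302E.
# The lookup tables cover just the codes that appear in the log excerpt.
SENSE_KEYS = {0x02: "NOT READY"}
ASC_ASCQ = {
    (0x3A, 0x00): "MEDIUM NOT PRESENT",
    (0x04, 0x02): "LOGICAL UNIT NOT READY, INITIALIZING COMMAND REQUIRED",
}


def decode_sense(sense: str) -> dict:
    """Decode a dotted hex string such as '70.00.02. ... .13.00'."""
    # Drop whitespace left over from line wrapping, then split into bytes.
    cleaned = "".join(sense.split())
    data = bytes(int(part, 16) for part in cleaned.split(".") if part)
    if len(data) < 14:
        raise ValueError("sense data too short for fixed format")
    if data[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F           # sense key: low nibble of byte 2
    asc, ascq = data[12], data[13]  # additional sense code and qualifier
    return {
        "key": key,
        "asc": asc,
        "ascq": ascq,
        "key_text": SENSE_KEYS.get(key, "unknown"),
        "asc_text": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


if __name__ == "__main__":
    first = ("70.00.02.00.00.00.00.1C.00.00.00.00.3A.00.30.00.10.13."
             "00.00.00.00.20.20.20.20.20.20.20.00.00.00.00.00.13.00")
    print(decode_sense(first))
```

Read this way, KEY=02 with ASC=3A/ASCQ=00 means the drive reported no medium present, and ASC=04/ASCQ=02 means the logical unit was not ready and needed an initializing command.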




Ideas from IBM Support


This is extremely odd. As a matter of fact, this probably flat out
shouldn't happen.
-
Normally, I would think this is the customer having some sort of pathing
problem or library sharing protocol problem. But in this case, TSM is
more at the mercy of your library manager since we don't know about
drives and paths when using Gresham.
-
Here's why it shouldn't happen.
-
The OP code is LOCATE. Locate is not the first thing we do with a drive.
We have to open it, read the label, then maybe read some more data, then
issue locate. Meaning, we have had this drive open, confirmed that it
has the right tape in it, and then at some point we do a locate to a
block somewhere out in the middle of the tape.
-
However, this error implies that there is no tape in the drive at the
time we issued the locate request. This can come down to two things:
1. The drive had some problem and responded very much in error to the
situation.
2. Somewhere along the chain, some device sent a scsi command to the
wrong drive. This is probably much more likely.
-
I may be able to shed more light on this if you can send me the TSM
activity log from the time that process started, but I also may not.
If you suspect that you are having problems with some fancy device you
have between your TSM server and your tape drives, this is a very likely
explanation for the problem.


Ideas from STK support


TSM activity logs reveal that backups are succeeding although I do see
some write errors. However, there are a large number of LOCATE errors
during reclamation. This may indicate a problem in:


1. TSM configuration
2. Gresham component
3. Media
4. Drives (since the backups are successful, it is unlikely the problem
is on the drive side)


Ideas from Gresham Support
I searched through the TSM log for I/O errors, and then cross-checked
those against the EDT log.
I have attached a file which merges the I/O errors from the TSM log and
the EDT diagnostics, so that we can see the order of the messages.
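
A merge like this can be roughly sketched in Python as below. This is not the procedure Gresham used; it is an illustration under two assumptions: both logs are already in time order, and every entry starts with a timestamp in the TSM activity log layout (for example 02/14/05 09:31:16). The real EDT log format is not shown in this thread, so the timestamp pattern would need adjusting for it. The names parse_entries and merge_logs are hypothetical.

```python
import heapq
from datetime import datetime

# Assumed timestamp layout, taken from the TSM activity log excerpt.
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_WIDTH = 17  # length of 'MM/DD/YY HH:MM:SS'


def parse_entries(lines, source):
    """Yield (timestamp, source, text) for lines that start with a timestamp.

    Lines without a leading timestamp (wrapped continuation lines) are
    skipped in this sketch, so only the first line of each entry is kept.
    """
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_WIDTH], TS_FORMAT)
        except ValueError:
            continue
        yield ts, source, line[TS_WIDTH:].strip()


def merge_logs(tsm_lines, edt_lines):
    """Interleave two time-ordered logs into one list sorted by timestamp."""
    return list(heapq.merge(parse_entries(tsm_lines, "TSM"),
                            parse_entries(edt_lines, "EDT")))
```

Each merged item carries its source tag, so printing the list shows whether the I/O error or the dismount failure came first. heapq.merge only interleaves inputs that are already sorted, so logs that are not in time order would have to be sorted first.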

It looks to me like the problem starts with errors communicating with
the drive - through the data stream - with TSM.  Then, apparently TSM
tries to dismount the drive through EDT.  When EDT tries to comply, it
gets a DISMOUNT FAILURE and a LIBRARY ERROR.

So it looks like both the control path (through EDT) and the data path
(through the device driver, directly to the drive) fail at the same
time.  The ACSLS server may be the problem, but I suggest that there may
be a single point of failure in the communications path which affects
both the control path and the data path.  Perhaps you have a network
component which is intermittently failing?

I also suggest you look at the ACSLS logs at the same time for clues.

********************************
Joni Moyer
Highmark
Storage Systems
Work:(717)302-6603
Fax:(717)302-5974
joni.moyer AT highmark DOT com
********************************
