ADSM-L

[ADSM-L] The tale of the tape: library problems, or, why I've grown to hate tape!

2012-07-23 13:43:01
From: Richard Rhodes <rrhodes AT FIRSTENERGYCORP DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Mon, 23 Jul 2012 13:24:25 -0400
Hello,

This isn't a question as much as it is relaying frustrations with tape
drives, libraries and scsi reservations.  (ok, I'm dumping . . .there, I
said it!)

(note:  all comments relate to AIX v6, TSM v5.5, Atape 12.0.9.0, IBM 3584
lib, IBM 3592-E05 drives)

This past weekend we had two major tape library problems.


1)  The case of the burned library card.

Last Thursday at 6pm we had one of our 3584 libs die.  The circuit card
on the robot where the long ribbon cable attaches (X-axis - down the
length of the lib) FRIED.  Yup, it had nice black burn markings.

The lib died with many drives in use.  We left the tapes sitting in
the drives.  After getting the lib fixed we restarted the library
manager instance.  It was interesting to watch.  As the instance
started up it initialized the lib and tried to unmount all the tapes.
This was a SLOW process (~30s per tape).  It was able to unload ALL
BUT 1 tape.  The tape that was stuck was in a drive that was in some
invalid condition.

When the library was booted up it reported this drive as being
unavailable.  It was broke.  We tried several halt/restarts of the
library manager but it would not, or could not, unmount that one tape.
This isn't surprising, since that drive was broke.  We finally grabbed
the one tape out of the drive, stuck it in the cap door and let the lib
put it away.

We then ran an audit, which failed.  Tried again, which failed.  As we
watched the library during the audit you could see the robot move over
to the broke drive that should have had the last remaining tape and
try/try/try to unmount it!  But we had put it away manually!
Dumb system . . . just audit the lib and be done with it!!!!  We finally
got that tape out of its home slot, put it in the broken drive (not
loaded, but sitting in the unloaded position), and started an audit.
The audit unmounted the tape and completed successfully.


2)  The case of the not-communicating AIX instance.

Sometime early Sunday morning our other 3584 library lost contact with
its library manager.  This is at a different data center - different
lib and library manager.  There are dozens of processes on the library
client TSM instances waiting for tapes to dismount and mount.  I start
looking into things.

First the library.  Pull up Specialist, which says the library is OK.
Then pull up the AIX instance for the library manager.  There are no
AIX errors (errpt).
Check library client instances - no hdwr errors.
Check the san - all switches are up and good for +200 days.
Check server-to-server communications (ping server cmd) and the TSM
instances can all communicate with each other.
Check TSM logs and find nothing related to this problem.
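
The checks above walk the stack one layer at a time.  A minimal Python
sketch of that triage order follows.  The layer names come from the
list above; the pass/fail values, the function name and the wording of
the conclusion are illustrative assumptions, not output from any real
tool.

```python
# Triage order taken from the checks described above.
# The True/False results are hypothetical and entered by hand.
TRIAGE_ORDER = [
    "library status (Specialist)",
    "AIX error log on library manager (errpt)",
    "library client hardware errors",
    "SAN switches",
    "server-to-server communications (ping server)",
    "TSM logs",
]


def triage(results):
    """Return the first layer that failed its check, or a note that
    every visible layer looks healthy."""
    for layer in TRIAGE_ORDER:
        if not results.get(layer, False):
            return "investigate: " + layer
    return "all layers look healthy; suspect stale state above them"


if __name__ == "__main__":
    # In case 2 every check came back clean.
    clean = {layer: True for layer in TRIAGE_ORDER}
    print(triage(clean))
```

When every layer reports healthy, as it did here, the sketch just says
so; it cannot see stale reservations or out-of-sync mount requests.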

I shut down and restart the library manager.  I watch the library
Specialist as TSM comes up.  I expect to see tapes unmounting, but
nothing happens.  I let it sit for 15m or so . . . nothing.  It's like
TSM still isn't communicating with the lib.

I then shut down the TSM library manager and reboot AIX.
This time as TSM comes up I can see the drives being dismounted
one at a time.  GOOD!  It unmounted all but 3 drives.  I never get the
message on the library mgr that the library is ready.
While 3 drives still have tapes in them, no new tape mounts occur.
The dozens of mounting/dismounting processes across the
library clients are still there.

I notice on the library mgr AIX instance that AIX threw 3 scsi
reservation errors for those 3 drives still with tapes.  OK, I've got
to free the reservation.
I pause the lib, pull the tapes out of the drives, power cycle the
drives, and put the tapes back in them (in the unload position).  As
soon as the lib became available TSM unmounted the 3 drives.
GOOD, making progress!

As soon as the 3 drives were unmounted, TSM mounted 2 tapes.
One of the library clients fired up a reclamation.  This told me
I now had good communications from library client down to the library.
But, as I watched, none of the dozens of pending mounts/dismounts
on the clients was processed.  It's like the library mgr and clients
are out of sync and won't/can't get back in step.  This is especially
a problem because many of the processes that are waiting on
tapes are migrations.  If they don't get working I'm going to overflow
my disk pools.  I assume they will time out at some point, but I want
this working.  I don't want my disk pools blown because old migration
processes are hung.
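
To put a number on the disk pool worry: the time left before a pool
overflows is roughly its free space divided by the rate at which
backups keep arriving while migration is hung.  A small Python sketch,
where the function name and every figure are made-up placeholders, not
numbers from our environment:

```python
def hours_until_full(capacity_gb, used_gb, ingest_gb_per_hour):
    """Rough hours until a disk pool fills if nothing migrates out.

    All inputs are estimates supplied by the caller.
    """
    if ingest_gb_per_hour <= 0:
        raise ValueError("ingest rate must be positive")
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / ingest_gb_per_hour


if __name__ == "__main__":
    # Hypothetical pool: 10 TB total, 7 TB used, 500 GB/hour arriving.
    print(hours_until_full(10000, 7000, 500))
```

If that figure is shorter than however long the hung migrations take
to time out, waiting them out is not an option.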

I halt the one TSM client instance with the most pending
mounts/dismounts.  When it comes back up I am able to fire
up migration, which works!!!!   So, I now halt/restart all the TSM
library clients.

All is good!

(have I ever said I hate tape?)


3)  The case of scsi reservation processing.

In case 2 above I restarted all the library clients.  This was true of
all the main TSM instances at both datacenters.  The TSM instances
can access both tape libraries.  They have access to the local library
for primary pools, and access to the remote library for copy pools.
Depending on the instance, the hung mounts/dismounts were
mostly migrations or backup stgpool  processes.

So when I restarted the TSM library clients, many processes
accessing the working library (which the previous day had the burnt
card) also went down.  This morning I see that the library manager for
that other lib threw scsi reservation errors for every tape drive in
use.  Apparently that library mgr saw the loss of communications
and tried to unmount the drives, but couldn't because the drives
were locked by the library clients.  And, when the TSM library clients
were back up, they did NOT free the scsi reservation on the
drives they had been using!  TSM kept retrying to free the locks.  At
exactly 2hr 15m later the dismounts succeeded.

Apparently scsi reservations will mostly free themselves up after
2hr 15m.  I say "mostly" because we've had many instances of
scsi reservation errors causing TSM to continuously try dismounting
a drive for several days until we caught it.  In these cases the only
solution is to pause the lib, pull the tape out, power cycle the drive,
put the tape back in the drive (unloaded position), and let the drive
initialize; then TSM will unmount the tape and use the drive.
For 2hr15m I had 23 drives that were unusable!
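
Since a reservation that outlasts the 2hr 15m window seems to need the
manual pause/pull/power-cycle treatment, it would help to flag those
drives early instead of finding them days later.  Below is a minimal
Python sketch of that idea.  The 2hr 15m window is what we observed,
not a documented timeout; the drive names, timestamps and function
names are hypothetical, and how the retry start times get collected
(activity log, errpt, whatever) is left out.

```python
from datetime import datetime, timedelta

# Observed here, not documented: reservations "mostly" clear after 2h15m.
RESERVATION_WINDOW = timedelta(hours=2, minutes=15)


def stuck_drives(retry_started, now, window=RESERVATION_WINDOW):
    """Return drives whose dismount retries have outlasted the window.

    retry_started maps a drive name to the time retries began.
    """
    return sorted(d for d, t in retry_started.items() if now - t > window)


def drive_hours_lost(drive_count, window=RESERVATION_WINDOW):
    """Drive-hours unavailable if drive_count drives wait out the window."""
    return drive_count * window.total_seconds() / 3600


if __name__ == "__main__":
    now = datetime(2012, 7, 23, 9, 0)
    retry_started = {  # hypothetical drive names and timestamps
        "drive01": datetime(2012, 7, 23, 8, 0),   # 1 hour: still in window
        "drive02": datetime(2012, 7, 22, 20, 0),  # 13 hours: needs hands
    }
    print(stuck_drives(retry_started, now))
    print(drive_hours_lost(23))
```

For the 23 drives above, waiting out the window works out to 51.75
drive-hours of lost capacity.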

. . .I'm tired . . .  we didn't use to have these kinds of scsi
reservation problems.

(I hate tape - especially scsi reservation problems)


4)  Thoughts

The main problem I have with tape isn't the specific thing that
went wrong (library burned a card, or rebooting AIX to get
communications back).  It's how TSM (v5.5) handles the mess that is
left over.  The logic in TSM for handling tape messes is poor.

We've tried to get IBM support to help with these kinds of problems -
especially with scsi reservation errors - but we've found that IBM
support is all but unable to handle problems that cross their internal
support setup.  It's very frustrating.



Rick


