Re: [ADSM-L] looking for experiences with the ibm 3584 library

2014-05-05 09:11:48
Subject: Re: [ADSM-L] looking for experiences with the ibm 3584 library
From: "Rhodes, Richard L." <rrhodes AT FIRSTENERGYCORP DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Mon, 5 May 2014 13:09:02 +0000
>We have been using a 3584 for about 12 years and have had 
>no issues at all with it. The only time it has been "down" 
>is for firmware upgrades, replacing tape drives (upgrade from 
>LTO2 to LTO4), and when we moved to our new datacenter. Very 
>stable and a great workhorse.

I generally agree with this. We love our 3584s (we have two).
They have been very good workhorses.
BUT, we have gone through some very frustrating problems with them!

1) The case of the frayed ribbon cable.

The ribbon cable that connects the library proper to the robot
frayed on one of the libraries. That caused a short, which took out
several cards.  It took IBM well over 30 hours to resolve.
I think we had 3 or 4 IBM'ers onsite trying to figure 
this problem out.  They wouldn't just order a bunch of parts.
They insisted on ordering parts one at a time as they
decided to replace them.  The parts are all far away, causing
many, many hours of waiting.  

2)  The case of the mysterious gripper failures.

The robot would get stuck with a tape suspended
between the robot gripper and the drive mouth.  The tape
cartridge pinned the robot. Both libraries were doing this.
It got so bad the library would fail several times per day.
Many grippers were exchanged; it would work well for a while,
then go back to failing.  Long story short: as cartridges were
inserted/removed, the cartridge slots that line the walls of the
library shed a powder (a light dust) that got on
everything in the library, causing gripper failure.
IBM had to replace all the plastic slot things in both libraries.
This finally resolved the problem.

3)  The case of the slow console

Others have said this. There are certain options where it can
go away for what seems like forever.  One thing I do
once in a while is remove old cleaning cartridges.  If I
get on auto-pilot and start hitting the menu items
without thinking, I will inevitably hit this one item that
requests something about all cartridges . . . it goes away
for what seems like forever getting that list.

4)  The case of the Web console weirdness

The web console is simple to use and generally is great, but 
some functions simply don't work well.  For example, requesting
a tape to be moved to a specific element address may or may not
work.  We've never been able to figure out why it works sometimes
and not others.

Drive firmware upgrades can do flaky things.  We have 50 drives
in each 3584.  When I've performed a drive firmware upgrade
on all drives, I can count on some number of drives failing
the upgrade.  Sometimes it's all the drives in a frame that fail.
Those drives then have to be upgraded one at a time.
(The drive firmware upgrade options via the web console are
all drives at once, or one at a time.)  Sometimes
out of 50 drives, a third will fail the upgrade. (This is doing
the upgrade live, where the firmware is activated on the next unmount.)
I talked with the IBM folks about this, and the local CE thinks
this is caused by some communications timeout in the lib.
I opened a support case about this and got nowhere.
Currently we have some old node cards requiring the older firmware.
With a scheduled upgrade we are getting all Enhanced node cards.
I'm hoping getting to the latest/greatest code resolves this.
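
The "all at once, then mop up the failures one at a time" routine
above can be sketched roughly as follows. This is only an
illustration in Python: the function name, the (frame, slot) tuples
and the upgrade_one callable are all made up for the example, and
nothing here talks to a real 3584 or its web console. The per-frame
count is there because a whole frame failing together is a pattern
worth noticing.

```python
from collections import Counter

def upgrade_with_retry(drives, upgrade_one, retries=2):
    """Try every drive once, then retry only the failures.

    drives      -- iterable of (frame, slot) tuples (hypothetical naming)
    upgrade_one -- caller-supplied callable returning True on success;
                   a stand-in for whatever actually pushes the firmware
    Returns (first-pass failure count per frame, drives still failing).
    """
    # First pass: the equivalent of "all drives at once".
    failed = [d for d in drives if not upgrade_one(d)]
    first_pass_by_frame = Counter(frame for frame, _slot in failed)

    # Follow-up passes: only the drives that failed, one at a time.
    for _ in range(retries):
        if not failed:
            break
        failed = [d for d in failed if not upgrade_one(d)]

    return first_pass_by_frame, failed
```

A caller would look at first_pass_by_frame to see whether the
failures cluster in one frame, and at the returned list to see which
drives still need attention after the retries.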

5) The case of the useless dial home.

Our libraries are set up for dial home when a problem comes up.
Here we just shake our heads and sigh . . . 
Sometimes it will dial home on something as simple as an I/O
error writing to a tape, but sometimes won't dial home if the robot hangs.
It's almost a joke between us and the local CEs as to
what/why/when it dials home, or not.  No one can make sense of it.

6)  The case of the mysterious failing drive in frame 1 slot 12.

One of our libraries has an ongoing problem with one particular drive,
the drive in frame 1 drive slot 12.  This particular drive will fail
any time it is powered off.  It goes into some weird unknown
state that requires the drive to be replaced.  Yes, that drive has
been replaced many, many, many times over the years.  A firmware upgrade
that requires the drive to be power cycled to activate the code?
It fails and needs to be replaced.  A SCSI reservation problem that
requires the drive to be power cycled?  It fails and needs to be
replaced.  If the library has to be powered off/on (IBM doing some
upgrade or something), the drive fails and gets replaced.  You would
think that after all this time IBM would figure out what is
wrong - nope, they have no idea!

7) Atape - the mystery of who within IBM owns it!

We all use Atape on our hosts for the tape lib/drive driver.
If you ever suspect/have a problem with it, you will get nowhere in
trying to get support from IBM.  Open a case on the 3584?  Nope,
we don't support that - it's host software.  Open a case with AIX?
Nope, that's not an AIX piece of software.  Open a case with TSM support?
Nope, they have nothing to do with it.


Now . . .as far as a 3494 vs 3584 . . . 

The 3584 is a SCSI library.  It is designed around the SCSI standard
for a tape library.  This isn't bad or good, it's just different from
how the 3494 works.  Probably the biggest thing to get used to is how
TSM (or any backup product) keeps an inventory of tape cartridges and the
slots (element addresses) they are in.  You never had to think about this
for the 3494, since it was in charge of the slot the tapes were in.
Just spend some time reading up on SCSI libraries to get familiar
with them.
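
To make the inventory point a bit more concrete, here is a toy
model in Python. It is only a sketch: the class names, the volser
and the element addresses are invented for illustration, and it is
not how TSM or the 3584 is actually implemented. The idea it shows
is that the library just moves whatever sits at an element address,
while the backup application keeps its own volser-to-address table,
which goes stale if a cartridge is moved behind its back.

```python
class ScsiLibrary:
    """Toy library: only knows element address -> cartridge (or None)."""

    def __init__(self, elements):
        self.elements = dict(elements)

    def move(self, src, dst):
        # The library moves by address; it does not look up volsers.
        if src not in self.elements or dst not in self.elements:
            raise KeyError("unknown element address")
        if self.elements[src] is None:
            raise ValueError(f"element {src} is empty")
        if self.elements[dst] is not None:
            raise ValueError(f"element {dst} is full")
        self.elements[dst] = self.elements[src]
        self.elements[src] = None


class BackupInventory:
    """Toy backup application: keeps its own volser -> address table."""

    def __init__(self):
        self.location = {}

    def audit(self, library):
        # Rebuild the table from what the library reports right now.
        self.location = {
            volser: addr
            for addr, volser in library.elements.items()
            if volser is not None
        }

    def mount(self, library, volser, drive_addr):
        addr = self.location[volser]
        if library.elements.get(addr) != volser:
            raise LookupError(
                f"{volser} is not at element {addr}; inventory is stale"
            )
        library.move(addr, drive_addr)
        self.location[volser] = drive_addr
```

In this model, moving a cartridge from the library side and then
asking the application to mount it raises the "inventory is stale"
error until audit() is run again. With a 3494 that bookkeeping
lived in the library, so there was no second table to get out of
step.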




Rick



