ADSM-L

Re: [ADSM-L] looking for experiences with the ibm 3584 library

2014-05-05 10:47:23
Subject: Re: [ADSM-L] looking for experiences with the ibm 3584 library
From: Zoltan Forray <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Mon, 5 May 2014 10:45:01 -0400
WOW - Deja Vu - We had these same problems on our 3494:


*1) The case of the frayed ribbon cable.*
*2) The case of the mysterious gripper failures.*

So, even though the 3584 is blazing fast, the basic
concept/structure/problems haven't changed radically!

We have just hit the 1-year mark with our 3584.  All TSM servers that
access it are RH Linux, so lin_tape is the driver.

We have had a couple of service calls throughout the year, but nothing like
with the 3494.  Some were drive problems.  One was a known firmware
problem.

Our 3584 is 7-frames (1-L23 + 6-D23) with 17 TS1130 E06 drives. We did not
feel comfortable with the High-Density approach (i.e. stack tapes up to
4-deep horizontally and then play "3-card monte" when you need to get to
the 4th deep tape).  Currently 63% full with over 1,000 JA tapes in the
mix. If we needed to reduce the number of tapes in the library, we would
replace those with JB tapes and get a 30%+ boost in tape capacity.
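
As a rough back-of-the-envelope check of that JA-to-JB swap, here is a small sketch. It assumes only the two figures above (about 1,000 JA tapes and a 30% capacity gain per cartridge) and that every tape is equally full; the function name is made up for illustration and real per-cartridge capacities depend on the drive generation.

```python
import math

def cartridges_needed(current_count: int, capacity_boost: float) -> int:
    """Cartridges needed to hold the same data when each new cartridge
    holds (1 + capacity_boost) times as much as an old one."""
    return math.ceil(current_count / (1.0 + capacity_boost))

ja_tapes = 1000   # "over 1,000 JA tapes", rounded down (assumption)
boost = 0.30      # the "30%+" figure, taken as a lower bound (assumption)

jb_tapes = cartridges_needed(ja_tapes, boost)
slots_freed = ja_tapes - jb_tapes
print(f"{jb_tapes} JB tapes would replace {ja_tapes} JA tapes, "
      f"freeing about {slots_freed} slots")
```

Under those assumptions, roughly a quarter of the JA slots would be freed.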




On Mon, May 5, 2014 at 9:09 AM, Rhodes, Richard L. <
rrhodes AT firstenergycorp DOT com> wrote:

> >We have been using a 3584 for about 12 years and have had
> >no issues at all with it. The only time it has been "down"
> >is for firmware upgrades, replacing tape drives (upgrade from
> >LTO2 to LTO4), and when we moved to our new datacenter. Very
> >stable and a great workhorse.
>
> I generally agree with this. We love our 3584's (we have two).
> They have been very good workhorses.
> BUT,  we have gone through some very frustrating problems with them!
>
> 1) The case of the frayed ribbon cable.
>
> One of the libraries had the ribbon cable that connects
> the library proper to the robot fray, which caused a short, which took out
> several cards.  It took IBM well over 30 hours to resolve.
> I think we had 3 or 4 IBM'ers onsite trying to figure
> this problem out.  They wouldn't just order a bunch of parts.
> They insisted on ordering parts one at a time as they
> decided to replace them.  The parts are all far away, causing
> many, many hours of waiting.
>
> 2)  The case of the mysterious gripper failures.
>
> The robot would get stuck with a tape suspended
> between the robot gripper and the drive mouth.  The tape
> cartridge pinned the robot. Both libraries were doing this.
> It got so bad the library would fail several times per day.
> Many grippers were exchanged; it would work well for a while,
> then go back to failing.  Long story short: the cartridge slots
> that line the walls of the library, as cartridges were
> inserted/removed, shed a powder (a light dust) that got on
> everything in the library, causing the gripper failures.
> IBM had to replace all the plastic slot things in both libraries.
> This finally resolved the problem.
>
> 3)  The case of the slow console
>
> Others have said this. There are certain options where it can
> go away for what seems like forever.  One thing I do
> once in a while is remove old cleaning cartridges.  If I
> get on auto-pilot and start hitting the menu items
> without thinking, I will inevitably hit this one item that
> requests something about all cartridges . . . and it goes away
> for what seems like forever getting that list.
>
> 4)  The case of the Web console weirdness
>
> The web console is simple to use and generally is great, but
> some functions simply don't work well.  For example, requesting
> a tape to be moved to a specific element address may or may not
> work.  We've never been able to figure out why it works sometimes
> and not others.
>
> Drive firmware upgrades can do flaky things.  We have 50 drives
> in each 3584.  When I've performed a drive firmware upgrade
> on all drives, I can count on some number of drives failing
> the upgrade.  Sometimes it's all the drives in a frame that fail.
> Those drives then have to be upgraded one at a time.
> (The drive firmware upgrade options via the web console are
> all drives at once, or one at a time.)  Sometimes,
> out of 50 drives, a third will fail the upgrade. (This is doing
> the upgrade live, where the firmware is activated on the next unmount.)
> I talked with the IBM folks about this, and the local CE thinks
> this is caused by some communications timeout in the lib.
> I opened a support case about this and got nowhere.
> Currently we have some old node cards requiring the older firmware.
> With a scheduled upgrade we are getting all Enhanced node cards.
> I'm hoping getting to the latest/greatest code resolves this.
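
The "all at once, then one at a time for the failures" routine Rick describes can be sketched roughly as follows. This is only an illustration of the retry pattern: `bulk_upgrade` and `single_upgrade` are hypothetical callables standing in for whatever you actually do through the web console, not a real 3584 or lin_tape/Atape API, and the retry count is an arbitrary assumption.

```python
from typing import Callable, Iterable, List, Tuple

def upgrade_with_fallback(
    drives: Iterable[str],
    bulk_upgrade: Callable[[List[str]], List[str]],   # hypothetical: returns drives that failed
    single_upgrade: Callable[[str], bool],            # hypothetical: True on success
    max_retries: int = 2,                             # arbitrary assumption
) -> Tuple[List[str], List[str]]:
    """Try all drives at once, then retry each failed drive individually.

    Returns (drives that failed the bulk pass, drives still failing afterwards).
    """
    failed_bulk = list(bulk_upgrade(list(drives)))
    still_failed = []
    for drive in failed_bulk:
        for _ in range(max_retries):
            if single_upgrade(drive):
                break                      # this drive is done
        else:
            still_failed.append(drive)     # never succeeded; needs a human
    return failed_bulk, still_failed
```

Anything left in the second list is what you would hand to the CE.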
>
> 5) The case of the useless dial home.
>
> Our libraries are set up for dial home when a problem comes up.
> Here we just shake our heads and sigh . . .
> Sometimes it will dial home on something as simple as an I/O
> error writing to a tape, but sometimes won't dial home if the robot hangs.
> It's almost a joke between us and the local CE's as to
> what/why/when it dials-home, or not.  No one can make sense of it.
>
> 6)  The case of the mysterious failing drive in frame 1 slot 12.
>
> One of our libraries has an ongoing problem with one particular drive,
> the drive in frame 1 drive slot 12.  This particular drive will fail
> any time it is powered off.  It goes into some weird unknown
> state that requires the drive to be replaced.  Yes, that drive has
> been replaced many, many, many times over the years.  A firmware upgrade
> requires the drive to be power cycled to activate the code?  It fails
> and needs to be replaced.  A SCSI reservation problem requires the drive
> to be power cycled?  It fails and needs to be replaced.  If the library
> has to be powered off/on (IBM doing some upgrade or something), the drive
> fails and gets replaced.  You would think that after all this time
> IBM would figure out what is wrong - nope, they have no idea!
>
> 7) Atape - the mystery of who within IBM owns it!
>
> We all use atape on our hosts for the tape lib/drive driver.
> If you ever suspect/have a problem with it, you will get nowhere in
> trying to get support from IBM.  Open a case on the 3584?  Nope,
> we don't support that - it's host software.  Open a case with AIX?
> Nope, that's not an AIX piece of software.  Open a case with TSM support?
> Nope, they have nothing to do with it.
>
>
> Now . . .as far as a 3494 vs 3584 . . .
>
> The 3584 is a SCSI library.  It is designed around the SCSI standard
> for a tape library.  This isn't bad or good, it's just different from
> how the 3494 works.  Probably the biggest thing to get used to is how
> TSM (or any backup product) keeps an inventory of tape cartridges and the
> slots (element addresses) they are in.  You never had to think about this
> for the 3494, since it was in charge of the slot the tapes were in.
> Just spend some time reading up on SCSI libraries to get familiar
> with them.
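
To make the inventory point concrete, here is a toy model of what the backup application has to track with a SCSI library: which element address each volume sits in, plus an audit against what the library itself reports. The class, method names, and element numbers are invented for illustration; this is not how TSM stores its inventory internally.

```python
from typing import Dict, List, Optional

class ScsiLibraryInventory:
    """Application-side record of volume serial -> element address."""

    def __init__(self) -> None:
        self._slot_of: Dict[str, int] = {}

    def checkin(self, volser: str, element: int) -> None:
        # One cartridge per element address.
        if element in self._slot_of.values():
            raise ValueError(f"element {element} is already occupied")
        self._slot_of[volser] = element

    def locate(self, volser: str) -> Optional[int]:
        # The application, not the library, answers "where is this tape?"
        return self._slot_of.get(volser)

    def audit(self, library_view: Dict[int, str]) -> List[str]:
        """Volumes whose recorded slot disagrees with the library's report
        (library_view maps element address -> volser found there)."""
        return sorted(v for v, e in self._slot_of.items()
                      if library_view.get(e) != v)

inv = ScsiLibraryInventory()
inv.checkin("A00001", 1025)   # element numbers are made up
inv.checkin("A00002", 1026)
# Someone moved A00002 by hand; the library now reports it elsewhere.
mismatched = inv.audit({1025: "A00001", 1027: "A00002"})
print(mismatched)
```

If a tape gets moved behind the application's back, the two views drift apart until you audit, which is the part you never had to think about with the 3494.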
>
>
>
>
> Rick




--
*Zoltan Forray*
TSM Software & Hardware Administrator
BigBro / Hobbit / Xymon Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
zforray AT vcu DOT edu - 804-828-4807