ADSM-L

Re: Normal # of failures on tape libraries

2005-12-20 12:19:11
Subject: Re: Normal # of failures on tape libraries
From: "Prather, Wanda" <Wanda.Prather AT JHUAPL DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 20 Dec 2005 12:19:04 -0500
1 drive failure per month, out of >34, may or may not be that far from
"normal", depending on your environment & workload.
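
For perspective, here is a minimal sketch that turns a failure count into failures per drive per year. The function name is made up for illustration, and the inputs are just the figures quoted in this thread (one failure every 1 to 1.5 months, 34 drives). It says nothing about workload, which matters a great deal.

```python
def annualized_failure_rate(failures, drives, months):
    """Return failures per drive per year over an observation window."""
    years = months / 12
    return failures / drives / years


# One failure per month across 34 drives (figures from this thread).
monthly = annualized_failure_rate(failures=1, drives=34, months=1)
# One failure every 1.5 months across the same 34 drives.
slower = annualized_failure_rate(failures=1, drives=34, months=1.5)

print(f"{monthly:.1%} of drives per year")  # 35.3% of drives per year
print(f"{slower:.1%} of drives per year")   # 23.5% of drives per year
```

So, at face value, that is roughly a quarter to a third of the drive population replaced each year.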

In the first place, no one, including IBM, has ever said that LTO
drives will take the kind of heavy-duty pounding that 359X drives will.

That is the difference between them, and why IBM sells both types of
drives:  LTO drives are designed to be inexpensive; 359X drives are
designed to be the best quality drives you can buy.  There is almost an
ORDER OF MAGNITUDE difference in the cost of an LTO1 drive and a 3592,
and there's a reason for it!

There is also a difference in what tends to cause drives to fail.
Lots of mounts/dismounts to read/write small files (which causes a lot
of start/stop activity moving the media) is a lot tougher on the drive
than if you just mount the tape and write 200GB of data from beginning
to end.

None of my sites are having any persistent problems with LTO drives (my
only LTO experience is with IBM drives), but they are also not
high-stress environments.

Now you say you have >34 LTO drives; that's a LOT in one site.
So I assume you must have a LOT of data & a LOT of activity.

I agree with everything the other posters have said:

If your drives are failing shortly after being replaced, you may have a
manufacturing problem or an installation problem.

If those drives are busy only a few hours each night dumping big
databases, they shouldn't be failing often.
If they are failing randomly with mechanical problems, look for
environmental problems (dust, heat, power).

If they are failing randomly with I/O errors, unreadable/unwriteable
data, causing data integrity problems, NOTHING should cause that.  Sit
on your vendor, keep them in the site CONSTANTLY until they have an
explanation.  Be sure you call the Field Engineering manager and stay
on his/her case AND keep the sales/marketing rep involved.  Sometimes
the field engineers reach the point they don't know what else to do.
But I believe all major vendors have second-level regional experts, and
a third-level support team that can do a post-mortem on drives and
figure out what is causing the failure.  But you have to be a squeaky
wheel to get to that level, and you have to be persistent (which is why
you have to get the Field Engineering manager involved).  You
have a LOT of hardware on the floor; yell until you get attention.

On the other hand, if your drives are taking a real pounding, busy MANY
hours a day with reclaims, migration, other TSM activity, but giving out
quietly without causing data integrity problems, you might just be
getting good value for the money!

Wanda Prather
"I/O, I/O, It's all about I/O"  -(me)

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Dennis Melburn W IT743
Sent: Tuesday, December 13, 2005 2:30 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Normal # of failures on tape libraries


Ahh, so it's the fact that they are LTO drives.  So as far as LTO drives
go then, what I am experiencing is "normal"? 


Mel Dennis

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Zoltan Forray/AC/VCU
Sent: Tuesday, December 13, 2005 2:26 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: [ADSM-L] Normal # of failures on tape libraries

I agree.  My 3590's (both B and E1A models) have been through major
pounding, for many, many years, and like the Energizer Bunny, keep going
and going. Yes, they do need some repairs/maintenance, but considering
the amount of data/mounts/tapes they go through on a daily basis, they
are like tanks. Never had a whole drive replaced. Usually things like
cleaning brushes, sometimes R/W heads, 2-3 card-packs, stuff like that.

This is in contrast to my !@#$%^&* IBM 3583/3580 LTO2 drives: over the
1.5 years I have been using them, all 8 drives have been replaced at
least once, some more.  I haven't kept strict tabs on them, but
considering I just had 3 replaced over the past 2 weeks, from my
experience, LTO2 drives are garbage.  They require weekly, if not daily,
attention.  The 2 LTO libraries have 300 tapes between them; the 3494
library with the 3590 drives has over 3700, with 400+ mounts a day!

FWIW, when I went to a "storage" show-and-tell-and-try-to-sell, the ADIC
folks told me they OEM their drives from IBM!




From: Richard Sims <rbs AT BU DOT EDU>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Date: 12/13/2005 02:01 PM
Please respond to: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: [ADSM-L] Normal # of failures on tape libraries

On Dec 13, 2005, at 11:31 AM, Dennis Melburn W IT743 wrote:

> Our sites use ADIC Scalar 1Ks as well as one ADIC 10K.  The Scalar 1Ks
> have 4 LTO1 drives in each and the 10K has 34 LTO2 drives.  We
> experience occasional failures on these drives and have to replace
> them.
> My question is, is it normal for a site that has a lot of drives to
> experience drive failures about every 1-1.5 months?  My manager is
> rather annoyed at the fact that it seems that we are constantly
> replacing drives even though it doesn't cause any downtime for our TSM
> servers while they are being replaced.  If this is a normal part of
> having tape libraries then that is fine, but I don't have enough
> experience in this field to say either way, so that is why I am asking
> all of you.

Customers with 359x drives (which are never replaced) would certainly
find that replacement frequency alarming; and from any perspective,
that's rather extreme. Your site may have periodic management-level
review meetings with the vendor, where a good explanation should be
required of the vendor. Your management might then specify that if a
resolution to the problem is not forthcoming, then they might abandon
that vendor for another. (A complication there is that ADIC has been
the OEM for some name-brand drive resellers.) Make sure they review
external factors for cause, such as bad power feeding the drives,
excessive contaminants in the local atmosphere, tapes coming back
from offsite after rough handling, etc.

In any site where drive replacement occurs with any frequency, I
would advise chronicling the serial numbers of all such drives. You
would like to believe that you are getting new drives as
replacements, where the serial number should be nearby or higher than
that being replaced - and that you don't find the same drive coming
back sometime later.
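
One way to keep such a chronicle is sketched below. It is only an illustration: the `DriveLog` class, its method names, and the serial numbers are all hypothetical, and it checks only for a previously removed drive coming back. Comparing serials for "nearby or higher" would depend on the vendor's numbering scheme, so that is left out.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DriveLog:
    """Hypothetical record of drive swaps, keyed by serial number."""

    # Maps the serial of each removed drive to the date it was pulled.
    removed: dict = field(default_factory=dict)

    def record_swap(self, removed_serial, installed_serial, when):
        """Log one replacement and return a list of warning strings."""
        warnings = []
        # A "new" drive whose serial we already pulled is a returned unit.
        if installed_serial in self.removed:
            warnings.append(
                f"{installed_serial} was already removed on "
                f"{self.removed[installed_serial]}"
            )
        self.removed[removed_serial] = when
        return warnings


# Made-up serial numbers, purely for illustration.
log = DriveLog()
print(log.record_swap("SN1001", "SN2040", date(2005, 10, 3)))   # []
print(log.record_swap("SN2040", "SN1001", date(2005, 12, 13)))  # one warning
```

A spreadsheet would do the same job; the point is simply to have the history on hand when the vendor review meeting comes around.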

    Richard Sims
