Veritas-bu

[Veritas-bu] HELP - media and I/O errors

2004-02-17 10:48:50
Subject: [Veritas-bu] HELP - media and I/O errors
From: pspotts AT geisinger DOT edu (Paul Spotts)
Date: Tue, 17 Feb 2004 10:48:50 -0500
We have an L700 with IBM LTO2 drives. We had many nights with the same
media errors. We worked at it for about a week. We had the same
suggestion (from STK) of changing the media unmount delay, and it did
absolutely nothing. Finally STK upgraded the code on the drives to
firmware version 38D0 and we have not had an error since.



Paul Spotts
Server Management Group
(570)271-5180
pspotts AT geisinger DOT edu

>>> George Drew <gdrew AT deathstar DOT org> 2/17/04 9:58:19 AM >>>
Changing the Media Unmount Delay will have absolutely no effect on this
*at all*. Media unmount delay is the amount of time nbu waits from the
end of a *user* backup to the time nbu issues the unload. In other
words, setting this is not going to affect how long nbu waits for a
rewind to complete (a silly idea anyway, as tape scsi commands are
synchronous - nbu *must* wait for the unload to complete before it can
do anything else), because nbu doesn't even send the rewind until media
unmount delay expires.
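
As an illustration of the timing described above, here is a toy model in
Python. It is not NetBackup code: the function name, the dataclass, and the
90-second rewind/unload time are assumptions made up for the example. It only
shows that, if the unload is issued when the delay expires and the caller
then blocks until the drive finishes, a longer delay moves the unload later
without changing how long the drive is waited on.

```python
from dataclasses import dataclass


@dataclass
class UnloadTimeline:
    unload_issued_at: float     # seconds after the backup ends
    unload_completed_at: float  # seconds after the backup ends
    blocked_on_drive: float     # seconds spent waiting for the drive


def unload_timeline(unmount_delay_s: float, rewind_unload_s: float) -> UnloadTimeline:
    """Toy model: nothing is sent to the drive until the delay expires;
    the unload then runs synchronously for rewind_unload_s seconds."""
    issued = unmount_delay_s
    completed = issued + rewind_unload_s
    return UnloadTimeline(issued, completed, completed - issued)


# Compare a 3-minute delay with a 15-minute delay.
# The 90 s drive time is an arbitrary, assumed figure.
for delay in (180, 900):
    t = unload_timeline(delay, rewind_unload_s=90)
    print(f"delay={delay}s issued={t.unload_issued_at}s "
          f"completed={t.unload_completed_at}s blocked={t.blocked_on_drive}s")
```

In this model the time spent blocked on the drive is the same for both
delays; only the moment the unload is issued changes.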

There are some issues related to how nbu reacts to finding a tape in a
drive that it didn't put there, and you should ensure that all media
servers are patched to MP6 or FP6.

George

On Mon, 16 Feb 2004, Denis Petrov wrote:

> I had a similar issue on an L80. The STK tech suggested that the issue
> may be related to the NetBackup setting "Media unmount delay", which is
> way too short at its default of 3 minutes. What is happening is that the
> tape is getting unmounted before it has completely rewound. I had some
> arguments with my co-workers about it (speed of LTOs vs. DLTs, since
> these issues never came up when we used DLTs), but I was able to confirm
> the issues in my L80 logs: some are exactly what the tech suggested and
> some are similar. In addition, the issues do not seem to show up right
> away, only once the LTO tapes hold a significant amount of data (takes
> longer to rewind?). I say if everything else fails, try changing Media
> unmount delay to something like 15 minutes or so and see if it makes any
> difference.
>
> --Denis
>
>   ----- Original Message -----
>   From: Dave Markham
>   To: Sokolowski Ric-ERS004 ; veritas-bu AT mailman.eng.auburn DOT edu
>   Sent: Wednesday, February 11, 2004 9:23 AM
>   Subject: Re: [Veritas-bu] HELP - media and I/O errors
>
>
>   Have the cables connecting the devices been replaced?
>
>   I had a similar problem recently with LTO drives and tried many things.
>   My environment was Solaris and there were some patches to apply
>   (although that doesn't help you, sorry), but the cables were mentioned,
>   plus I found that LTO media has a chip inside it which can be dislodged.
>   If you shake the tapes and they rattle loudly then most likely they are
>   damaged. This could affect more than one tape if they have come from
>   the same batch.
>
>   Just some ideas
>   Dave
>     ----- Original Message -----
>     From: Sokolowski Ric-ERS004
>     To: 'veritas-bu AT mailman.eng.auburn DOT edu'
>     Sent: Wednesday, February 11, 2004 4:28 PM
>     Subject: [Veritas-bu] HELP - media and I/O errors
>
>
>     Our system:
>
>     NB 4.5 MP5
>     master - HP-UX 11.00
>     media - 4 HP-UX 11.00, 1 HP-UX 11.11
>     STK L700 (HP20/700) w/10 HP LTO 1 drives w/SSO
>     5 HP 2/1 FC/SCSI bridges
>     1 Brocade 2800
>
>     We're seeing tons of media-related errors (70% status 86 - media
>     position, 30% status 84 - media write) spread across all drives.
>     Some nights we see no errors, other nights we'll see 50-100
>     media-related failures. We see the failures when reusing tapes and
>     with brand new tapes. All drives have been cleaned recently. We have
>     had cases open w/Veritas and HP for just over 4 weeks now. Veritas
>     has examined over a month's worth of log files and has determined
>     that the problem is hardware related. HP replaced 3 drives, and we
>     saw media failures on these 3 new drives the same day they were
>     replaced. HP also replaced the robot controller, the camera, and one
>     of the Fibre bridges. We're not seeing any communication errors on
>     the FC switch. Everything has the latest available firmware.
>     Whenever we get the status 84/86, we see a lot of things like
>     "cannot read from media socket 10", "ioctl (MTREW) failed on media
>     id 402280, drive index 4, I/O error (bptm.c.7197)" and "write error
>     on media id 402280, drive index 4, writing header block, I/O error".
>     Normally, between 2 and 5 drives are downed every night - always
>     with a tape stuck in the drive. Occasionally the system will freeze
>     dozens of tapes because they're seen as "unmountable", which leads
>     to a boatload of status 96 (no media) failures. Our backup success
>     rate has dropped from over 98% to below 80% - management is freaking
>     out. We're grasping at straws here folks, any help would be GREATLY
>     appreciated!
>
>     --
>     Regards,
>     Ric Sokolowski (Ric.Sokolowski AT motorola DOT com)
>     Staff Systems Engineer
>     Phone: (954) 723-6332
>     Pager: 9545530742 AT messaging.nextel DOT com
>     Motorola, Inc.  / CGISS / Enterprise Computing
>     8000 West Sunrise Blvd, MS 22-2F, Plantation, FL 33322
>
>
>
>
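
When checking whether failures like those in the original post cluster on
particular drives or tapes, one option is to tally the "media id" and "drive
index" fields from the error messages. The sketch below is only a rough
illustration: the two sample strings are the messages quoted in the post,
while the regular expression, the function name, and the assumption that each
message arrives as one line of text are mine. Real bptm log lines carry extra
fields not shown in this thread.

```python
import re
from collections import Counter

# Matches the "media id <id>, drive index <n>" fragment seen in the quoted messages.
PATTERN = re.compile(r"media id (?P<media>\w+), drive index (?P<drive>\d+)")


def tally(lines):
    """Count matching error messages per drive index and per media id."""
    by_drive, by_media = Counter(), Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            by_drive[int(match.group("drive"))] += 1
            by_media[match.group("media")] += 1
    return by_drive, by_media


# The two messages quoted in the original post, each as a single line.
sample = [
    "ioctl (MTREW) failed on media id 402280, drive index 4, I/O error (bptm.c.7197)",
    "write error on media id 402280, drive index 4, writing header block, I/O error",
]

by_drive, by_media = tally(sample)
print("errors per drive index:", dict(by_drive))
print("errors per media id:", dict(by_media))
```

Counts spread evenly over every drive would point away from a single bad
drive; counts piling up on a few media ids would point at the tapes.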
_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


