Veritas-bu

[Veritas-bu] HELP - media and I/O errors

2004-02-11 14:19:36
Subject: [Veritas-bu] HELP - media and I/O errors
From: mark AT steelfamily DOT org (mark)
Date: Wed, 11 Feb 2004 19:19:36 -0000

I've seen these error conditions when multiplexing is set to more
than 20. It just seems to overload the SCSI stack.

I've also seen similar errors if you set SIZE_DATA_BUFFERS higher than the
particular fibre card/driver combination can support. With fibre and LTO,
it's good to be able to use 128k buffers, but some SCSI/fibre drivers
can't cope with that internally and give you similar error returns.

What are your parameters? (Jobs per storage unit drive / jobs per policy,
number of jobs, size & number of buffers.)

If you are hitting limits, it would be worth trying a run with lower values.
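
To gather the buffer size and count asked about above, something like the
following could be run on each media server. It is only a sketch: it assumes
the settings live in touch files named SIZE_DATA_BUFFERS and
NUMBER_DATA_BUFFERS under /usr/openv/netbackup/db/config, and the function
names and the max_safe_size threshold (128k here) are made up for
illustration. The real ceiling depends on your fibre card and driver.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the buffer touch files on a UNIX media server.
DEFAULT_CONFIG_DIR = Path("/usr/openv/netbackup/db/config")


def read_setting(config_dir: Path, name: str) -> Optional[int]:
    """Return the integer stored in a touch file, or None if it is absent or empty."""
    path = config_dir / name
    if not path.is_file():
        return None
    text = path.read_text().strip()
    return int(text) if text else None


def report(config_dir: Path = DEFAULT_CONFIG_DIR, max_safe_size: int = 131072) -> dict:
    """Collect the buffer settings and flag a size above max_safe_size.

    max_safe_size is a hypothetical threshold (128k by default); set it to
    whatever your card/driver combination is known to handle.
    """
    size = read_setting(config_dir, "SIZE_DATA_BUFFERS")
    number = read_setting(config_dir, "NUMBER_DATA_BUFFERS")
    return {
        "size_data_buffers": size,
        "number_data_buffers": number,
        "size_exceeds_limit": size is not None and size > max_safe_size,
    }


if __name__ == "__main__":
    print(report())
```

A value of None means the touch file is missing, in which case the server is
running on its built-in defaults.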

Regards
Mark

  -----Original Message-----
  From: veritas-bu-admin AT mailman.eng.auburn DOT edu [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Sokolowski Ric-ERS004
  Sent: 11 February 2004 16:28
  To: 'veritas-bu AT mailman.eng.auburn DOT edu'
  Subject: [Veritas-bu] HELP - media and I/O errors


  Our system:

  NB 4.5 MP5
  master - HP-UX 11.00
  media - 4 HP-UX 11.00, 1 HP-UX 11.11
  STK L700 (HP20/700) w/10 HP LTO 1 drives w/SSO
  5 HP 2/1 FC/SCSI bridges
  1 Brocade 2800

  We're seeing tons of media-related errors (70% status 86 - media position,
  30% status 84 - media write) spread across all drives.  Some nights we see
  no errors, other nights we'll see 50-100 media-related failures.  We see
  the failures when reusing tapes and with brand new tapes.  All drives have
  been cleaned recently.  We have had cases open w/Veritas and HP for just
  over 4 weeks now.  Veritas has examined over a month's worth of log files
  and has determined that the problem is hardware related.  HP replaced 3
  drives, and we saw media failures on these 3 new drives the same day they
  were replaced.  HP also replaced the robot controller, the camera, and one
  of the Fibre bridges.  We're not seeing any communication errors on the FC
  switch.  Everything has the latest available firmware.  Whenever we get
  the status 84/86, we see a lot of things like "cannot read from media
  socket 10", "ioctl (MTREW) failed on media id 402280, drive index 4, I/O
  error (bptm.c.7197)" and "write error on media id 402280, drive index 4,
  writing header block, I/O error".  Normally, between 2 and 5 drives are
  downed every night - always with a tape stuck in the drive.  Occasionally
  the system will freeze dozens of tapes because they're seen as
  "unmountable", which leads to a boatload of status 96 (no media) failures.
  Our backup success rate has dropped from over 98% to below 80% -
  management is freaking out.  We're grasping at straws here folks, any help
  would be GREATLY appreciated!

  --
  Regards,
  Ric Sokolowski (Ric.Sokolowski AT motorola DOT com)
  Staff Systems Engineer
  Phone: (954) 723-6332
  Pager: 9545530742 AT messaging.nextel DOT com
  Motorola, Inc.  / CGISS / Enterprise Computing
  8000 West Sunrise Blvd, MS 22-2F, Plantation, FL 33322
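
The bptm messages quoted in the original post name both a media id and a
drive index, so tallying them per drive and per tape is one way to check
whether the failures really are spread evenly or cluster on particular
drives or bridges. The sketch below is illustrative only: the pattern is
built from the two message fragments quoted above, the function name is
invented, and real log lines will carry extra fields (timestamps, PIDs)
that this pattern simply ignores.

```python
import re
from collections import Counter

# Matches the "media id ..., drive index ..." fragment seen in the quoted errors.
PATTERN = re.compile(r"media id (?P<media>\w+), drive index (?P<drive>\d+)")


def tally(lines):
    """Count I/O error lines per drive index and per media id."""
    by_drive = Counter()
    by_media = Counter()
    for line in lines:
        if "I/O error" not in line:
            continue  # skip lines that are not I/O errors
        match = PATTERN.search(line)
        if match:
            by_drive[int(match.group("drive"))] += 1
            by_media[match.group("media")] += 1
    return by_drive, by_media


# The three messages quoted in the post; the socket message names no drive.
sample = [
    "ioctl (MTREW) failed on media id 402280, drive index 4, I/O error (bptm.c.7197)",
    "write error on media id 402280, drive index 4, writing header block, I/O error",
    "cannot read from media socket 10",
]

if __name__ == "__main__":
    drives, media = tally(sample)
    print("errors per drive index:", dict(drives))
    print("errors per media id:", dict(media))
```

Feeding it a night's worth of log lines and comparing the per-drive counts
against which bridge each drive sits behind may help narrow down a shared
component.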






