Veritas-bu

[Veritas-bu] Throughput problem any ideas?

2005-05-16 10:50:44
Subject: [Veritas-bu] Throughput problem any ideas?
From: Greg.Hindle AT constellation DOT com (Hindle, Greg)
Date: Mon, 16 May 2005 10:50:44 -0400

What settings are you using for IPMP?

Greg Hindle

  _____

From: Paul Keating [mailto:pkeating AT bank-banque-canada DOT ca]
Sent: Monday, May 16, 2005 10:45 AM
To: Hindle, Greg; veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] Throughput problem any ideas?


Ok.

I thought you were somehow implying that a connection would be made to
an IP from the client, and then the IP would move back and forth
dynamically between the two NICs on your server, depending on the load
on each NIC.

So you are using Sun's IPMP in the same manner that we are.

Interesting, you say it is not a Veritas supported configuration?

We have not had any problems with IPMP until this past Wednesday
night, when a new Windows server was placed on the network and caused a
broadcast storm, which drove the CPU on a router to 100% utilization
(the same router was the defaultrouter for both the backup master and
the Windows server). The router wasn't responding to the IPMP pings
from the master server's NICs, and after MANY failovers and failbacks,
at one point BOTH NICs failed over simultaneously, taking the master
completely offline. :o(
All the jobs running at the time failed with various network-related
status codes.

Paul
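
The double failure described above follows from probe-based failure detection when every NIC probes the same target. Here is a minimal Python sketch of that idea. It is a toy model, not the actual logic of Solaris's multipathing daemon: the function name, the NIC names (ce0, ce1), and the threshold of 5 missed probes are all assumptions made for illustration.

```python
def failed_interfaces(probe_results, threshold=5):
    """Return the set of NICs considered failed.

    probe_results maps a NIC name to a list of booleans, oldest first,
    where True means the probe target replied. A NIC counts as failed
    once its most recent `threshold` probes all went unanswered.
    (The threshold value is hypothetical.)
    """
    failed = set()
    for nic, replies in probe_results.items():
        recent = replies[-threshold:]
        if len(recent) == threshold and not any(recent):
            failed.add(nic)
    return failed


# Both NICs probe the same default router. When the router stops
# answering (for example, its CPU is pinned at 100%), both NICs see
# the same run of missed probes and are marked failed together.
router_replies = [True, True, False, False, False, False, False]
observed = {"ce0": router_replies, "ce1": router_replies}
print(sorted(failed_interfaces(observed)))  # ['ce0', 'ce1']
```

The point of the sketch is that a shared probe target is a single point of failure for the detection logic itself: the NICs and links can be healthy while the group still goes offline.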

        -----Original Message-----
        From: Hindle, Greg [mailto:Greg.Hindle AT constellation DOT com]
        Sent: May 16, 2005 10:37 AM
        To: Paul Keating; veritas-bu AT mailman.eng.auburn DOT edu
        Subject: RE: [Veritas-bu] Throughput problem any ideas?

        Thanks Paul. It's funny how we use everyday terms to talk about
        tech issues. We use IPMP to load balance our NICs. Set to
        active-active, both cards are running and will respond to IP
        requests for the 2 IP addresses that we assigned to that server.
        This is what I meant by round-robin... :). I have forwarded your
        link on to the other engineers here for review. Thanks, and keep
        the suggestions coming...

        Greg Hindle

  _____

        From: veritas-bu-admin AT mailman.eng.auburn DOT edu
        [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
        Paul Keating
        Sent: Monday, May 16, 2005 9:53 AM
        To: veritas-bu AT mailman.eng.auburn DOT edu
        Subject: RE: [Veritas-bu] Throughput problem any ideas?

        I am unfamiliar with the setup you are running with the "round
        robin IP addressing".

        I currently have 2 GigE NICs in my master, and they are
        configured using Sun's IP Multipathing (IPMP).
        Each card has its own address, and outbound connections are made
        in a round-robin fashion: when a backup job starts, it goes
        through one of the two NICs. (I'm not sure of the algorithm used
        for selecting.) Because the client sees the request coming from
        that IP, it replies to that interface, so this provides us with
        some load-balancing capability.
        It's a built-in option in Solaris 8 and later.
        It also provides failover: in the event that one connection goes
        down, its IP will be picked up by the other NIC.

        As for all the jobs starting and not transferring any data, we
        had an issue like that. I'd come in in the morning and we'd have
        a bunch of jobs hung, and everything after them would be queued.

        We applied part 2 of this technote, regarding message queues,
        and the problem went away.
        http://seer.support.veritas.com/docs/268122.htm

        Paul
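
The round-robin outbound selection and address failover described in this message can be sketched as a small Python model. This is only an illustration under stated assumptions: the real selection algorithm is unknown (as the message itself says), and the class name, NIC names (ce0, ce1), and addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
from itertools import cycle


class MultipathGroup:
    """Toy model of NICs that share one multipathing group."""

    def __init__(self, interfaces):
        # interfaces: dict of NIC name -> the address it normally owns
        self.home = dict(interfaces)
        self.up = {nic: True for nic in interfaces}
        self._order = cycle(list(interfaces))

    def fail(self, nic):
        self.up[nic] = False

    def repair(self, nic):
        self.up[nic] = True

    def pick_outbound(self):
        """Return (nic, source_address) for the next new connection."""
        for _ in range(len(self.home)):
            nic = next(self._order)
            if self.up[nic]:
                return nic, self.home[nic]
        raise RuntimeError("no working interface in group")

    def carrier_of(self, address):
        """Return the NIC currently answering for an address."""
        for nic, addr in self.home.items():
            if addr == address:
                if self.up[nic]:
                    return nic
                # Failover: a surviving NIC picks up the address.
                for other, ok in self.up.items():
                    if ok:
                        return other
                raise RuntimeError("no working interface in group")
        raise KeyError(address)


group = MultipathGroup({"ce0": "192.0.2.10", "ce1": "192.0.2.11"})
print([group.pick_outbound()[0] for _ in range(4)])  # ['ce0', 'ce1', 'ce0', 'ce1']
group.fail("ce0")
print(group.carrier_of("192.0.2.10"))  # ce1
```

Because each new connection keeps the source address it started with, the client's replies come back on the same interface, which is where the load spreading comes from. Balancing is per connection, not per packet.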

                -----Original Message-----
                From: veritas-bu-admin AT mailman.eng.auburn DOT edu
                [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
                Hindle, Greg
                Sent: May 16, 2005 9:41 AM
                To: veritas-bu AT mailman.eng.auburn DOT edu
                Subject: [Veritas-bu] Throughput problem any ideas?
                Importance: High

                Hello all,

                We have been having a problem here with our jobs basically
                hanging. The failure rate is 50-80%. This came on all of a
                sudden, and we have been working on it for almost a week now
                with many people involved.

                Our setup: we currently back up over 1200 servers each day.
                We have 2 data centers, with a media server at each location
                and 1 master at one location. I will call them site 1 and
                site 2. All servers at site 1 back up to site 2, and all
                servers at site 2 back up to site 1. Both sites have 2 L700
                tape units with about 30+ drives. Our media and master
                servers have 2 gig NICs and use round-robin IP addressing,
                meaning the IP address is not tied to a card; rather, the
                addresses bounce back and forth in order to maximize
                throughput. We are using EtherChannel at the site that has
                the master.

                This setup had worked great since January of this year. Then
                one night it all stopped. Failure rates were 50-80%. The
                media servers would connect to the client PCs, then no data
                would pass, while other servers worked fine. We struggled
                and looked at everything to find the cause. No changes were
                made to the Veritas network or the data network. Veritas
                would not help us because they said we were in an
                unsupported network config. We did send them some logs, and
                they did say we have a packet reordering problem, but that
                was the extent of the help.

                So over the weekend we reduced the NICs in our master and
                media servers to 1 and removed 1 IP address as well, in
                order to stabilize the backup network. It worked to a point;
                however, we have doubled our backup times.

                I am sending this here in hopes that others can share their
                setups, AND also to ask if anyone has a setup that IS
                approved by Veritas that can get more than a gig of
                throughput on the media and master servers. We want to
                understand the best way to set up our Solaris 8 master and
                media servers according to Veritas.

                Greg Hindle
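
One plausible way that traffic alternating between two paths can produce the packet reordering mentioned above is sketched below in Python. This is an assumption offered for illustration, not a diagnosis of this site: the function name, the per-link latencies, and the send interval are all hypothetical values.

```python
def arrival_order(num_packets, latencies, interval=1.0):
    """Return packet sequence numbers in the order they arrive.

    Packets are sent every `interval` time units and assigned to links
    in strict rotation. latencies[i] is the one-way delay of link i.
    """
    arrivals = []
    for seq in range(num_packets):
        link = seq % len(latencies)
        arrivals.append((seq * interval + latencies[link], seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


# Two links with equal delay: packets arrive in the order sent.
print(arrival_order(6, [1.0, 1.0]))  # [0, 1, 2, 3, 4, 5]

# Second link is slower: packets sent on it are overtaken.
print(arrival_order(6, [1.0, 3.0]))  # [0, 2, 1, 4, 3, 5]
```

Keeping all packets of one connection on one path avoids this effect, at the cost of limiting any single connection to the speed of one link.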

