Veritas-bu

[Veritas-bu] Throughput problem any ideas?

2005-05-16 09:53:17
Subject: [Veritas-bu] Throughput problem any ideas?
From: pkeating AT bank-banque-canada DOT ca (Paul Keating)
Date: Mon, 16 May 2005 09:53:17 -0400

I am unfamiliar with the setup you are running with the "round robin IP
addressing".

I currently have 2 GigE NICs in my master, and they are configured using
Sun's IP Multipath.
Each card has its own address, and outbound connections are made in a
round-robin fashion: when a backup job starts, it goes through one of
the two NICs. (Not sure of the algorithm used for selecting.) Because
the client sees the request from that IP, it replies to that interface,
so it provides us with some load balancing capability.
It's a built-in option in Solaris 8 and later.
It also provides failover: in the event that one connection goes down,
its IP will be picked up by the other NIC.

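As a rough illustration of why the granularity of the round robin matters,
here is a small Python sketch. It is a toy model, not how IP Multipath or
EtherChannel actually select a link (as noted above, the real selection
algorithm is not known here). The link delays, the function names, and the
job names "A" and "B" are all made up for the example. It contrasts picking
a NIC once per connection with picking a NIC for every packet, which is one
way a packet reordering problem like the one Veritas reported can arise.

```python
from itertools import cycle

# Hypothetical one-way delays per link, in arbitrary time units. They are
# invented so the effect is visible; they do not describe any real NIC.
LATENCY = {0: 1, 1: 4}


def per_packet_round_robin(n_links):
    """Pick the next link for every packet, ignoring which flow it belongs to."""
    links = cycle(range(n_links))
    return lambda flow: next(links)


def per_connection_round_robin(n_links):
    """Pick a link the first time a flow is seen, then stick with it."""
    links = cycle(range(n_links))
    assigned = {}

    def pick(flow):
        if flow not in assigned:
            assigned[flow] = next(links)
        return assigned[flow]

    return pick


def deliver(packets, pick_link, latency=LATENCY):
    """Return (flow, seq) pairs in arrival order.

    Packet i is sent at time i and arrives at i + latency of its link.
    """
    arrivals = []
    for send_time, (flow, seq) in enumerate(packets):
        link = pick_link(flow)
        arrivals.append((send_time + latency[link], flow, seq))
    arrivals.sort()
    return [(flow, seq) for _, flow, seq in arrivals]


def in_order(arrived, flow):
    """True if the packets of one flow arrived in increasing sequence order."""
    seqs = [seq for f, seq in arrived if f == flow]
    return seqs == sorted(seqs)


# One backup job sending six packets.
single = [("A", seq) for seq in range(6)]
# Two backup jobs, interleaved on the wire.
mixed = [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 2), ("B", 2)]

reordered = deliver(single, per_packet_round_robin(2))
ordered = deliver(single, per_connection_round_robin(2))
balanced = deliver(mixed, per_connection_round_robin(2))

print("per-packet, job A in order:    ", in_order(reordered, "A"))
print("per-connection, job A in order:", in_order(ordered, "A"))
print("per-connection, two jobs:      ", balanced)
```

In this toy model the per-packet case delivers job A's packets out of
sequence, while the per-connection case keeps each job in order and still
spreads separate jobs across both links.
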
As for all the jobs starting and not transferring any data, we had an
issue like that. I'd come in in the morning and we'd have a bunch of
jobs hung, and everything after would be queued.

We applied part 2 of this technote, regarding message queues, and the
problem went away.
http://seer.support.veritas.com/docs/268122.htm

Paul

        -----Original Message-----
        From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Hindle,
Greg
        Sent: May 16, 2005 9:41 AM
        To: veritas-bu AT mailman.eng.auburn DOT edu
        Subject: [Veritas-bu] Throughput problem any ideas?
        Importance: High

        Hello all,
        We have been having a problem with our jobs basically hanging.
The failure rate is 50-80%. This came on all of a sudden, and we have
been working on it for almost a week now with many people involved.

        Our setup: we currently back up over 1200 servers each day. We
have 2 data centers, with a media server at each location and 1 master
at one location. I will call them site 1 and site 2. All servers at site
1 back up to site 2, and all servers at site 2 back up to site 1. Both
sites have 2 L700 tape units with about 30+ drives. Our media and master
servers have 2 gig NICs and use round robin IP addressing, meaning the
IP addresses are not tied to a card; rather, they bounce back and forth
in order to maximize throughput. We are using EtherChannel at the one
site that has the master.

        This setup had worked great since January of this year. Then one
night it all stopped: failure rates were 50-80%. The media servers would
connect to the client PCs, then no data would pass, while other servers
worked fine. We struggled and looked at everything to find the cause. No
changes were made to the Veritas network or the data network. Veritas
would not help us because they said we were in an unsupported network
config. We did send them some logs, and they did say we have a packet
reordering problem, and that was the extent of the help.

        So over the weekend we reduced the NICs in our master and media
servers to 1 and removed 1 IP address as well in order to stabilize the
backup network. It worked to a point; however, we have doubled our
backup times.

        I am sending this here in hopes that others can share their
setup, AND to also ask if anyone has a setup that IS approved by Veritas
that can get more than a gig of throughput on the media and master
servers. We want to understand the best way to set up our Solaris 8
master and media servers according to Veritas.

        Greg Hindle

