
Re: Antw: RE: [nv-l] Stress Testing NV, looking for opinions

From: James Shanks <jshanks AT us.ibm DOT com>
To: nv-l AT lists.us.ibm DOT com
Date: Fri, 4 Jun 2004 10:14:48 -0400

They've been posted on the nv-l list many times.
Try searching the archives at http://lists.skills-1st.co.uk/mharc/html/nv-l/

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group



"Georg Gangl" <Georg.Gangl AT brz.gv DOT at>
Sent by: owner-nv-l AT lists.us.ibm DOT com
06/04/2004 09:45 AM
Please respond to: nv-l
To: <nv-l AT lists.us.ibm DOT com>
cc:
Subject: Antw: RE: [nv-l] Stress Testing NV, looking for opinions

Please tell me where I can find the Performance Guidelines mentioned
below.

Many thanks,

George Gangl
BRZ Network Systems
Vienna

>>> jshanks AT us.ibm DOT com 00:32:25 Freitag, 4. Juni 2004 >>>

At the risk of starting a firestorm, I feel I must respond to some of
Scott's questions and issues.

Scott, I just want to prepare you for what you may find.

What you may find is that despite the speed of your processor(s), you
are up against both system and old design limitations, which are not
easily remedied, rather than proof of some bug in nvcold.

Well, perhaps you'll find that, yes. But perhaps also the end result
may be that you will simply find the upper limit of what NetView event
processing can handle, given the way it is written today and the
amount of work that can be done on your box in that period of time.
As far as I know there are no benchmarks for nvcold performance, and I
know there are none for nvcorrd performance either. So with this
course of action you may be the one determining those benchmarks.

You are correct that socket stats and performance are tied together,
but perhaps not in the way that you think. Those states may not
represent errors at all. Sockets left in states like TIME_WAIT,
CLOSE_WAIT, and FIN_WAIT_2 are the result of heavy usage and limited
operating system resources. These states occur at the end of the
communications cycle, when at least one end of the communications pipe
has been closed, and some systems can be tuned to reduce the amount of
time a socket spends in them, though I am not enough of an OS guy to
tell you exactly what they mean nor how to tune to reduce them. But
periodically the OS checks all open sockets and changes the states so
that the ones that should be closed go to "CLOSED" over time. So if
you are using nvcold heavily, that's just what I would expect to see
if he's opening and closing a lot of sockets. And he would be doing
just that if you have a lot of traps running through rulesets with
Query Smartset nodes in them.
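If you want a quick tally of how many sockets sit in each of those
states, you can count the last column of netstat output. This is a
generic sketch, not NetView-specific; the sample lines below are
hypothetical, standing in for a live `netstat -an` feed:

```shell
#!/bin/sh
# Count sockets by TCP state. On a live system you would feed this
# from `netstat -an`; the sample lines here are made up, just to show
# the shape of the data being parsed (state is the last field).
sample='tcp4  0  0  10.0.0.1.162  10.0.0.2.4501  TIME_WAIT
tcp4  0  0  10.0.0.1.162  10.0.0.3.4502  TIME_WAIT
tcp4  0  0  10.0.0.1.165  10.0.0.4.4503  CLOSE_WAIT'
out=$(printf '%s\n' "$sample" |
    awk '{ count[$NF]++ } END { for (s in count) print s, count[s] }')
echo "$out"
```

On a real box, `netstat -an | awk '/^tcp/ { count[$NF]++ } END { for
(s in count) print s, count[s] }'` gives the live picture; exact
netstat flags and column layout vary a little by platform.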

Every Query Smartset in a ruleset is just that: a new call to nvcold
by nvcorrd. For each new call, nvcold must then query the object
database to determine which smartsets a particular node is in, and
return all those in a list. So performance is going to be determined
by both the size of the database and the number of smartsets to be
included. I'm not savvy about the internals of nvcold, but that's real
work, and I suspect all this means sockets to be opened and closed
between him and nvcorrd as well as between him and ovwdb. So for some
trap rates, no matter how fast your box is, it may not be fast enough
to keep up with the demand being placed on the NetView daemons by your
automation. Let's remember that nvcold, like all the other NetView for
UNIX daemons except the new Java ones, is single-threaded. That's one
operation at a time. So if every trap goes through a Query Smartset,
it is easy to see how you could overwhelm the available resources at
some point. The same would be true, of course, if they were
multi-threaded; it would just take longer. But that's one of the
reasons why you want to use calls outside of nvcorrd, like Query
Smartset, Query Database, and Query MIB, sparingly when you write a
ruleset, as the performance guidelines I posted some time ago
emphasize.
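The cost argument can be illustrated generically: when a per-node
answer (such as the list of smartsets a node belongs to) changes
rarely, caching it means one out-of-process query per node rather than
one per trap. This is a hypothetical sketch of the idea, not NetView
code; every name in it is made up:

```shell
#!/bin/sh
# Hypothetical memoization sketch: expensive_lookup stands in for an
# out-of-process query (e.g. asking which smartsets a node is in).
CACHE_DIR=$(mktemp -d)
CALLS_FILE="$CACHE_DIR/.calls"
: > "$CALLS_FILE"

expensive_lookup() {
    echo x >> "$CALLS_FILE"            # count how often we really query
    echo "smartsets-for-$1"
}

cached_lookup() {
    f="$CACHE_DIR/$1"
    [ -f "$f" ] || expensive_lookup "$1" > "$f"   # query once per node
    cat "$f"
}

cached_lookup router1     # first trap from router1: real query
cached_lookup router1     # later traps: served from the cache
```

The trade-off is staleness: a cached answer does not notice when a
node moves in or out of a smartset, so a cache like this only makes
sense when that churn is slow compared to the trap rate.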

As for MLM and trap storms, most of those we see are indeed
repetitive. In the seven years I have looked at customer logs and
traces, they usually come from the same devices over and over again.
They usually come from routers which are overworked and not well
configured, and in many NetView environments the NetView folks have no
control over either one of those things. But they can configure MLM to
do thresholding. That's not breaking your automation but protecting
it: firing only for the first of every ten identical traps rather than
for every one is fine, provided that you know that behind the end
result there could be nine more identical triggers. So MLM is not a
panacea, and it does require that you analyze storms which have
already happened in order to be effective. But what other choice is
there? Without MLM thresholding, trapd will just queue the traps until
he runs out of storage to hold them; but assuming that doesn't happen,
he'll start processing them like mad when the storm stops, and simply
pass the bottleneck along to the connected applications. What will
they do? nvcorrd, nvserverd, and actionsvr will then begin processing
like mad themselves, but probably not fast enough to stay current.
Your one-trap-at-a-time automation may still work, but it'll be so
slow that it might as well not work. Your pop-ups or pages or whatever
will be many minutes if not hours behind. What good is being hours
behind in processing traps?
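The "first of every N identical traps" behavior itself is easy to
sketch. This is just the idea, not MLM's actual configuration; the
record format (source and trap id per line) and the threshold value
are assumptions for illustration:

```shell
#!/bin/sh
# Sketch of trap-storm thresholding: forward only the 1st, (N+1)th,
# (2N+1)th ... occurrence of each (source, trap-id) pair. The sample
# input is hypothetical; a real feed would be a stream of trap records.
threshold_filter() {
    awk -v N="$1" '{ key = $1 ":" $2
                     if (seen[key] % N == 0) print   # pass 1st of every N
                     seen[key]++ }'
}

traps='router1 linkDown
router1 linkDown
router1 linkDown
router2 linkDown'
out=$(printf '%s\n' "$traps" | threshold_filter 3)
echo "$out"
```

With a threshold of 3, the three identical router1 traps collapse to
one forwarded trap, while the single router2 trap passes through
untouched, which is exactly the "protecting, not breaking" trade-off
described above.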

I'm afraid I don't see any alternatives. For every system there are
limits, and limits imply trade-offs, and trade-offs imply that you
have to find a way to live with what you have. That's the fundamental
law of system performance. If you cannot find a way to produce more
resources to handle the load when it occurs, then you have to reduce
the load. And that's what MLM does. Even if we multi-threaded trapd to
take over the thresholding job, at some point he too would have to
make a decision about what to do when the load was too high. And I'll
bet the decision would be to stop processing duplicate traps in order
to protect every process that comes down the line afterward.

In short, I think what you want to test is a good idea.  Just don't be
surprised if you don't find broken code at the end of it, but rather
system and design limitations.

Want a script to test with? Here's one of mine which uses snmptrap and
sends any number of simulated Cisco LinkDown traps, with the variable
content modified so that for any given one, I can tell where it falls
in the batch sent. I call it "EventBlast" and you invoke it like this:
       EventBlast <number of traps to send> <target NetView>


#!/bin/ksh
# EventBlast - send simulated Cisco LinkDown traps to a NetView target.
# Usage: EventBlast <number of traps to send> <target NetView>
max=$1
target=$2
src=""                    # agent address; empty lets snmptrap default it
event=ciscoLinkDown
# set -x                  # uncomment to trace execution
let count=0
while (($count < $max)) ; do
    /usr/OV/bin/snmptrap $target .1.3.6.1.4.1.9 \
        $src 2 0 1 \
        .1.3.0  Integer          $count \
        .1.4.0  OctetStringascii "`date`" \
        .1.5.0  Integer          $max \
        .1.6.0  OctetStringascii "blast test mode"
    # sleep 1             # uncomment to throttle to one trap per second
    let count=$count+1
    echo "sent $event EventBlast$count to $target"
done


Of course, you can modify this to send any other trap, with any other
variables you need to test your rulesets.

I sincerely hope this helps.

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group


