RE: [nv-l] Stress Testing NV, looking for opinions
2004-06-03 18:47:13
At the risk of starting a firestorm,
I feel I must respond to some of Scott's questions and issues.
Scott, I just want to prepare
you for what you may find.
What you may find is that, despite the
speed of your processor(s), you are up against both system and old design
limitations, which are not easily remedied, rather than proof of some bug
in nvcold.
Well, perhaps, you'll find that,
yes. But perhaps also the end result may be that you will simply
find the upper limit of what NetView event processing can handle, given
the way it is written today and the amount of work that can be done on
your box in that period of time. As far as I know there are no benchmarks
for nvcold performance. And I know there are none for nvcorrd performance
either. So with this course of action you may be the one determining
those benchmarks.
You are correct that socket states and
performance are tied together, but perhaps not in the way that you think.
Those states may not represent errors at all. Sockets left
in states like TIME_WAIT, CLOSE_WAIT
and FIN_WAIT_2 are the result of heavy
usage and limited operating system resources. Some systems can be tuned to
reduce the amount of time spent in these states, which occur at the end
of the communications cycle, when at least one end of the communications
pipe has been closed, though I am not enough of an OS guy to tell
you exactly what they mean nor how to tune to reduce them. But periodically
the OS checks all open sockets and changes the states so that the ones
that should be closed go to "CLOSED" over time. So if
you are using nvcold heavily, that's just what I would expect to see if
he's opening and closing a lot of sockets. And he would be doing
just that if you have a lot of traps running through rulesets with Query
Smartset nodes in them.
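
To see how those socket states are distributed on a busy box, something like the following may help. It is only a sketch: the function name count_socket_states is made up for illustration, and it assumes that your netstat prints TCP lines beginning with "tcp" and ending with the state name, which is common but not universal.

```sh
# Tally TCP sockets by state from netstat-style output on stdin.
# Assumption: TCP lines start with "tcp" and the state is the last field.
count_socket_states() {
    awk '$1 ~ /^tcp/ { n[$NF]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

You would feed it with something like "netstat -an | count_socket_states" and run it a few times during a test. A large and steady TIME_WAIT count under load would be consistent with heavy open/close activity rather than with an error.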
Every Query Smartset in a ruleset
is just that, a new call to nvcold by nvcorrd. For each new call,
nvcold must then query the object database to determine what smartsets
a particular node is in, and return all those in a list. So performance
is going to be determined by both the size of the database and the
number of smartsets to be included. I'm not savvy about the internals
of nvcold, but that's real work, and I suspect all this means sockets to
be opened and closed between him and nvcorrd as well as between him and
ovwdb. So for some trap rates, no matter how fast your box is, it
may not be fast enough to keep up with the demand being placed on the NetView
daemons by your automation. Let's remember that nvcold, like all
the other NetView for UNIX daemons, except the new java ones, is
single-threaded. That's one operation at a time. So if every
trap goes through a Query Smartset, it is easy to see how you could overwhelm
the available resources at some point. The same would be true, of course,
if the daemons were multi-threaded. It would just take longer to get there.
But that's one of the reasons why you want to make calls outside
of nvcorrd, like Query Smartset and Query Database, and Query MIB, sparingly
when you write a ruleset, as the performance guidelines I posted some time
ago emphasize.
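
The arithmetic behind that ceiling can be sketched as follows. This is back-of-the-envelope only: max_trap_rate is a hypothetical helper, and the milliseconds-per-query figure is something you would have to measure on your own system, since no published benchmark exists. It assumes a single-threaded daemon that handles one external query at a time.

```sh
# Rough upper bound on traps/second for a single-threaded daemon.
#   $1 = assumed milliseconds spent per external query (measure this yourself)
#   $2 = number of external queries each trap triggers in the ruleset
max_trap_rate() {
    awk -v ms="$1" -v q="$2" 'BEGIN { printf "%.1f\n", 1000 / (ms * q) }'
}
```

For example, "max_trap_rate 20 1" prints 50.0, and "max_trap_rate 20 3" prints 16.7. Adding query nodes to a ruleset divides the ceiling, no matter how fast the processor is.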
As for MLM and trap storms, most of
those we see are indeed repetitive. In the seven years I have looked
at customer logs and traces, they usually come from the same devices over
and over again. They usually come from routers which are overworked
and not well-configured, and in many NetView environments, the NetView
folks have no control over either one of those things. But they can
configure MLM to do thresholding. That's not breaking your automation
but protecting it: the automation fires only for the first of every ten
identical traps rather than for every one, provided you understand, when you
get the end result, that there could be nine more identical triggers behind it.
So MLM is not a panacea, and it does require that you analyze storms
which have already happened in order to be effective. But what other
choice is there? Without MLM thresholding, trapd will just queue
the traps until he runs out of storage to hold them; but assuming that
doesn't happen, he'll start processing them like mad when the storm stops,
and simply pass the bottleneck along to the connected applications. What
will they do? nvcorrd, nvserverd, and actionsvr will then begin processing
like mad themselves, but probably not fast enough to stay current.
Your one-trap-at-a-time automation may still work but it'll be so slow
that it might as well not work. Your pop-ups or pages or whatever
will be many minutes if not hours behind. What good is being hours
behind in processing traps?
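
To put rough numbers on "minutes if not hours behind", here is a small sketch. The helper trap_backlog_delay and all of the rates are illustrative assumptions, not measurements, and it assumes no new traps arrive once the storm ends.

```sh
# Estimate the queue left behind by a trap storm and the time to drain it.
#   $1 = arrival rate during the storm (traps/sec, assumed)
#   $2 = sustained processing rate (traps/sec, assumed)
#   $3 = storm duration (sec)
trap_backlog_delay() {
    awk -v a="$1" -v s="$2" -v d="$3" 'BEGIN {
        backlog = (a - s) * d
        if (backlog < 0) backlog = 0      # the daemons kept up; nothing queued
        printf "backlog=%d traps, drain=%.0f sec\n", backlog, backlog / s
    }'
}
```

For example, "trap_backlog_delay 500 50 60" prints "backlog=27000 traps, drain=540 sec": a one-minute storm leaves the automation nine minutes behind, before counting any work the downstream daemons add.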
I'm afraid I don't see any alternatives.
For every system there are limits, and limits imply trade-offs, and
trade-offs imply that you have to find a way to live with what you have.
That's the fundamental law of system performance. If you
cannot find a way to produce more resources to handle the load when it
occurs, then you have to reduce the load. And that's what MLM does.
Even if we multi-threaded trapd to take over the thresholding job,
at some point he too would have to make a decision about what to do when
the load was too high. And I'll bet the decision would be to stop
processing duplicate traps in order to protect every process that
comes down the line afterward.
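
The counting idea behind that kind of thresholding can be sketched in a few lines. This is not MLM configuration and not how MLM is implemented; threshold_first_of_n is a hypothetical filter that treats each input line as one trap and passes only the first of every N identical lines.

```sh
# Pass the 1st, (N+1)th, (2N+1)th ... occurrence of each identical line.
#   $1 = N, the threshold (defaults to 10)
threshold_first_of_n() {
    awk -v n="${1:-10}" '{
        c[$0]++
        if (n == 1 || c[$0] % n == 1) print
    }'
}
```

Piping 25 identical lines through "threshold_first_of_n 10" lets 3 of them through, so everything downstream sees roughly a tenth of the load, with the understanding that each line that gets through may stand for up to nine more.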
In short, I think what you want to test
is a good idea. Just don't be surprised if you don't find broken
code at the end of it, but rather system and design limitations.
Want a script to test with? Here's
one of mine which uses snmptrap, and sends any number of simulated Cisco
LinkDown traps with the variable content modified so that for any given
one, I can tell where it falls in the batch sent. I call it "EventBlast"
and you invoke it like this,
    EventBlast <number of traps to send> <target NetView>
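#!/bin/ksh
# EventBlast: send a numbered batch of simulated Cisco LinkDown traps
max=$1          # number of traps to send
target=$2       # NetView host that receives the traps
src=""          # agent address; left empty here, quoted below so it is still passed
event=ciscoLinkDown
# set -x
let count=0
while (($count < $max)) ; do
    # varbinds carry the sequence number, a timestamp, the batch size and a tag
    /usr/OV/bin/snmptrap $target .1.3.6.1.4.1.9 \
        "$src" 2 0 1 \
        .1.3.0 Integer $count \
        .1.4.0 OctetStringascii "`date`" \
        .1.5.0 Integer $max \
        .1.6.0 OctetStringascii "blast test mode"
    # sleep 1
    let count=$count+1
    echo "sent $event EventBlast$count to $target"
done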
Of course, you can modify this to send
any other trap, with any other variables you need to test your rulesets.
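
Because each trap carries its sequence number and the batch size, you can check afterward for gaps. A sketch, assuming you have already pulled the received sequence numbers out of your log into one number per line (how you extract them depends on your log format and is not shown); find_missing_traps is a made-up name:

```sh
# Print the sequence numbers in 0..max-1 that never showed up on stdin.
#   $1 = max, the batch size passed to EventBlast
find_missing_traps() {
    awk -v max="$1" '{ seen[$1] = 1 }
        END { for (i = 0; i < max; i++) if (!(i in seen)) print i }'
}
```

No output means every trap in the batch arrived. Any numbers printed tell you where in the batch the losses occurred.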
I sincerely hope this helps.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group