

2004-06-04 09:25:41
Subject: RE: [nv-l] Stress Testing NV, looking for opinions
From: "Barr, Scott" <Scott_Barr AT csgsystems DOT com>
To: <nv-l AT lists.us.ibm DOT com>
Date: Fri, 4 Jun 2004 08:10:38 -0500
James - GREAT information. Let me digest. I don't disagree with anything here, but I want to reread for detail after I have had some coffee. I think most of us looking at this type of situation agree that it's not a "bug" but an under-performance issue. Essentially, if all the other components of NetView can handle 8-20 traps a second, then nvcold has to be able to perform somewhere near that level. It shouldn't be an arbitrary governor. Only testing, tracing, and benchmarking will really answer the questions, so let me finish my research and I'll post the results here.
 
I feel a lot of the problem has to do with the fact that in 1986, or whenever NetView first came out, we just didn't have networks of the size we have today, and we certainly didn't have the connection speeds we have today. Management functions on devices were very limited; now they are very robust (some might say TOO robust). The combination of all these factors means that some basic inherent architectures may be stretched to their limits - i.e., it may be time to re-invent the wheel. Even Microsoft recognized that.
 
I understand the implications of all this. No firestorm here. If it turns out IBM has some tough coding work to do, then so be it. If they choose not to address it, the marketplace will respond accordingly. I think we all know that some of the "legacy" aspects of NetView on AIX/UNIX leave something to be desired. We'll see where it goes. Meanwhile, as a group, let's continue to try and iron out all of the possible refinements that can be made in terms of O/S tuning, NetView configuration, and network design.
 
I still think nvcold has a performance issue that keeps its capability out of step with the rest of the NetView components.


From: owner-nv-l AT lists.us.ibm DOT com [mailto:owner-nv-l AT lists.us.ibm DOT com] On Behalf Of James Shanks
Sent: Thursday, June 03, 2004 5:32 PM
To: nv-l AT lists.us.ibm DOT com
Subject: RE: [nv-l] Stress Testing NV, looking for opinions


At the risk of starting a firestorm, I feel I must respond to some of Scott's questions and issues.

Scott,  I just want to prepare you for what you may find.  

What you may find is that despite the speed of your processor(s), you are up against both system and old design limitations, which are not easily remedied, rather than proof of some bug in nvcold.

 Well, perhaps, you'll find that, yes.  But perhaps also the end result may be that you will simply find the upper limit of what NetView event processing can handle, given the way it is written today and the amount of work that can be done on your box in that period of time.  As far as I know there are no benchmarks for nvcold performance.   And I know there are none for nvcorrd performance either.  So with this course of action you may be the one determining those benchmarks.

You are correct that socket stats and performance are tied together, but perhaps not in the way that you think.  Those states may not represent errors at all.  Sockets left in states like TIME_WAIT, CLOSE_WAIT, and FIN_WAIT_2 are the result of heavy usage, not broken code.  These states occur at the end of the communications cycle, when at least one end of the communications pipe has been closed, and some systems can be tuned to reduce the amount of time a socket spends in them, though I am not enough of an OS guy to tell you exactly what each one means nor how to tune for it.  But periodically the OS checks all open sockets and changes the states so that the ones that should be closed go to "CLOSED" over time.   So if you are using nvcold heavily, that's just what I would expect to see if he's opening and closing a lot of sockets.   And he would be doing just that if you have a lot of traps running through rulesets with Query Smartset nodes in them.
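If you want to see how those states are actually distributed on your box, a one-line awk tally over netstat output will do it.  This is a generic sketch, assuming "netstat -an" output where the TCP state is the last field of each tcp line, as on AIX and most UNIX systems:

```shell
#!/bin/sh
# Tally TCP sockets by state (TIME_WAIT, CLOSE_WAIT, etc.).
# Feed it netstat output, e.g.:
#     netstat -an | tally_states
# Assumes the state is the last field of each tcp line.
tally_states() {
    awk '/tcp/ { counts[$NF]++ } END { for (s in counts) print s, counts[s] }'
}
```

Run that during your stress test and you can watch the TIME_WAIT count climb and drain as nvcold opens and closes sockets.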

Every Query Smartset in a ruleset is just that, a new call to nvcold by nvcorrd.  For each new call, nvcold must query the object database to determine what smartsets a particular node is in, and return all of those in a list.  So performance is going to be determined by both the size of the database and the number of smartsets to be included.  I'm not savvy about the internals of nvcold, but that's real work, and I suspect all this means sockets being opened and closed between him and nvcorrd as well as between him and ovwdb.  So for some trap rates, no matter how fast your box is, it may not be fast enough to keep up with the demand being placed on the NetView daemons by your automation.   Let's remember that nvcold, like all the other NetView for UNIX daemons except the new Java ones, is single-threaded.  That's one operation at a time.  So if every trap goes through a Query Smartset, it is easy to see how you could overwhelm the available resources at some point.  The same would be true, of course, if they were multi-threaded; it would just take longer.  But that's one of the reasons why you want to use calls outside of nvcorrd, like Query Smartset, Query Database, and Query MIB, sparingly when you write a ruleset, as the performance guidelines I posted some time ago emphasize.
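To put a rough number on that single-threaded ceiling, here is the arithmetic in shell.  The 50 ms per-query cost is a made-up assumption for illustration, not a measured nvcold figure; substitute whatever your tracing shows:

```shell
#!/bin/ksh
# Back-of-envelope ceiling for a single-threaded daemon.
# query_ms is an ASSUMED per-query cost, not a measured nvcold number.
query_ms=50
echo "at ${query_ms} ms per Query Smartset, ceiling is $((1000 / query_ms)) traps/sec"
```

At an assumed 50 ms per query, the ceiling is 20 traps/sec; halve the cost and you double the ceiling, which is why keeping Query Smartset nodes out of the per-trap path matters so much.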

As for MLM and trap storms, most of those we see are indeed repetitive.  In the seven years I have looked at customer logs and traces, they usually come from the same devices over and over again.  They usually come from routers which are overworked and not well-configured, and in many NetView environments, the NetView folks have no control over either one of those things.  But they can configure MLM to do thresholding.  That's not breaking your automation but protecting it: firing it only for the first of every ten identical traps rather than for every one still works, provided that you know that behind any given trigger there could be nine more identical ones.  So MLM is not a panacea, and it does require that you analyze storms which have already happened in order to be effective.  But what other choice is there?   Without MLM thresholding, trapd will just queue the traps until he runs out of storage to hold them; and assuming that doesn't happen, he'll start processing them like mad when the storm stops, and simply pass the bottleneck along to the connected applications.  What will they do? nvcorrd, nvserverd, and actionsvr will then begin processing like mad themselves, but probably not fast enough to stay current.   Your one-trap-at-a-time automation may still work, but it'll be so slow that it might as well not work.  Your pop-ups or pages or whatever will be many minutes if not hours behind.  What good is being hours behind in processing traps?
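The "first of every ten identical traps" policy can be sketched in a few lines of shell.  This is only an illustration of the thresholding idea, not MLM's actual implementation:

```shell
#!/bin/sh
# Forward only the first of every N identical traps; swallow the rest.
# N=10 mirrors the "first of every ten" example above.
N=10
seen=0
process_trap() {
    if [ $((seen % N)) -eq 0 ]; then
        echo "forward trap $((seen + 1))"
    fi
    seen=$((seen + 1))
}
```

Feed 20 identical traps through it and only traps 1 and 11 are forwarded; the other 18 are absorbed, which is exactly the load reduction being described.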

I'm afraid I don't see any alternatives.  For every system there are limits, and limits imply trade-offs, and trade-offs imply that you have to find a way to live with what you have.   That's the fundamental law of system performance.   If you cannot find a way to produce more resources to handle the load when it occurs, then you have to reduce the load.  And that's what MLM does.  Even if we multi-threaded trapd to take over the thresholding job, at some point he too would have to make a decision about what to do when the load was too high.  And I'll bet the decision would be to  stop processing duplicate traps in order  to protect every process that comes down the line afterward.

In short, I think what you want to test is a good idea.  Just don't be surprised if you don't find broken code at the end of it, but rather system and design limitations.

Want a script to test with?  Here's one of mine, which uses snmptrap to send any number of simulated Cisco LinkDown traps, with the variable content modified so that for any given one I can tell where it falls in the batch sent.  I call it "EventBlast" and you invoke it like this:
        EventBlast  <number of traps to send> <target NetView>

#!/bin/ksh
# EventBlast - send <max> simulated Cisco LinkDown traps to <target>
max=$1
target=$2
src=""                                  # agent address argument (left empty here)
event=ciscoLinkDown
# set -x
let count=0
while (($count < $max)) ; do
    /usr/OV/bin/snmptrap $target .1.3.6.1.4.1.9 \
        $src 2 0 1 \
        .1.3.0 Integer $count \
        .1.4.0 OctetStringascii "`date`" \
        .1.5.0 Integer $max \
        .1.6.0 OctetStringascii "blast test mode"
    # sleep 1                           # uncomment to pace the traps
    let count=$count+1
    echo "sent $event EventBlast$count to $target"
done



Of course, you can modify this to send any other trap, with any other variables you need to test your rulesets.

I sincerely hope this helps.

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
