RE: [nv-l] Stress Testing NV, looking for opinions
2004-06-03 13:27:48
You and I are on exactly the same page. I have a PMR opened
for exactly the same situation.
In our situation, we have an automation testing routine
(basically a cron job that submits a trap, which drives a ruleset that touches a file).
If the file has not been touched within 30 seconds of the time the trap was
submitted, we declare that automation is taking too long, and we stop and
restart nvcorrd. On occasion, we see an indication that nothing can talk to
nvcold - this sounds a lot like what you are seeing - non-responsive nvcold
behavior.
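A minimal sketch of that watchdog, assuming hypothetical names throughout: the marker file, the snmptrap invocation, and the ovstop/ovstart restart are placeholders for whatever your ruleset and environment actually use, not our exact script.

```shell
#!/bin/sh
# Sketch of the automation watchdog described above. The marker file,
# trap command, OID, and restart commands are hypothetical placeholders.
MARKER=${MARKER:-/tmp/nv_automation_marker}
SEND_TRAP=${SEND_TRAP:-/usr/OV/bin/snmptrap}   # command that submits the test trap
TIMEOUT=30                                     # seconds before we declare trouble

# wait_for_file FILE SECONDS: succeed as soon as FILE appears, fail after SECONDS
wait_for_file() {
    n=0
    while [ "$n" -lt "$2" ]; do
        [ -f "$1" ] && return 0
        sleep 1
        n=$((n + 1))
    done
    [ -f "$1" ]
}

watchdog_run() {
    rm -f "$MARKER"
    # Submit a trap that the ruleset is configured to answer by touching $MARKER
    "$SEND_TRAP" localhost .1.3.6.1.4.1.2.6.3.1 localhost 6 1 0
    if ! wait_for_file "$MARKER" "$TIMEOUT"; then
        echo "automation took >${TIMEOUT}s; restarting nvcorrd"
        /usr/OV/bin/ovstop nvcorrd && /usr/OV/bin/ovstart nvcorrd
    fi
}
```

Run watchdog_run from cron; if the ruleset never touches the marker within the timeout, the daemon gets bounced.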
I know for a fact that our automation runs around 40-60
traps a minute. Rates beyond that may exist, but I haven't measured. In one
recent situation, a misbehaving trap agent (Oracle 9i Intelligent Agent) began
spewing malformed traps. NetView automation hung in there even though trapd was
being hit with 227 traps a second. The traps were malformed, and the enterprise
ID was not present, so automation was not invoked. Normal traps flowed through
the system during this incident, albeit a little slowly.
And IBM, if you are listening, MLM will do me no darn good
for trap storm protection. The basic problem is that the need to predefine
filter criteria essentially means that I must experience a trap storm from a
device once, then put a filter in, and then if the same trap storm occurs again,
the filter will choke it. Advocates of MLM will point out that I can configure
it based on host address (or even *.*.*.*), but that in essence shuts NetView
automation down (as MLM won't forward traps beyond the threshold rate), so it
serves as no valid protection (it breaks the automation just as if I had
sent the traps through).
I strongly believe that nvcold has a problem - even with
the test fix I received for the memory leak someone else mentioned in the
forum.
One other suggestion I had: while you are cranking up your
trap rate, if you are using query smartset, take a snapshot of netstat -a and
see if you see a ton of TIME_WAIT, CLOSE_WAIT, and FIN_WAIT_2 sockets. It seems
to me that this is somehow related; there seem to be a ton of nvcold sockets in
use. I do not believe the trap rate itself is the problem; I think it is the
number of simultaneous query smartset operations.
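One quick way to take that snapshot is a small tally over the netstat output, assuming the connection state sits in the last column (as it does on AIX and Linux netstat):

```shell
#!/bin/sh
# count_wait_states: read `netstat -a` output on stdin and print a count
# of sockets sitting in TIME_WAIT, CLOSE_WAIT, or FIN_WAIT_2 (assumes the
# state is the last field of each line, as on AIX/Linux netstat).
count_wait_states() {
    awk '$NF == "TIME_WAIT" || $NF == "CLOSE_WAIT" || $NF == "FIN_WAIT_2" { n[$NF]++ }
         END { for (s in n) print s, n[s] }'
}
```

Run `netstat -a | count_wait_states` while the trap load is going; a large and growing count during query smartset activity would support the theory above.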
I'll be interested in any further results you are willing
to share.
I got the base script from IBM support and have no problem sharing if someone
from IBM weighs in with no objections. We are running these traps through
TEC_ITS.rls, so nvcorrd, etc. should be getting exercised. I would like
to put a mix of traps in as well, but am not a developer so I'm making do
right now.
Funny you mention Query Smartset node; we are pretty sure this was
the major source of our trouble. Ours happened to be there for no good reason,
so we removed it and cycled the daemons. In addition, we did minor things like
configure trapd to save logs for a week, and implemented a weekly
ovmapcount/ovtopofix process. NV has been smooth ever since. Until then, NV had been hanging at least
once/week, and we were thinking NV was choking on the number of traps, which
we now believe to be bunk based on testing. MLM was considered to be the
solution until we learned our addressing scheme was not compatible. That's
when we opened a support call--been at this for about a month
now.
We're also ready to up the number of traps to see where NV falls over.
When this started, we got information from support that NV could handle
sustained 6-8 traps/second. I've got the email somewhere... It appears
that number is conservative.
One other thing - the use of smartsets and rulesets
heavily affects performance. It would be beneficial if your testing included
a variety of traps, not the same one over and over. In addition, pushing
them through rulesets, if possible, would be a really good stress test,
especially if you have rulesets doing a "query smartset"
node.
Would you be willing to share your script that
generates the traps? I am interested in doing the same
thing.
That sounds like a valid way to test, but I'm thinking
you may want to throw in some more randomness, maybe
some heavier peaks. Sounds like the 250 in 50 secs are
dealt with OK in their 10-minute window, but what happens
with a burst of 1000 thrown into the mix?
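Since the actual script came from IBM support and isn't reproduced here, the following is only a sketch of how such a generator might look: the snmptrap path, target, enterprise OID, and generic/specific trap values are all illustrative placeholders to swap for your own.

```shell
#!/bin/sh
# Sketch of a trap-load generator with steady-rate and burst modes.
# SNMPTRAP, TARGET, and the OID below are hypothetical; substitute your own.
SNMPTRAP=${SNMPTRAP:-/usr/OV/bin/snmptrap}
TARGET=${TARGET:-localhost}
OID=.1.3.6.1.4.1.2.6.3.1       # placeholder enterprise OID

# send_burst COUNT: fire COUNT traps back to back
send_burst() {
    i=0
    while [ "$i" -lt "$1" ]; do
        "$SNMPTRAP" "$TARGET" "$OID" "$TARGET" 6 1 0
        i=$((i + 1))
    done
}

# steady_rate PER_SEC SECONDS: approximate a sustained trap rate
steady_rate() {
    t=0
    while [ "$t" -lt "$2" ]; do
        send_burst "$1"
        sleep 1
        t=$((t + 1))
    done
}
```

Something like `steady_rate 5 50` approximates the 250-traps-in-50-seconds baseline, with an occasional `send_burst 1000` for the heavier peak; pointing SNMPTRAP at echo first lets you dry-run the loop without flooding anything.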
Regards,
Brett
Tivoli Software/IBM