RE: [nv-l] Stress Testing NV, looking for opinions
2004-06-03 13:27:48
You and I are on exactly the same page. I have a PMR opened
for exactly the same situation.
In our situation, we have an automation testing routine
(basically a cron job that submits a trap, which drives a ruleset that touches a file).
If the file has not been touched within 30 seconds of the time the trap was
submitted, we declare that automation is taking too long, and we stop and
restart nvcorrd. On occasion, we see an indication that nothing can talk to
nvcold - this sounds a lot like what you are seeing - non-responsive nvcold
behavior.
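A minimal sketch of that watchdog, assuming hypothetical names throughout: the marker file, the snmptrap invocation, and the ovstop/ovstart restart are placeholders for whatever your ruleset and environment actually use, not our exact script.

```shell
#!/bin/sh
# Sketch of the automation watchdog described above. The marker file,
# trap command, OID, and restart commands are hypothetical placeholders.
MARKER=${MARKER:-/tmp/nv_automation_marker}
SEND_TRAP=${SEND_TRAP:-/usr/OV/bin/snmptrap}   # command that submits the test trap
TIMEOUT=30                                     # seconds before we declare trouble

# wait_for_file FILE SECONDS: succeed as soon as FILE appears, fail after SECONDS
wait_for_file() {
    n=0
    while [ "$n" -lt "$2" ]; do
        [ -f "$1" ] && return 0
        sleep 1
        n=$((n + 1))
    done
    [ -f "$1" ]
}

watchdog_run() {
    rm -f "$MARKER"
    # Submit a trap that the ruleset is configured to answer by touching $MARKER
    "$SEND_TRAP" localhost .1.3.6.1.4.1.2.6.3.1 localhost 6 1 0
    if ! wait_for_file "$MARKER" "$TIMEOUT"; then
        echo "automation took >${TIMEOUT}s; restarting nvcorrd"
        /usr/OV/bin/ovstop nvcorrd && /usr/OV/bin/ovstart nvcorrd
    fi
}
```

Run watchdog_run from cron; if the ruleset never touches the marker within the timeout, the daemon gets bounced.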
I know for a fact that our automation runs around 40-60
traps a minute. Rates beyond that may exist, but I haven't measured. In one
recent situation, a misbehaving trap agent (Oracle 9i Intelligent Agent) began
spewing malformed traps. NetView automation hung in there even though trapd was
being hit with 227 traps a second. The traps were malformed, and the enterprise
ID was not present, so automation was not invoked. Normal traps flowed through
the system during this incident, albeit a little slowly.
And IBM, if you are listening, MLM will do me no darn good
for trap storm protection. The basic problem is that the need to predefine
filter criteria essentially means that I must experience a trap storm from a
device once, then put a filter in, and then if the same trap storm occurs again,
the filter will choke it. Advocates of MLM will point out that I can configure
it based on host address (or even *.*.*.*), but that in essence shuts NetView
automation down (as MLM won't forward traps beyond the threshold rate), so it
serves as no valid protection (it breaks the automation just as if I had
sent the traps through).
I strongly believe that nvcold has a problem - even with
the test fix I received for the memory leak someone else mentioned in the
forum.
One other suggestion I had: while you are cranking up your
trap rate, if you are using query smartset, take a snapshot of netstat -a and
see if you see a ton of TIME_WAIT, CLOSE_WAIT, and FIN_WAIT_2 sockets. It seems
to me that this is somehow related; there seem to be a ton of nvcold sockets in
use. I do not believe the trap rate itself is the problem; I think it is the
number of simultaneous query smartset operations.
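One quick way to take that snapshot is a small tally over the netstat output, assuming the connection state sits in the last column (as it does on AIX and Linux netstat):

```shell
#!/bin/sh
# count_wait_states: read `netstat -a` output on stdin and print a count
# of sockets sitting in TIME_WAIT, CLOSE_WAIT, or FIN_WAIT_2 (assumes the
# state is the last field of each line, as on AIX/Linux netstat).
count_wait_states() {
    awk '$NF == "TIME_WAIT" || $NF == "CLOSE_WAIT" || $NF == "FIN_WAIT_2" { n[$NF]++ }
         END { for (s in n) print s, n[s] }'
}
```

Run `netstat -a | count_wait_states` while the trap load is going; a large and growing count during query smartset activity would support the theory above.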
I'll be interested in any further results you are willing
to share.
I got the base script from IBM support and have no problem sharing if someone
from IBM weighs in with no objections. We are running these traps through
TEC_ITS.rls, so nvcorrd, etc. should be getting exercised. I would like
to put a mix of traps in as well, but am not a developer so I'm making do
right now.
Funny you mention Query Smartset node; we are pretty sure this was
the major source of our trouble. Ours happened to be there for no good reason,
so we removed it and cycled the daemons. In addition, we did minor things like
configure trapd to save logs for a week, and implemented a weekly
ovmapcount/ovtopofix process. NV has been smooth ever since. Until then, NV had been hanging at least
once/week, and we were thinking NV was choking on the number of traps, which
we now believe to be bunk based on testing. MLM was considered to be the
solution until we learned our addressing scheme was not compatible. That's
when we opened a support call--been at this for about a month
now.
We're also ready to up the number of traps to see where NV falls over.
When this started, we got information from support that NV could handle
sustained 6-8 traps/second. I've got the email somewhere... It appears
that number is conservative.
One other thing - the use of smartsets and rulesets
heavily affects performance. It would be beneficial if your testing included
a variety of traps, not the same one over and over. In addition, pushing
them through rulesets, if possible, would be a really good stress test,
especially if you have rulesets doing a "query smartset"
node.
Would you be willing to share your script that
generates the traps? I am interested in doing the same
thing.
That sounds like a valid way to test, but I'm thinking
you may want to throw in some more randomness, maybe
some heavier peaks. Sounds like the 250 in 50 secs are
dealt with OK in their 10-minute window, but what happens
with a burst of 1000 thrown into the mix?
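Since the actual script came from IBM support and isn't reproduced here, the following is only a sketch of how such a generator might look: the snmptrap path, target, enterprise OID, and generic/specific trap values are all illustrative placeholders to swap for your own.

```shell
#!/bin/sh
# Sketch of a trap-load generator with steady-rate and burst modes.
# SNMPTRAP, TARGET, and the OID below are hypothetical; substitute your own.
SNMPTRAP=${SNMPTRAP:-/usr/OV/bin/snmptrap}
TARGET=${TARGET:-localhost}
OID=.1.3.6.1.4.1.2.6.3.1       # placeholder enterprise OID

# send_burst COUNT: fire COUNT traps back to back
send_burst() {
    i=0
    while [ "$i" -lt "$1" ]; do
        "$SNMPTRAP" "$TARGET" "$OID" "$TARGET" 6 1 0
        i=$((i + 1))
    done
}

# steady_rate PER_SEC SECONDS: approximate a sustained trap rate
steady_rate() {
    t=0
    while [ "$t" -lt "$2" ]; do
        send_burst "$1"
        sleep 1
        t=$((t + 1))
    done
}
```

Something like `steady_rate 5 50` approximates the 250-traps-in-50-seconds baseline, with an occasional `send_burst 1000` for the heavier peak; pointing SNMPTRAP at echo first lets you dry-run the loop without flooding anything.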
Regards,
Brett
Tivoli Software/IBM