Re: [nv-l] Ruleset Correlation
2004-05-28 11:45:14
Well, it is awfully difficult to try
to diagnose your situation without knowing how the code you have designed
actually works.
Did the ruleset fire correctly on every
event?
Your best bet is to turn on nvcorrd
tracing (nvcdebug -d all) after nvcorrd starts so you can look at the logs.
If the logs toggle too quickly, then you'll have to start nvcorrd with
the -l <logfile> parameter so he just writes to one huge log
until you stop him. The logs will show what actually happens inside
him and whether the rulesets worked properly.
Did the scripts get launched?
If you think you already know that they
did, and these notifications are sent via scripts run by actionsvr, then
it is time to look at the nvaction logs. Note that the way actionsvr operates
is that he spawns a child for every action he runs, so if you are expecting
34 concurrent notifications, you'll get up to 35 actionsvr processes running
concurrently, the main one and 34 children. There's no magic number
of actionsvr processes that can run at one time; that's up to your operating
system limits. However, actionsvr will cancel his children if
they don't complete in 999 seconds.
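
The child-per-action model described above can be illustrated with a minimal Python sketch. This is not actionsvr's code: the function name run_actions and the use of Python's subprocess module are assumptions made purely for illustration, and only the 999-second limit comes from the description above.

```python
import subprocess
import time


def run_actions(commands, timeout_seconds=999):
    """Start one child process per command and cancel any that overrun.

    Returns a list of (command, status) pairs, where status is the child's
    exit code or the string "cancelled".
    """
    # Launch every child up front so they all run concurrently,
    # remembering when each one started.
    children = [(cmd, subprocess.Popen(cmd), time.monotonic()) for cmd in commands]

    results = []
    for cmd, child, started in children:
        # Time this particular child still has before its own limit expires.
        remaining = max(0.0, started + timeout_seconds - time.monotonic())
        try:
            results.append((cmd, child.wait(timeout=remaining)))
        except subprocess.TimeoutExpired:
            child.kill()  # illustrative stand-in for the parent cancelling a child
            child.wait()  # reap the killed child so it does not linger as a zombie
            results.append((cmd, "cancelled"))
    return results
```

With 34 commands in the list, this sketch would have 35 processes alive at once (the parent plus 34 children), which is the situation described above. Whether that many can start is decided by the operating system's process limits, not by the parent.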
Hope this helps.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr AT csgsystems DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
05/28/2004 10:08 AM
To: <nv-l AT lists.us.ibm DOT com>
Subject: [nv-l] Ruleset Correlation
Greetings - NetView 7.1.3 &
Solaris 2.8
I am working through some automation
performance issues and I observed something disturbing. I have automation
that receives SNA mainframe events, parses and formats the trap and writes
it to a log. It also uses snmptrap to generate a pseudo "node down"
trap. When a corresponding up event is received for the same SNA device
I use snmptrap to send an "up" event. A second ruleset performs
correlation on the up and down events so that if the duration between the
up and down events is less than 10 minutes, it gets tossed, otherwise a
notification script is called that wakes up the help desk.
What disturbs me is the behavior
I see when we have a significant outage - in my sample case, 34 SNA devices
dropped at one time. When the corresponding up messages occurred, everything
worked properly except the notifications. The duration of the outage exceeded
the time in pass on match/reset on match timers but only 12 up notifications
occurred. According to my application log and trapd.log, the 34 "up"
events got generated but the notifications did not. What I am wondering
is whether there is a limit to the number of outstanding correlated events,
i.e. how many devices can be waiting for a node up? Is it possible only
12 pairs of node down/ups can be outstanding? Is there a way to look at
what events automation (and I'm not sure if it's nvcorrd, actionsvr or
ovactiond that's involved) still has outstanding?
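
One way to cross-check the numbers independently of NetView is to replay the down/up events and count how many notifications the 10-minute rule should have produced. The Python sketch below is only a model of the rule as described in this message, not of nvcorrd's internals. The function name expected_notifications and the (timestamp, device, state) tuple layout are assumptions; extracting those tuples from the application log or trapd.log is left out because the log formats are not shown here.

```python
from datetime import timedelta

THRESHOLD = timedelta(minutes=10)  # outages shorter than this are tossed


def expected_notifications(events, threshold=THRESHOLD):
    """Pair down/up events per device and apply the duration rule.

    events: iterable of (timestamp, device, state) tuples, where timestamp
            is a datetime and state is "down" or "up" (assumed layout).
    Returns (notify, pending):
      notify  - list of (device, down_time, up_time) that should notify
      pending - dict of device -> down_time still waiting for an "up"
    """
    pending = {}
    notify = []
    for ts, device, state in sorted(events):
        if state == "down":
            # Keep the earliest outstanding "down" for each device.
            pending.setdefault(device, ts)
        elif state == "up" and device in pending:
            down_ts = pending.pop(device)
            if ts - down_ts >= threshold:
                notify.append((device, down_ts, ts))
    return notify, pending
```

If a replay of the 34 outages yields 34 entries in notify while only 12 notifications were actually sent, the events themselves are paired correctly and the loss is somewhere between the ruleset and the notification script.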