I'll try to keep this simple, James, and answer your
questions at the same time. Here is the flow:
1. Mainframe NetMaster sends an enterprise trap when an SNA device fails
2. The trap is received by NetView
3. A ruleset is triggered via ESE.automation and calls a script
4. The script parses the event, picking out the important data (SNA PU NAME
   & STATUS)
5. The script uses a TCP socket connection to a listening script
6. The listening script interrogates its hash table of 1100+ devices for
   the name and location of the client affected
7. The listening script issues our own trap (i.e. node down or node up)
The listening script is used because I wanted to avoid having to load a hash
of 1100 customers (or do equivalent file I/O) in the event of large scale
outages. When we IPL the mainframe, we are going to receive events on ALL SNA
PUs, and spawning several hundred copies of the script, each loading the hash
with 1100 customers, would be an incredible resource hog. So I have the
listening script load the hash once, run like a daemon, and accept requests
from small individual scripts that have parsed out the relevant data.
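A minimal sketch of the listener idea described above, written in Python rather than the Perl the real scripts use. The device table, the one-line "PU-name status" wire format, the UP/DOWN status strings, and every function and class name here are assumptions for illustration only. The real listener issues an SNMP trap where this sketch just returns a line of text.

```python
import socket
import socketserver

# Hypothetical device table. The real one holds 1100+ entries and is
# loaded once when the daemon starts, not once per event.
DEVICES = {
    "PU00001": ("Example Customer A", "Denver"),
    "PU00002": ("Example Customer B", "Omaha"),
}

# Hypothetical status strings sent by the small parsing script.
EVENT_FOR_STATUS = {"UP": "node up", "DOWN": "node down"}


def build_event(pu_name, status, devices=DEVICES):
    """Turn a PU name and status into the text of a node down/up event."""
    customer, location = devices.get(pu_name, ("unknown", "unknown"))
    kind = EVENT_FOR_STATUS.get(status.upper(), "unknown status")
    return f"{kind}: {pu_name} ({customer}, {location})"


class LookupHandler(socketserver.StreamRequestHandler):
    """Handles one request per connection: '<PU name> <status>' plus newline."""

    def handle(self):
        fields = self.rfile.readline().decode("ascii", "replace").split()
        if len(fields) != 2:
            reply = "error: expected '<PU name> <status>'"
        else:
            # A real listener would send the trap here; the sketch only replies.
            reply = build_event(fields[0], fields[1])
        self.wfile.write((reply + "\n").encode("ascii", "replace"))


class Listener(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True
    request_queue_size = 128  # larger backlog so a burst of clients can queue


def send_event(host, port, pu_name, status):
    """What each small per-event script would do: connect, send, read reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(f"{pu_name} {status}\n".encode("ascii"))
        with sock.makefile("r", encoding="ascii") as reply:
            return reply.readline().strip()


if __name__ == "__main__":
    print(build_event("PU00001", "DOWN"))
```

To try it end to end, something like `Listener(("127.0.0.1", 5150), LookupHandler).serve_forever()` in one process and `send_event("127.0.0.1", 5150, "PU00001", "DOWN")` in another should work; the port number is arbitrary.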
The logging shows this:
- Trapd.log shows all 34 down events and all 34 up events from the mainframe
  (duration beyond the timers)
- The small script which parses logged all 34 down and all 34 up events
- The listener program generated all 34 down and all 34 up events (the ones
  the timers care about)
- A second ruleset is used to catch the listener-generated node down and up
  events and trigger the notification script to TEC (it appears not all
  resulted in triggering the notification script)
- Notification to TEC only occurred on 12.
- TEC console only shows 12 up events and leaves the remainder as open.
So, one of two conditions exists. My listener program did receive all the
events, and did generate the traps. Therefore, either ruleset correlation was
only able to correlate a maximum of 12 (and thus did not fire the notification
script), OR the notification script has problems generating 34 calls to TEC
(we use postemsg, not TEC forwarding). I would rule out the listener program
having an issue on the basis that it was able to generate all the down and up
traps even during the heaviest of volumes I have observed. Somewhere, the
ruleset correlation failed, or the TEC postemsg failed.
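One way to narrow this down is to line up the PU names seen at each stage and report the first stage where each device goes missing. The sketch below is in Python and assumes the names have already been pulled out of each log into sets; the stage labels, the PU names, and the function name are made up for the example, and no log format is implied.

```python
def first_missing_stage(stages):
    """Report where each device first drops out of the pipeline.

    stages is an ordered list of (stage_name, pu_names) pairs. The first
    entry is treated as the full set of devices that should pass through
    every later stage. Returns {pu_name: first stage where it is absent}.
    """
    expected = set(stages[0][1])
    lost = {}
    for stage_name, seen in stages[1:]:
        for pu in sorted(expected - set(seen)):
            # setdefault keeps the earliest stage at which the PU was missing.
            lost.setdefault(pu, stage_name)
    return lost


if __name__ == "__main__":
    # Made-up names reproducing the counts from this outage: 34 devices
    # reach the listener, only 12 reach the notification script and TEC.
    all_pus = {f"PU{n:05d}" for n in range(1, 35)}
    notified = {f"PU{n:05d}" for n in range(1, 13)}
    stages = [
        ("trapd.log", all_pus),
        ("parser script", all_pus),
        ("listener", all_pus),
        ("notification script", notified),
        ("TEC console", notified),
    ]
    lost = first_missing_stage(stages)
    print(len(lost), "devices lost; stages:", sorted(set(lost.values())))
```

If every missing device is lost at the notification stage, the notification script was never run for it and correlation or actionsvr is the place to look; if devices are lost only at the TEC stage, the script ran and postemsg is the suspect.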
As far as actionsvr firing up 34/35 processes, that should be okay. These
NetView servers have dual 1.0 GHz processors and 2 GB of memory. We have other
"storm-like" situations in which we handle a volume equal to or larger than
this. In those cases though, I don't have the hold-down timers and the second
ruleset.
Sorry if this is complicated. I was trying to be conservative with system
resources by using this listener program. All code is in Perl, btw. One
problem I have is that I cannot test this without nuking some large number of
customers, and my management seems to frown on production outages to test
event notification. Go figure.
Well, it is awfully difficult to try to diagnose your situation without
knowing how the code you have designed actually works.

Did the ruleset fire correctly on every event? Your best bet is to turn on
nvcorrd tracing (nvcdebug -d all) after nvcorrd starts so you can look at the
logs. If they toggle too quickly, then you'll have to start nvcorrd with the
-l <logfile> parameter so he just writes to one huge log until you stop him.
The logs will show what actually happens inside him and whether the rulesets
worked properly.
Did the scripts get launched? If you think you already know that they did,
and these notifications are sent via scripts run by actionsvr, then it is time
to look at the nvaction logs. Note that the way actionsvr operates is that he
spawns a child for every action he runs, so if you are expecting 34 concurrent
notifications, you'll get up to 35 actionsvr processes running concurrently,
the main one and 34 children. There's no magic number of actionsvr processes
that can run at one time; that's up to your operating system limits. However,
actionsvr will cancel his children if they don't complete in 999 seconds.
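Independently of the nvaction logs, the notification script itself can record every launch and its outcome, which shows whether it was started 34 times or only 12. The following is a sketch in Python under stated assumptions: the wrapper name, the log path, and the log line layout are all hypothetical, and the command to run (for example the postemsg call) is passed in unchanged rather than guessed at here.

```python
import os
import subprocess
import sys
import tempfile
import time


def run_logged(cmd, log_path, timeout=None):
    """Run cmd (a list of arguments), appending a start and an end line to
    log_path. Returns the exit code, or None on timeout or launch failure."""

    def log(text):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # Append mode with one short write per line, so concurrent copies
        # of the wrapper can share the same log file.
        with open(log_path, "a") as handle:
            handle.write(f"{stamp} pid={os.getpid()} {text}\n")

    log(f"start {cmd!r}")
    started = time.time()
    code = None
    try:
        code = subprocess.run(cmd, timeout=timeout).returncode
        outcome = f"exit={code}"
    except subprocess.TimeoutExpired:
        outcome = "timeout"
    except OSError as exc:
        outcome = f"launch-failed={exc}"
    log(f"end {outcome} elapsed={time.time() - started:.1f}s")
    return code


if __name__ == "__main__":
    # Demo with a harmless command standing in for the real notification call.
    demo_log = os.path.join(tempfile.mkdtemp(), "notify.log")
    run_logged([sys.executable, "-c", "print('notified')"], demo_log)
    with open(demo_log) as handle:
        print(handle.read(), end="")
```

Counting the "start" lines after an outage gives the number of times the action was actually launched, and any "end" line without `exit=0` points at the call to TEC rather than at correlation.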
Hope this helps.
James Shanks Level 3 Support for Tivoli NetView for UNIX
and Windows Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr AT csgsystems DOT com> Sent by: owner-nv-l AT lists.us.ibm DOT com
05/28/2004 10:08 AM
To: <nv-l AT lists.us.ibm DOT com>
cc:
Subject: [nv-l] Ruleset Correlation
Greetings - NetView 7.1.3 & Solaris 2.8

I am working through some automation performance issues and I observed
something disturbing. I have automation that receives SNA mainframe events,
parses and formats the trap and writes it to a log. It also uses snmptrap to
generate a pseudo "node down" trap. When a corresponding up event is received
for the same SNA device I use snmptrap to send an "up" event. A second ruleset
performs correlation on the up and down events so that if the duration between
the up and down events is less than 10 minutes, it gets tossed, otherwise a
notification script is called that wakes up the help desk.

What disturbs me is the behavior I see when we have a significant outage - in
my sample case, 34 SNA devices dropped at one time. When the corresponding up
messages occurred, everything worked properly except the notifications. The
duration of the outage exceeded the time in the pass on match/reset on match
timers but only 12 up notifications occurred. According to my application log
and trapd.log, the 34 "up" events got generated but the notifications did not.

What I am wondering is whether there is a limit to the number of outstanding
correlated events, i.e. how many devices can be waiting for a node up? Is it
possible only 12 pairs of node down/ups can be outstanding? Is there a way to
look at what events automation (and I'm not sure if it's nvcorrd, actionsvr
or ovactiond that's involved) still has outstanding?
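For reference, the intended behavior of the second ruleset can be written down as a small model: pair each down with the next up for the same device, toss the pair if the outage lasted less than 10 minutes, and notify otherwise. This Python sketch is only a model of that rule, not of how nvcorrd implements pass on match/reset on match internally; the event tuple layout and the function name are assumptions.

```python
HOLD_DOWN_SECONDS = 10 * 60  # the 10 minute threshold described above


def correlate(events, hold_down=HOLD_DOWN_SECONDS):
    """Pair down/up events per device and apply the hold-down rule.

    events is an iterable of (timestamp_seconds, pu_name, kind) tuples in
    time order, where kind is "down" or "up". Returns a pair:
    (notifications, still_open). notifications is a list of
    (pu_name, outage_seconds) for outages at least hold_down long;
    still_open maps each device still waiting for an up to its down time.
    """
    waiting = {}
    notifications = []
    for timestamp, pu_name, kind in events:
        if kind == "down":
            # Keep the earliest down if a device reports down twice.
            waiting.setdefault(pu_name, timestamp)
        elif kind == "up" and pu_name in waiting:
            outage = timestamp - waiting.pop(pu_name)
            if outage >= hold_down:
                notifications.append((pu_name, outage))
    return notifications, waiting


if __name__ == "__main__":
    # 34 made-up devices drop together and recover 15 minutes later.
    names = [f"PU{n:05d}" for n in range(1, 35)]
    events = [(0, name, "down") for name in names]
    events += [(900, name, "up") for name in names]
    notified, still_open = correlate(events)
    print(len(notified), "notifications,", len(still_open), "still open")
```

Feeding the model the timestamps from trapd.log gives the number of notifications the ruleset should have produced, which can then be compared with the 12 that actually occurred.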