RE: [nv-l] Ruleset Correlation
2004-05-28 13:18:33
Scott,
I am slow-witted and I need lots of
time to think about complex scenarios, so please bear with me if I belabor
the obvious or just don't seem to "get it", especially at first.
Your solution is both elaborate and ingenious, and I am not
certain that I comprehend it.
When I asked about code, I was thinking more specifically
about how the rulesets are structured and what internal ruleset nodes
are used, because you were saying that you thought something had failed
inside nvcorrd (that's how I interpret what you said about there
being an internal limit of 12). And now you have elaborated on that
by saying,
>"A second ruleset is
used to catch the listener-generated node down and up events and trigger
the notification script to TEC (it appears not all resulted in >triggering
the notification script)."
Isn't that the same as saying that nvcorrd
failed to kick off your script when he should have? You say that
you see all the generated events (so the first rs in ESE.automation appears
to be working), so it is this second rs which you think is failing. Am
I correct? If so, then that is the one we should focus on: how it
is structured, what it does, and so on.
If the second rs is launching postemsg
in an action node, then it seems to me that you have several choices on
how to debug this.
One way is to issue postemsg from a script which echoes a unique
value (a random number, say) to a file, so you could see how many times
postemsg was actually called. That would show whether it was driven
as many times as you think. Or is this what you think you already
have with your value of 12?
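For instance, something like this would do (a minimal sketch; the log
path is arbitrary, and postemsg may need its full path on your system):

    #!/usr/bin/perl
    # Hypothetical wrapper: have the ruleset's action node call this
    # script instead of postemsg directly, with the same arguments.
    use strict;

    my $id = time() . ".$$";   # epoch seconds plus pid, unique enough per call
    open(my $log, '>>', '/tmp/postemsg_calls.log') or die "cannot open log: $!";
    print $log "$id: @ARGV\n";
    close $log;

    # hand the original arguments straight through to the real postemsg
    exec 'postemsg', @ARGV;

Then counting the lines in /tmp/postemsg_calls.log tells you exactly how
many times the action really ran.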
Another way might be, as Stephen Hochstetler
suggested, to trap the return code from postemsg. If postemsg
generates a non-zero return code, and actionsvr kicked off the script,
then actionsvr will capture the accompanying message and log it to the nvaction.log.
Every action gets a unique sequential id number, and if the script or command
returns a non-zero return code then any messages which accompanied that
return code are also displayed with that id number, so you can see
what happened to each execution.
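A sketch of that idea (the TEC server, event class, and source below
are invented; substitute whatever your notification script really sends):

    #!/usr/bin/perl
    # Hypothetical notification script run by actionsvr.
    use strict;

    my $pu = $ARGV[0] || 'UNKNOWN';
    my $rc = system('postemsg', '-S', 'tec01',      # hypothetical TEC server
                    '-r', 'CRITICAL',
                    '-m', "SNA node down: $pu",
                    'NV_SNA_Down', 'NetView');      # example class and source
    if ($rc != 0) {
        my $exit = $rc >> 8;     # exit code is the high byte of the wait status
        print STDERR "postemsg for $pu failed, exit code $exit\n";
        exit($exit || 1);        # exit non-zero so actionsvr logs it in nvaction.log
    }
    exit 0;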
Or you can trace the execution of just this second ruleset in nvcorrd
by giving it the special reserved name "nvcdebug.rs".
Then you can look in the logs and see whether it works right or not,
whether each of the triggering events caused an action to be queued to
actionsvr or not. If you see the action being sent, then you know
the problem is not in nvcorrd. From there you could see the action
command arriving in the nvaction.log and watch what happens after that.
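Making the debug copy is just a file copy under the reserved name, and
then you point the automation at the copy. For example (the source
file name is hypothetical; /usr/OV/conf/rulesets is the usual place):

    cp /usr/OV/conf/rulesets/notify_tec.rs /usr/OV/conf/rulesets/nvcdebug.rs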
How would you test this without taking
down devices? With snmptrap. Your initial ruleset works off
a trap, and so does your second. So you should be able to build an snmptrap
command to mimic them. The code won't know that there wasn't a real
outage if you launch a test script with 34 snmptrap commands in it.
Unsure how to build this? Just ask.
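As a rough starting point, it might look like this (every OID,
community string, and specific-trap number is a placeholder for
whatever your listener actually sends):

    #!/usr/bin/perl
    # Hypothetical trap-storm test driver: 34 pseudo downs, wait out
    # the hold-down window, then 34 matching ups.
    use strict;

    my $target     = 'localhost';              # your NetView server
    my $enterprise = '.1.3.6.1.4.1.99999';     # placeholder enterprise OID
    my @pus        = map { sprintf 'PU%04d', $_ } 1 .. 34;

    for my $pu (@pus) {    # pseudo node-down traps, specific 1 (placeholder)
        system qq(/usr/OV/bin/snmptrap $target public $enterprise "" 6 1 0 ) .
               qq($enterprise.1 octetstring "$pu");
    }
    sleep 660;             # 11 minutes, past the 10-minute hold-down timers
    for my $pu (@pus) {    # matching node-up traps, specific 2 (placeholder)
        system qq(/usr/OV/bin/snmptrap $target public $enterprise "" 6 2 0 ) .
               qq($enterprise.1 octetstring "$pu");
    }

Run it once and compare the counts in trapd.log against what actually
reaches TEC.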
Complicated? Difficult? Perhaps,
but no more so than what you have already built. That's the problem
with disaster recovery tools -- it's always hard to test them.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr AT csgsystems DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
05/28/2004 11:47 AM
|
To
| <nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| RE: [nv-l] Ruleset Correlation |
|
I'll try to keep this simple, James,
and answer your questions at the same time. Here is the flow:
Mainframe NetMaster sends an enterprise trap when an SNA device fails
Trap received by NetView
Ruleset triggered via ESE.automation calls a script
Script parses the event, picking out the important data (SNA PU name & status)
Script uses a TCP socket connection to a listening script
Listening script interrogates its hash table of 1100+ devices for the name and location of the affected client
Listening script issues our own trap (i.e. node down or node up)
The listening script is used because
I wanted to avoid having to load a hash of 1100 customers (or do equivalent
file I/O) in the event of a large-scale outage. When we IPL the mainframe,
we are going to receive events on ALL SNA PUs, and spawning several hundred
copies of the script, each loading the hash of 1100 customers, would be an
incredible resource hog. So I have the listening script load the hash once,
run like a daemon, and accept requests from small individual scripts that
have parsed out the relevant data.
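Stripped down, the listener looks roughly like this (the port, data
file, and one-line request format shown here are illustrative, not the
real ones):

    #!/usr/bin/perl
    # Rough sketch of the listener daemon: load the customer hash once,
    # then serve lookups to the small per-event parsing scripts.
    use strict;
    use IO::Socket::INET;

    my %customer;
    open(my $fh, '<', '/usr/local/etc/customers.dat') or die "load: $!";
    while (<$fh>) {
        chomp;
        my ($pu, $name, $location) = split /\|/;   # hypothetical layout
        $customer{$pu} = [$name, $location];
    }
    close $fh;

    my $server = IO::Socket::INET->new(
        LocalPort => 9123,    # hypothetical port
        Listen    => 5,
        Reuse     => 1,
    ) or die "listen: $!";

    while (my $client = $server->accept) {
        my $req = <$client>;  # e.g. "PU0001 DOWN\n" from a parsing script
        close $client;
        next unless defined $req;
        my ($pu, $status) = split ' ', $req;
        my $info = $customer{$pu} or next;
        # here the real code issues our own node down/up trap via
        # snmptrap; printing instead keeps the sketch self-contained
        print "$status trap for $pu ($info->[0] at $info->[1])\n";
    }

The per-event script is then tiny: open a socket with
IO::Socket::INET->new(PeerAddr => 'localhost', PeerPort => 9123),
print its one line, and exit.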
The logging shows this:
trapd.log shows all 34 down events and all 34 up events from the mainframe (duration beyond the timers)
The small script which parses logged all 34 down and all 34 up events
The listener program generated all 34 down and all 34 up events (the ones the timers care about)
A second ruleset is used to catch the listener-generated node down and up events and trigger the notification script to TEC (it appears not all resulted in triggering the notification script)
Notification to TEC occurred on only 12
TEC console only shows 12 up events and leaves the remainder as open
So, one of two conditions exists.
My listener program did receive all the events, and did generate the traps.
Therefore, either ruleset correlation was only able to correlate a maximum
of 12 (and thus did not fire the notification script), OR the notification
script has problems generating 34 calls to TEC (we use postemsg, not TEC
forwarding). I would rule out the listener program having an issue, on the
basis that it was able to generate all the down and up traps even during
the heaviest volumes I have observed. Somewhere, the ruleset correlation
failed, or the TEC postemsg failed.
As far as actionsvr firing up
34/35 processes, that should be okay. These NetView servers have dual 1.0
GHz processors and 2 GB of memory. We have other "storm-like"
situations where we handle a volume equal to or larger than this. In those
cases, though, I don't have the hold-down timers and the second ruleset.
Sorry if this is complicated;
I was trying to be conservative with system resources by using this listener
program. All code is in Perl, btw. One problem I have is that I cannot test
this without nuking some large number of customers, and my management seems
to frown on production outages to test event notification. Go figure.
From: owner-nv-l AT lists.us.ibm DOT com
[mailto:owner-nv-l AT lists.us.ibm DOT com] On Behalf Of James Shanks
Sent: Friday, May 28, 2004 10:27 AM
To: nv-l AT lists.us.ibm DOT com
Subject: Re: [nv-l] Ruleset Correlation
Well, it is awfully difficult to try to diagnose your situation without
knowing how the code you have designed actually works.
Did the ruleset fire correctly on every event?
Your best bet is to turn on nvcorrd tracing (nvcdebug -d all) after nvcorrd
starts so you can look at the logs. If they toggle too quickly, then
you'll have to start nvcorrd with the -l <logfile> parameter
so he just writes to one huge log until you stop him. The logs will
show what actually happens inside him and whether the rulesets worked properly.
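Concretely, turning the tracing on is just:

    nvcdebug -d all    # full nvcorrd tracing, while nvcorrd is running

The -l <logfile> parameter goes on nvcorrd's startup instead; I would
expect that to mean adding it to the daemon's registration (e.g. its
/usr/OV/lrf entry, re-registered with ovaddobj) and restarting, but
check the documentation for the exact steps.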
Did the scripts get launched?
If you think you already know that they did, and these notifications are
sent via scripts run by actionsvr, then it is time to look at the nvaction
logs. Note that the way actionsvr operates is that he spawns a child for
every action he runs, so if you are expecting 34 concurrent notifications,
you'll get up to 35 actionsvr processes running concurrently, the main
one and 34 children. There's no magic number of actionsvr processes
that can run at one time; that's up to your operating system limits.
But actionsvr will cancel his children if they don't complete
in 999 seconds.
Hope this helps.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Barr, Scott"
<Scott_Barr AT csgsystems DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
05/28/2004 10:08 AM
|
To
| <nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| [nv-l] Ruleset Correlation |
|
Greetings - NetView 7.1.3 & Solaris 2.8
I am working through some automation performance issues and I observed
something disturbing. I have automation that receives SNA mainframe events,
parses and formats the trap, and writes it to a log. It also uses snmptrap
to generate a pseudo "node down" trap. When a corresponding up
event is received for the same SNA device, I use snmptrap to send an "up"
event. A second ruleset performs correlation on the up and down events:
if the duration between the down and up events is less than 10 minutes,
the pair gets tossed; otherwise a notification script is called that
wakes up the help desk.
What disturbs me is the behavior I see when we have a significant outage
- in my sample case, 34 SNA devices dropped at one time. When the corresponding
up messages occurred, everything worked properly except the notifications.
The duration of the outage exceeded the time in the pass-on-match/reset-on-match
timers, but only 12 up notifications occurred. According to my application
log and trapd.log, the 34 "up" events got generated but
the notifications did not. What I am wondering is whether there is a limit
to the number of outstanding correlated events, i.e. how many devices can
be waiting for a node up? Is it possible only 12 pairs of node down/ups
can be outstanding? Is there a way to look at what events automation (and
I'm not sure if it's nvcorrd, actionsvr or ovactiond that's involved) still
has outstanding?