nv-l

Re: trapd, nvcorrd, actiond hangs

1999-12-03 07:43:18
Subject: Re: trapd, nvcorrd, actiond hangs
From: James Shanks <James_Shanks AT TIVOLI DOT COM>
To: nv-l AT lists.tivoli DOT com
Date: Fri, 3 Dec 1999 07:43:18 -0500
It is evident from your description that there is much about ruleset processing
that you do not understand.
Let me see if I can help.  If not, you should call Support, since you have a
problem which requires expert attention.

First, a ruleset cannot hang trapd, nor can it filter events from being processed
by trapd, because it is run by nvcorrd, who gets his events from trapd.  Ruleset
processing occurs after the events leave trapd, so if he is hung you must look
elsewhere.  Are you seeing events being written to the trapd log?  If so, then he
is not hung.  If not, then issue the command trapd -T from the command line and
look in trapd.trace.  Is it full of messages which say "event queued"?
If so, then the problem is that events are arriving much too fast for trapd to
process, so all he can do is queue them until the rate falls off.  Typically a
trapd slowdown is caused by either (1) too many traps at once or (2) name
resolution problems.
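
A quick way to judge whether trapd.trace is "full of" queued messages is to
count them.  The following is only a minimal sketch in Python and is not part
of NetView: the function name and the sample lines are made up for
illustration, and the real layout of trapd.trace may differ, so it simply looks
for the phrase "event queued" anywhere in a line.

```python
from typing import Iterable, Tuple


def count_queued(lines: Iterable[str], marker: str = "event queued") -> Tuple[int, int]:
    """Return (queued, total): lines mentioning the marker, and non-blank lines seen."""
    queued = 0
    total = 0
    needle = marker.lower()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        if needle in line.lower():
            queued += 1
    return queued, total


# Hypothetical sample lines; real trace output will look different.
sample = [
    "trap received from agent",
    "event queued",
    "",
    "Event queued",
]
queued, total = count_queued(sample)
print(f"{queued} of {total} trace lines mention a queued event")
```

To run it against a real trace, open the file and pass the file object in
place of the sample list, for example count_queued(open(path)), where path
points at the trapd.trace of your installation.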

Second, where is the ruleset being run?  From ESE.automation?  That would be the
logical place, yet it has a Forward correlation node in it.  This is very bad,
because it means that actionsvr, who registered this ruleset, will be getting
events (rather than commands) on his socket, and he has no way to de-queue them.
Eventually his socket will fill and he will hang.  Then nvcorrd, who runs the
ruleset, will fill up his socket, because he cannot pass events to actionsvr,
and he will hang.  I am not sure what netstat -a shows you on Solaris, but on
AIX it shows you the send and receive queues of the sockets, and you can see if
any of them are backed up.
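
If you want to scan the netstat output rather than read it by eye, something
like the sketch below may help.  It is an illustration only and rests on
assumptions: it expects lines with the columns protocol, Recv-Q, Send-Q, local
address, remote address, and the sample text and port numbers are invented.
Solaris may print a different layout, in which case the field positions would
need adjusting.

```python
from typing import List, NamedTuple


class SocketQueue(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    remote: str


def backed_up_sockets(netstat_text: str, threshold: int = 0) -> List[SocketQueue]:
    """Return sockets whose receive or send queue is larger than threshold."""
    found = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local remote [state]
        if len(fields) < 5 or not fields[0].lower().startswith(("tcp", "udp")):
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        entry = SocketQueue(fields[0], int(fields[1]), int(fields[2]),
                            fields[3], fields[4])
        if entry.recv_q > threshold or entry.send_q > threshold:
            found.append(entry)
    return found


# Invented sample in the assumed layout; not captured from a real system.
SAMPLE = """\
Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
tcp        0      0  localhost.40001    localhost.40002    ESTABLISHED
tcp    16384      0  localhost.40002    localhost.40001    ESTABLISHED
"""

for sock in backed_up_sockets(SAMPLE):
    print(sock.local, "recv-q:", sock.recv_q, "send-q:", sock.send_q)
```

A queue that stays large over several samples is the thing to look for; a
single non-zero reading may just be traffic in flight.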

But if this rule is being run from a display window, make sure only one person
runs it, or you will get duplicate processing, and all those queries to ovwdb
and set commands to ovwdb will be duplicated, and that would also slow things
to a crawl.

Third, your ruleset is itself very inefficient.  You have as the very first node
a Query Collection.  That means that all traps, even the ones which are internal
or Log Only will be passed to this check, and nvcorrd will have to wait while
ovwdb returns an answer to the query.  Since you follow the Query Collection
with a Trap Settings node, you should change that order and only pass to the
Query Collection what it must resolve.  I am not sure (because ruleset syntax
was not made for humans but for the ruleset editor -- which is why you may want
to call Support to get more help) but it also appears that you have the same
situation with the Field Compare node.  You should precede that with the Trap
Settings node, not follow it.
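
The effect of node order can be shown with a toy model.  This is a sketch in
Python, not NetView code: the class and function names, the node names and the
event mix are all hypothetical, and the "query" here is just a set lookup that
counts its calls, standing in for a round trip to ovwdb.  The two specific trap
numbers are the ones from the posted ruleset.

```python
from typing import Callable, Dict, List, Sequence

Event = Dict[str, str]


class CountingQuery:
    """Stand-in for a slow collection lookup; counts how often it is asked."""

    def __init__(self, members: Sequence[str]) -> None:
        self.members = set(members)
        self.calls = 0

    def __call__(self, event: Event) -> bool:
        self.calls += 1
        return event["node"] in self.members


def is_wanted_trap(event: Event) -> bool:
    """Cheap local check on the specific trap number."""
    return event["specific"] in {"58916864", "58916865"}


def survivors(events: List[Event], checks: Sequence[Callable[[Event], bool]]) -> List[Event]:
    # all() over a generator stops at the first failing check,
    # so later checks only see events that passed the earlier ones.
    return [e for e in events if all(check(e) for check in checks)]


# 100 made-up events, of which only 4 carry a trap number we care about.
events = [
    {"node": f"node{i}", "specific": "58916865" if i % 25 == 0 else "other"}
    for i in range(100)
]

query_first = CountingQuery(["node0", "node50"])
trap_first = CountingQuery(["node0", "node50"])

result_a = survivors(events, [query_first, is_wanted_trap])
result_b = survivors(events, [is_wanted_trap, trap_first])

print("query first:", query_first.calls, "lookups")
print("trap first: ", trap_first.calls, "lookups")
print("same events pass either way:", result_a == result_b)
```

In this model both orders let the same events through, but putting the cheap
trap check first means the expensive lookup runs only for the few events that
could matter.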


That's about all I can see from what you have sent.  If you need more, I
suggest a PMR, but if you have specific questions, please feel free to ask.

James Shanks
Tivoli (NetView for UNIX) L3 Support



Åsa Berglund <berglund AT RALEIGH.IBM DOT COM> on 12/03/99 03:51:56 AM

Please respond to Discussion of IBM NetView and POLYCENTER Manager on NetView
      <NV-L AT UCSBVM.UCSB DOT EDU>

To:   NV-L AT UCSBVM.UCSB DOT EDU
cc:    (bcc: James Shanks/Tivoli Systems)
Subject:  trapd, nvcorrd, actiond hangs



Hi!

We are running NetView for NT 5.1.1 on Solaris 2.6.
Environment:
Solaris 2.6

A ruleset has been written with the following functionality:
If a node down event is received, a ruleset is activated that will
cancel specific actions (SMS to a mobile phone) if a node up event is
received within ten minutes. If no node up event is detected, an SMS will
be sent.

There is a strong filtering of events into this rule; a maximum of ~30
events may be out "on hold". Despite this, trapd, nvcorrd and actiond
hang.

This error has happened only once, but I'm interested to hear
your thoughts on why this is happening. Maybe I missed something, but
I sure cannot figure out what. I'd appreciate any feedback.

This is the ruleset:

  RuleSet2 RuleSet NVQryColl3 NVQryColl11
"" 0
NVQryColl3 NVQryColl TrapID4
2 ALLT_BRYGGOR 1 ""
TrapID4 TrapID AttrDelay5 SetNVField10
netView6000 1.3.6.1.4.1.2.6.3 "6 " "58916865 " 0 "netView6000
1.3.6.1.4.1.2.6.3" "REMNODER_NERE           Specific 58916865      " ""
0
AttrDelay5 AttrDelay SetNVField6 Action7 Action8 ForwardCorr9
"" 0 "" 720 "" 0 0 0 "2 2 0~"
SetNVField6 SetNVField
ericsson_node aktiv "" 0 0 0 0 2 ""
Action7 Action
" /usr/OV/bin/ovxecho -d 130.100.31.201:0 $NVATTR_2 NERE Haremlarm!!" ""

Action8 Action
"/ericsson/script/larm.sh $NVATTR_2 NERE harem nw-larm " ""
ForwardCorr9 ForwardCorr
""
SetNVField10 SetNVField
ericsson_node vent "" 0 0 0 0 2 ""
NVQryColl11 NVQryColl TrapID12
2 ALLT_BRYGGOR 1 ""
TrapID12 TrapID AttrDelay5.2 NVFldCmp13 NVFldCmp14 ForwardCorr17
netView6000 1.3.6.1.4.1.2.6.3 "6 " "58916864 " 0 "netView6000
1.3.6.1.4.1.2.6.3" "REMNODER_UPPE           Specific 58916864      " ""
0
NVFldCmp13 NVFldCmp
ericsson_node vent "" 0 1 1 0 2 0 ""
NVFldCmp14 NVFldCmp Action15 Action16
ericsson_node aktiv "" 0 1 1 0 2 0 ""
Action15 Action
"/usr/OV/bin/ovxecho -d 130.100.31.201:0 $NVATTR_2 UPPE Harem NW-Larm"
""
Action16 Action
"/ericsson/script/larm.sh $NVATTR_2 UPPE harem NW-Larm" ""
ForwardCorr17 ForwardCorr
""

Regards,
Åsa Berglund
Tivoli Nordic Team
IBM Sweden

