nv-l

Re: Sending Interface Down & Interface up pages and correlation

2000-01-18 13:34:52
Subject: Re: Sending Interface Down & Interface up pages and correlation
From: "Boyles, Gary P" <gary.p.boyles AT INTEL DOT COM>
To: nv-l AT lists.tivoli DOT com
Date: Tue, 18 Jan 2000 10:34:52 -0800
Well, as long as we're telling some of the different "twists" that we
each implement... I'll add mine!

When an interface goes down... I run a perl script that does a re-test on
the interface, and then (assuming state doesn't change)... creates a file,
and then puts an entry into the file.  Filename = nodename.  Entry example:
        132.233.1.201   Critical

The program then sleeps for 3 minutes.

If more interface-events for that node occur, then the program sees the
file is already created, and just adds entries to the existing file, and
then exits. For example, after 2 minutes (for a downed router) the file
might look like:
        132.233.1.201   Critical
        132.233.1.202   Critical
        132.233.1.203   Critical
        132.233.1.204   Critical
        132.233.1.205   Critical

If during that same time... one (or more) of the interfaces comes back up,
the file might look like:
        132.233.1.201   Critical
        132.233.1.202   Critical
        132.233.1.203   Critical
        132.233.1.204   Critical
        132.233.1.205   Critical
        132.233.1.202   Normal
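The create-or-append step can be sketched roughly as follows. This is Python for illustration, not the Perl script itself; the function name `record_event`, the `spool_dir` argument, and the use of exclusive-create mode to decide which process "owns" the file are all assumptions.

```python
import os

def record_event(spool_dir, node, ip, status):
    """Append one 'ip<TAB>status' entry to the per-node file.

    Returns True if this call created the file (the caller would then
    sleep and later tally), False if the file already existed (the
    caller can simply exit).
    """
    path = os.path.join(spool_dir, node)   # filename = nodename
    line = f"{ip}\t{status}\n"
    try:
        # Mode "x" fails if the file already exists, so only the first
        # process to get here becomes the one that sleeps and tallies.
        with open(path, "x") as fh:
            fh.write(line)
        return True
    except FileExistsError:
        with open(path, "a") as fh:
            fh.write(line)
        return False
```

A plain "does the file exist?" test followed by a create would behave the same most of the time; exclusive create just avoids two near-simultaneous events both thinking they were first.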

After the three minutes... the program that created the file wakes up, reads
the file, tallies up all events, gets rid of any interfaces that have fully
transitioned (i.e. gone down and come back up), and sends out a single page
like...
Node XYZ.  5 Interfaces.  4 Down; 1 Up; 1st Down=132.233.1.201
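A rough sketch of the tally step, again in Python rather than Perl. It assumes that the last entry recorded for an interface is its current state and that anything other than "Normal" counts as down; the real script's handling of bounced interfaces may differ. The name `summarize` is hypothetical.

```python
def summarize(node, lines):
    """Build the one-line page text from the entries in a node file."""
    last_state = {}   # ip -> most recent status (dicts keep first-seen order)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ip, status = parts
        last_state[ip] = status

    # Assumption: "Normal" means up, any other status means down.
    down = [ip for ip, s in last_state.items() if s != "Normal"]
    up = [ip for ip, s in last_state.items() if s == "Normal"]

    text = (f"Node {node}.  {len(last_state)} Interfaces.  "
            f"{len(down)} Down; {len(up)} Up")
    if down:
        text += f"; 1st Down={down[0]}"
    return text
```

Fed the six-line example file above, this produces the same text as the sample page.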

E-mail is also sent out, and includes full interface descriptions, along
with sys-location, sys-description, and object-type  (e.g. isIPRouter).

The file is then deleted, and the process for that node will start over
again when the next interface-event comes in.

In addition, I mimic being an MLM... by forwarding MLM traps... so that I
can forward-status from one NetView system to another  (for each interface
event).

This logic takes care of:
        a)  Re-testing each interface.
        b)  Interface Bouncing.
        c)  Interface Summarization.
        d)  Status Propagation (forwarding) to other NetView systems.


What can I say... when the ruleset editor came out I tried and I tried...
but I just couldn't get everything I wanted out of it.

Perl works much better for me... especially since we've gone from NetView
AIX to NetView NT  (where the ruleset editor just isn't the same)!

Regards,

Gary Boyles, Intel



-----Original Message-----
From: Ray Schafer [mailto:schafer AT TKG DOT COM]
Sent: Tuesday, January 18, 2000 9:58 AM
To: NV-L AT UCSBVM.UCSB DOT EDU
Subject: Re: Sending Interface Down & Interface up pages and correlation


I have the same requirement, but with a little twist.  I need to make sure
that the interface down event is really down.  At one of our customer sites,
we are managing the network infrastructure - not a lot of devices, but all
have multiple interfaces.  A lot of the time when we get an interface down
event it is because the router is busy and doesn't always answer an ICMP
echo request in time.  So we have a lot of "false alarms".  We wanted to
send a page only when required.  We have the luxury of running on an AIX box
(did you know AIX 4.3.3 ships with perl now?) and can run snmp queries on
the routers.

So, we set up a ruleset that looks for a NetView interface down event from
devices in one of our collections.  When we get one we fire off an action
script.  The action script then can do a lot of things which can't be done
in a ruleset node.

The action script gets the NetView Object ID of the interface object (not
the router Object) from the 4th word of NVATTR_4, and its IP address from
the 1st word of NVATTR_4.

It then does a ping -c5 of the IP address.  If the ping succeeds (return
code is 0 if any of the 5 pings get through), we're done - no page.  However,
if the ping fails, we try to gather more info from the router about what it
thinks the state of the interface is.  We need the Interface Index from the
router, which we get by this snmp query:
"snmpget $NVA .1.3.6.1.2.1.4.20.1.2.$IP", where IP is the interface's IP
address.  A side note is that the routers have a software loopback interface
defined - an IP address that we are always able to get to provided that at
least one of its interfaces is up.  It is this IP address that is associated
with the router name in NetView - hence it will always be $NVA or $NVATTR_2
in the NetView interface down/up trap.
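The re-test and index lookup just described could be driven from a script along these lines. This is a sketch in Python, not our actual action script; the helper names are made up, and parsing of the snmpget reply is left out because the output format depends on which snmpget is installed.

```python
import subprocess

# ipAdEntIfIndex from MIB-II: maps an interface IP address to its ifIndex.
IP_AD_ENT_IF_INDEX = ".1.3.6.1.2.1.4.20.1.2"

def ping_cmd(ip, count=5):
    """Argument list for 'ping -c5 <ip>'."""
    return ["ping", f"-c{count}", ip]

def ifindex_cmd(router, ip):
    """Argument list for 'snmpget $NVA .1.3.6.1.2.1.4.20.1.2.$IP'."""
    return ["snmpget", router, f"{IP_AD_ENT_IF_INDEX}.{ip}"]

def ping_succeeds(ip, count=5):
    """True if ping exits 0, i.e. at least one of the pings got through."""
    result = subprocess.run(ping_cmd(ip, count),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Here `router` stands for the loopback address NetView knows the router by ($NVA), and `ip` for the address of the interface that was reported down.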

Once we have the Index for the interface we query the admin state
(.1.3.6.1.2.1.2.2.1.7.$INDEX) and the operational state
(.1.3.6.1.2.1.2.2.1.8.$INDEX) of the interface from the router (one query
can be done for both).  If our query times out for some reason - we page.
If it gets through, we look at the Administrative state - if it is
administratively down, we're done - no page.  If it is Admin UP and Oper UP,
we're done - no page.  Only if it is Admin UP but Oper DOWN do we page.
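Put together, the paging decision looks roughly like this. It is a Python sketch with a hypothetical function name; the numeric values are the standard MIB-II encodings for ifAdminStatus and ifOperStatus (1 = up, 2 = down), and `status` is assumed to be `None` when the SNMP query timed out.

```python
ADMIN_UP, ADMIN_DOWN = 1, 2   # ifAdminStatus (.1.3.6.1.2.1.2.2.1.7)
OPER_UP, OPER_DOWN = 1, 2     # ifOperStatus  (.1.3.6.1.2.1.2.2.1.8)

def should_page(ping_ok, status):
    """Decide whether to page for an interface-down event.

    ping_ok -- True if the ping -c5 re-test got any reply
    status  -- (admin, oper) tuple from the router, or None on timeout
    """
    if ping_ok:
        return False          # interface answers: false alarm
    if status is None:
        return True           # could not ask the router: page
    admin, oper = status
    # Only Admin UP with Oper DOWN is worth a page.
    return admin == ADMIN_UP and oper == OPER_DOWN
```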

Before we page, we update a database field for the Interface Object marking
it as being paged.  (This can be done by issuing another custom trap with
the Interface Selection Name as a parameter of the trap and another ruleset,
but we use a custom C program to do it right then and there).  Now instead
of exiting, the action script hangs around and periodically tries to ping
the interface.  If it succeeds it clears the Custom database field for the
object and sends an all clear page. (Again this can be done with a custom
trap and another ruleset, but we just use a custom C program to clear the
field from within the action script).
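The "hang around and keep pinging" part amounts to a loop like the one below. This is a sketch only: the 60-second interval is an assumption (the real interval is not stated here), and `ping` and `on_clear` are caller-supplied stand-ins for the ping test and for "clear the custom field and send the all clear page".

```python
import time

def wait_for_recovery(ip, ping, on_clear, interval=60,
                      sleep=time.sleep, max_checks=None):
    """Poll an interface until it answers, then signal the all clear.

    ping(ip)     -- returns True when the interface responds
    on_clear(ip) -- called once when the interface is back
    max_checks   -- optional cap on attempts; None means keep trying
    Returns True if the interface recovered, False if the cap was hit.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if ping(ip):
            on_clear(ip)
            return True
        checks += 1
        sleep(interval)
    return False
```

Passing `sleep` in as a parameter is only there so the loop can be exercised without really waiting.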

This approach is nice because you don't have to worry about looking for
Interface up and doing the query.  It gets around the problem of not being
able to update a field on the Interface Object (rather than the parent
object) when processing the Interface Down/Up trap.  Since the script knows
when it paged, it can send the all clear - no need to correlate anything.
The only reason we store anything in the Interface Object is so we can
resume monitoring in the event of a crash or forced exit of the action
script.  You can also create a collection that will show you outstanding
pages using this field.

One issue with this approach - or any approach that keeps the action script
around for a while - is that as a child of actionsrv, the script inherits
the file descriptors, including the socket that actionsrv has open.  If
actionsrv is stopped while a script is executing, you can't start it until
the script has exited (socket in use error).  Since the monitoring info is
saved in a custom field, we can easily resume these monitors when we need to
kill the scripts.


Leslie Clark wrote:

> I think I understand what Patrick is looking for, since I have just
> started to look at the same question. If a down event comes in, and no up
> event within the specified time, you want to send a page (for instance).
> That is the part everyone seems to agree on.
> A little later, the up event does come in, and you want to send the
> all-clear page. But only if the down page was sent in the first place. It
> seems like it ought to work, but I worry about the long caching.  What do
> you think about that, James?
>
> This is how I understand Steve's suggestion:
>
> Node down is input 1 for reset-on-match (5 min)
> Node up is input 2 for same.
> Outputs of the reset-on-match  go to:
>     1) Send the down page
>     2) and also input 1 for a pass-on-match (long time)
> The same Node up is also input 2 for the pass on match
> Output for the pass-on-match is send the up page. The trap
> info available would be from the down event, not the up event,
> but you would know that and could act accordingly.
>
> Patrick, I vote that you verify this for us. Steve, is this something that
> you are actually running?
>
> By the way, my current customer tells me that there are real dollars to be
> saved by preventing unnecessary pages...
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
> Detroit
>
> ---------------------- Forwarded by Leslie Clark/Southfield/IBM on
> 01/18/2000 01:40 AM ---------------------------
>
> James Shanks <James_Shanks AT TIVOLI DOT COM>@UCSBVM.UCSB.EDU> on 01/17/2000
> 08:40:00 PM
>
> Please respond to Discussion of IBM NetView and POLYCENTER Manager on
>       NetView <NV-L AT UCSBVM.UCSB DOT EDU>
>
> Sent by:  Discussion of IBM NetView and POLYCENTER Manager on NetView
>       <NV-L AT UCSBVM.UCSB DOT EDU>
>
> To:   NV-L AT UCSBVM.UCSB DOT EDU
> cc:
> Subject:  Re: Sending Interface Down & Interface up pages and correlation
>
> Patrick -
>
> I am not certain that I understand what your second case is for, and Steve
> Francis has given you a suggestion which may work, in any case, but I
> thought I would comment on your questions.
>
> You guessed correctly about how the Reset-on-Match and Pass-on-Match
> functions work with incoming events.  I tried to clarify that in my second
> append last week.  Only events of the type connected to Slot 1 are held in
> cache.  The Slot 2 event is used to evaluate the events in the cache as
> soon as it is received.  The Slot 2 events are not cached at all, and once
> used, they are discarded unless you added additional processing for them,
> which is why your ruleset doesn't handle your second case.  If no matches
> are received during the time interval, the cache is flushed, and the
> appropriate action taken for the Slot 1 event -- for Reset, it is passed
> along to the next ruleset node, for Pass, it is dropped.
>
> James Shanks
> Tivoli (NetView for UNIX) L3 Support
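As a way of checking one's understanding of the slot behaviour James describes, here is a toy model in Python. It is not NetView code: the class, its methods, and the idea of matching on a simple `key` (for example the interface address) are all made up for illustration, and real ruleset nodes match on configured attribute comparisons.

```python
class MatchNode:
    """Toy two-slot correlation node ("reset" or "pass" on match)."""

    def __init__(self, mode, interval):
        if mode not in ("reset", "pass"):
            raise ValueError("mode must be 'reset' or 'pass'")
        self.mode = mode
        self.interval = interval      # hold time, in seconds
        self.cache = {}               # key -> (Slot 1 event, arrival time)

    def expire(self, now):
        """Flush Slot 1 events held for the full interval without a match."""
        released = []
        for key, (event, arrived) in list(self.cache.items()):
            if now - arrived >= self.interval:
                del self.cache[key]
                if self.mode == "reset":
                    released.append(event)   # Reset: passed to the next node
                # Pass: an unmatched event is simply dropped
        return released

    def slot1(self, key, event, now):
        """Only Slot 1 events are held in the cache."""
        released = self.expire(now)
        self.cache[key] = (event, now)
        return released

    def slot2(self, key, now):
        """A Slot 2 event is checked against the cache, never cached itself."""
        released = self.expire(now)
        held = self.cache.pop(key, None)
        if held is not None and self.mode == "pass":
            released.append(held[0])  # Pass: the held Slot 1 event goes on
        return released               # Reset: the held event is discarded
```

In this model, a down event followed by an up event inside the interval produces nothing from a "reset" node, while a down event with no up event is released once the interval has elapsed. Note that what a "pass" node releases is the held down event, not the up event, which is the point Leslie makes about the trap info available.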

--
Ray Schafer                   | schafer AT tkg DOT com
The Kernel Group              | Distributed Systems Management
http://www.tkg.com