Re: Sending Interface Down & Interface up pages and correlation

I have the same requirement, but with a little twist.  I need to make sure
that an interface reported down really is down.  At one of our customer sites,
we are managing the network infrastructure - not a lot of devices, but all
have multiple interfaces.  A lot of the time when we get an interface down
event it is because the router is busy and doesn't always answer an ICMP echo
request in time.  So we have a lot of "false alarms".   We wanted to send a
page only when required.  We have the luxury of running on an AIX box (did you
know AIX 4.3.3 ships with perl now?) and can run snmp queries on the routers.

So, we set up a ruleset that looks for a NetView interface down event from
devices in one of our collections.  When we get one we fire off an action
script.  The action script then can do a lot of things which can't be done in
a ruleset node.

The action script gets the NetView Object ID of the interface object (not the
router Object) from the 4th word of NVATTR_4, and its IP address from the 1st
word of NVATTR_4.
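
As a rough illustration of that word-splitting step (a Python sketch, not our
actual action script), the two fields could be pulled out as below.  The
function name and the sample string are made up for the example; the only
things taken from the description above are "1st word = IP address" and
"4th word = interface Object ID".

```python
def parse_nvattr4(nvattr4):
    """Return (ip_address, interface_object_id) from the NVATTR_4 text.

    Assumes whitespace-separated words, with the IP address first and the
    interface Object ID fourth; the other words are ignored.
    """
    words = nvattr4.split()
    if len(words) < 4:
        raise ValueError("expected at least 4 words in NVATTR_4: %r" % nvattr4)
    return words[0], words[3]


if __name__ == "__main__":
    # Hypothetical value: only the positions of words 1 and 4 matter here.
    ip, object_id = parse_nvattr4("192.168.5.1 word2 word3 1234")
    print(ip, object_id)
```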

It then does a ping -c5 of the IP address.  If the ping succeeds (return code
is 0 if any of the 5 pings get through), we're done - no page.  However if the
ping fails, we try to gather more info from the router about what it thinks
the state of the interface is.  We need the Interface Index from the router
which we get by this snmp query:  "snmpget $NVA .1.3.6.1.2.1.4.20.1.2.$IP"
where IP is the interface's IP address.  A side note is that the routers have
a software loopback interface defined - an IP address that we are always able
to get to provided that at least one of its interfaces is up. It is this IP
address that is associated with the router name in NetView - hence it will
always be $NVA or $NVATTR_2 in the NetView interface down/up trap.
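
To make the index lookup concrete, here is a small Python sketch that only
builds the snmpget argument list described above.  The helper name is
hypothetical, and parsing the reply is left out on purpose because the exact
output format of snmpget is not shown in this note.

```python
import ipaddress

# OID prefix from the query above; the interface IP is appended as the index.
IP_AD_ENT_IF_INDEX = ".1.3.6.1.2.1.4.20.1.2"


def ifindex_query(router, interface_ip):
    """Build the snmpget command that asks `router` (the loopback address,
    $NVA) for the Interface Index of `interface_ip`."""
    # Raises ValueError if interface_ip is not a dotted-quad IPv4 address.
    ipaddress.IPv4Address(interface_ip)
    return ["snmpget", router, "%s.%s" % (IP_AD_ENT_IF_INDEX, interface_ip)]


if __name__ == "__main__":
    print(" ".join(ifindex_query("10.0.0.1", "192.168.5.1")))
```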

Once we have the Index for the interface we query the admin state
(.1.3.6.1.2.1.2.2.1.7.$INDEX) and the operational state
(.1.3.6.1.2.1.2.2.1.8.$INDEX) of the interface from the router (one query can
be done for both).  If our query times out for some reason, we page.  If it
gets through, we look at the Administrative state - if it is administratively
down, we're done - no page.  If it is Admin UP and Oper UP, we're done - no
page.  Only if it is Admin UP but Oper DOWN do we page.
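
The decision rules above boil down to a very small function.  This is a Python
sketch rather than the script we run; the names are made up, None stands in
for "the query timed out", and 1/2 are the standard up/down values for the
admin and oper status objects.  States other than up and down (testing, for
example) are not covered above, so the sketch simply does not page for them.

```python
IF_ADMIN_STATUS = ".1.3.6.1.2.1.2.2.1.7"
IF_OPER_STATUS = ".1.3.6.1.2.1.2.2.1.8"
UP, DOWN = 1, 2  # standard up(1) / down(2) status values


def status_oids(index):
    """Both OIDs for one interface, so a single query can fetch them."""
    return ["%s.%s" % (IF_ADMIN_STATUS, index),
            "%s.%s" % (IF_OPER_STATUS, index)]


def should_page(admin, oper):
    """Apply the paging rules; pass None for both if the query timed out."""
    if admin is None or oper is None:
        return True                      # no answer from the router: page
    return admin == UP and oper == DOWN  # admin down, or up/up: no page


if __name__ == "__main__":
    print(status_oids(3))
    print(should_page(UP, DOWN))
```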

Before we page, we update a database field for the Interface Object marking it
as being paged.  (This can be done by issuing another custom trap with the
Interface Selection Name as a parameter of the trap and another ruleset, but
we use a custom C program to do it right then and there).  Now instead of
exiting, the action script hangs around and periodically tries to ping the
interface.  If it succeeds it clears the Custom database field for the object
and sends an all clear page. (Again this can be done with a custom trap and
another ruleset, but we just use a custom C program to clear the field from
within the action script).
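
The "hang around and keep pinging" part could look roughly like this in
Python.  It is only a sketch under assumptions: clear_flag and send_page are
placeholders for our custom C program and the paging command, and the
60-second retry interval is invented for the example (the note above does not
say how often we retry).

```python
import subprocess
import time


def ping_ok(ip, count=5):
    """True if `ping -c <count>` exits 0, i.e. at least one reply came back."""
    result = subprocess.run(["ping", "-c", str(count), ip],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def watch_until_clear(ip, clear_flag, send_page,
                      ping=ping_ok, interval=60.0, sleep=time.sleep):
    """Retry the ping until it succeeds, then clear the field and page.

    Returns the number of failed attempts seen before the interface answered.
    """
    failures = 0
    while not ping(ip):
        failures += 1
        sleep(interval)
    clear_flag()                                  # reset the custom field
    send_page("ALL CLEAR: %s answers ping again" % ip)
    return failures
```

Passing ping and sleep in as arguments keeps the loop easy to try out without
touching the network or waiting on a real timer.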

This approach is nice because you don't have to worry about looking for
Interface up and doing the query.  It gets around the problem of not being
able to update a field on the Interface Object (rather than the parent object)
when processing the Interface Down/Up trap.  Since the script knows when it
paged, it can send the all clear - no need to correlate anything.  The only
reason we store anything in the Interface Object is so we can resume
monitoring in the event of a crash or forced exit of the action script.  You
can also create a collection that will show you outstanding pages using this
field.

One issue with this approach - or any approach that keeps the action script
around for a while - is that as a child of actionsrv, the script inherits the
file descriptors - including the socket that actionsrv has open.  If actionsrv
is stopped while a script is executing, you can't start it until the script
has exited (socket in use error).  Since the monitoring info is saved in a
custom field, we can easily resume these monitors when we need to kill the
scripts.


Leslie Clark wrote:

> I think I understand what Patrick is looking for, since I have just started
> to look at the same question. If a down event comes in, and no up event
> within the specified time, you want to send a page (for instance).  That is
> the part everyone seems to agree on.
> A little later, the up event does come in, and you want to send the
> all-clear page.  But only if the down page was sent in the first place. It
> seems like it ought to work, but I worry about the long caching.  What do
> you think about that, James?
>
> This is how I understand Steve's suggestion:
>
> Node down is input 1 for reset-on-match (5 min)
> Node up is input 2 for same.
> Outputs of the reset-on-match  go to:
>     1) Send the down page
>     2) and also input 1 for a pass-on-match (long time)
> The same Node up is also input 2 for the pass on match
> Output for the pass-on-match is send the up page. The trap
> info available would be from the down event, not the up event,
> but you would know that and could act accordingly.
>
> Patrick, I vote that you verify this for us. Steve, is this something that
> you are actually running?
>
> By the way, my current customer tells me that there are real dollars to be
> saved by preventing unnecessary pages...
>
> Cordially,
>
> Leslie A. Clark
> IBM Global Services - Systems Mgmt & Networking
> Detroit
>
> ---------------------- Forwarded by Leslie Clark/Southfield/IBM on
> 01/18/2000 01:40 AM ---------------------------
>
> James Shanks <James_Shanks AT TIVOLI DOT COM>@UCSBVM.UCSB.EDU> on 01/17/2000
> 08:40:00 PM
>
> Please respond to Discussion of IBM NetView and POLYCENTER Manager on
>       NetView <NV-L AT UCSBVM.UCSB DOT EDU>
>
> Sent by:  Discussion of IBM NetView and POLYCENTER Manager on NetView
>       <NV-L AT UCSBVM.UCSB DOT EDU>
>
> To:   NV-L AT UCSBVM.UCSB DOT EDU
> cc:
> Subject:  Re: Sending Interface Down & Interface up pages and correlation
>
> Patrick -
>
> I am not certain that I understand what your second case is for, and Steve
> Francis has given you a suggestion which may work, in any case, but I
> thought I would comment on your questions.
>
> You guessed correctly about how the Reset-on-Match and Pass-on-Match
> functions work with incoming events.  I tried to clarify that in my second
> append last week.  Only events of the type connected to Slot 1 are held in
> cache.  The Slot 2 event is used to evaluate the events in the cache as soon
> as it is received.  The Slot 2 events are not cached at all, and once used,
> they are discarded unless you added additional processing for them, which is
> why your ruleset doesn't handle your second case.  If no matches are
> received during the time interval, the cache is flushed, and the appropriate
> action taken for the Slot 1 event -- for Reset, it is passed along to the
> next ruleset node, for Pass, it is dropped.
>
> James Shanks
> Tivoli (NetView for UNIX) L3 Support

--
Ray Schafer                   | schafer AT tkg DOT com
The Kernel Group              | Distributed Systems Management
http://www.tkg.com