Bill,
This may or may not be related, but we had a problem here with an
earlier version of the net-SNMP code. Are you at
net-snmp-5.0.9-2-2.3E.6 or later on the NetView machine? (See the
Release Notes for FP2.)
It sounds like you may have a time-out of SNMP communication to devices
from the NetView SLES machine. The time-out could be caused by either
NetView or the device.
You may want to turn on netmon tracing with "netmon -M 63" and run
"tail -f /usr/OV/log/netmon.trace" to determine if netmon is still
polling the devices in question. You may want to try this both before
the problem happens and while it is occurring, to see if you can
determine why the time-outs are occurring.
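
A full trace can be noisy, so it may help to narrow it to one router. The
following is only a sketch: the trace path and the "netmon -M 63" command
are taken from the paragraph above, while TRACE and trace_for are
hypothetical names, and it assumes the device's host name or IP address
appears literally in the trace lines.

```shell
#!/bin/sh
# Sketch only: the trace path comes from the message above; TRACE and
# trace_for are hypothetical names introduced for this example.
TRACE=${TRACE:-/usr/OV/log/netmon.trace}

# Print the trace lines that mention one device.
# Assumes the device's host name or IP address appears literally in the trace.
trace_for() {
    grep -F -- "$1" "$TRACE"
}

# Typical use on the NetView machine (not run here):
#   netmon -M 63                           # turn on netmon tracing
#   tail -f "$TRACE" | grep -F 10.1.2.3    # follow one device live
#   trace_for 10.1.2.3 | tail -n 20        # last 20 entries for that device
```

Note that a fixed-string match on 10.1.2.3 would also match 10.1.2.30, so
check the hits by eye. Comparing the last entries for a router before and
after it goes critical should show whether netmon has stopped polling it or
is polling and getting no answer.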
The difference between QuickTest and Demand Poll is that QuickTest only
goes after interface status, whereas Demand Poll goes after a large set
of data.
Check that the IP and Interface tables from the device are responding
fully: snmpwalk the device for the IP and interface tables only, and
make sure they complete.
Also be aware that net-snmp provides an snmpwalk command that is
different from NetView's. NetView will use the one in /usr/OV/bin.
Which did you use? I have found that by using both, I can sometimes
locate a problem that one or the other would not find.
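
As a rough sketch of the table check using the net-snmp snmpwalk: the
function below walks only the IP address and interface tables and reports
whether each walk finished. SNMPWALK and check_tables are hypothetical
names, and SNMP v2c, the 10-second timeout, and the single retry are
assumptions to adjust for your routers. The options shown are net-snmp's;
the snmpwalk in /usr/OV/bin takes its own options, which are not shown here.

```shell
#!/bin/sh
# Sketch only. SNMPWALK and check_tables are hypothetical names; SNMP v2c,
# the 10 s timeout, and the single retry are assumptions to adjust.
SNMPWALK=${SNMPWALK:-snmpwalk}

# Walk only the IP address and interface tables; report whether each walk
# finished (judged by the command's exit status).
check_tables() {
    host=$1
    community=$2
    for table in IP-MIB::ipAddrTable IF-MIB::ifTable; do
        if "$SNMPWALK" -v 2c -c "$community" -t 10 -r 1 "$host" "$table" >/dev/null 2>&1; then
            echo "$table: completed"
        else
            echo "$table: failed or timed out"
        fi
    done
}

# Example (not run here): check_tables wan-router-1 public
```

Running "command -v snmpwalk" first shows which snmpwalk the shell finds on
its PATH, which answers the "which did you use" question before you compare
results from the two.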
Also, at the time of the pause in the Demand Poll, what is the state of
the device? High CPU usage?
Hope this helps.
Mark F Sklenarik IBM SWG
Tivoli Solutions Business Impact Management and Event Correlation
Software Quality Engineer IBM Corporation
"Evans, Bill"
<Bill.Evans AT hq.doe DOT gov>
Sent by: owner-nv-l AT lists.us.ibm DOT com
03/16/2005 10:49 AM
To: nv-l AT lists.us.ibm DOT com
cc:
Subject: [nv-l] Problems with SNMP monitoring

I'm having a problem with the migration of NetView to a new machine.
This is a new SUSE SLES 9 installation of NetView 7.1.4 FP 2 on a Dell
1750 with manual transfer of seed, community strings, hosts,
location.conf and other configuration data. We are in a "test" mode. It
is using net-SNMP. Our old system is a SUN with NV 7.1.3 and current
fixpacks. It uses the SUN SNMP. We staged the bring-up of the new
machine to verify its capacity and clean up the messy existing
configuration. Our first pass was to bring across the routers, then the
switches, then the servers we monitor and finally any local extensions.
We're there with the full NetView device load.
The area which is giving us problems is the SNMP management of Routers.
This includes 15 core network routers, 15 MAN routers and 37 Wide Area
Network routers. Core Routers are Cisco 6000 and 7000 models. WAN
routers are Cisco 3800 series. MAN routers are all over the place from
Cisco 2500 through 7500 models.
The OLD machine is giving us fits with what appears to be dropped SNMP
responses. The particular ones giving trouble are the WAN devices,
although the loss of responses also hits the core routers on occasion.
It would appear that the SUN SNMP subsystem is swallowing some responses
(randomly, but tending toward the last ones received for the devices
affected). This began after we added a hundred or so HSRP interfaces to
our core configuration. These false alarms upset our management team and
we're trying to address it by moving to a new box.
The new box works well (most of the time) for these devices. When it is
working it gives a reliable view of the state of the WAN routers. The
"lost responses" are not a problem on the new machine. Occasionally
(about every 32 hours for the past couple of days) a portion of the WAN,
if not all of it, goes critical with SNMP polling timeouts. When it
happens, all the affected routers fail at the same time. Until reset
manually they will not recover. One or more core routers may also be
hit.
· PING will work to the devices on either loopback or active port
  address but the device state will return to Critical because the next
  SNMP poll will fail.
· SNMP polling is in use because the router configuration has a delay
  defined on one port (backup circuit) which prevents successful ICMP
  polling.
· QuickTest and QuickTest Critical will NOT work after the initial
  failure. The result is an SNMP timeout.
· Demand Poll will work. This resets whatever is ailing and all works
  well for another day.
· During the Demand Poll there is often a significant pause (up to one
  minute) after we see the "Get CDP Cache entry" line and sometimes
  another when we see the "Get MPLS MIB" line.
· The other machine is having no problems with its SNMP polling except
  for the continuing false alarms.
As you can guess, this 32 hour cycle slows debugging. A couple of days
ago I did an SNMP Walk on the devices but I'm not sure if it worked or
didn't. Next time I get a failure I plan to dig into that issue.
Meanwhile I haven't been able to find anything in the archives or in the
knowledge base which appears to be similar.
I don't feel I have enough to go on to open an incident yet and hope the
"communal wisdom" may point me in the right direction.
My current hypothesis:
· The problem has to be in the NetView at the new machine.
Suggestions and comments are solicited.
Bill Evans