RE: [nv-l] nvtecia still hanging or falling behind processing TEC_ITS.rs
2004-09-17 17:22:58
Drew,
Performance problems are notoriously
difficult to diagnose, especially remotely. Remember too that the
benchmarks you are thinking of are for optimally configured systems running
in the lab, not real-world results. But here are a couple of points
you might investigate.
(1) What you see in trapd.log is not
necessarily what is coming in. It's what trapd processed and logged.
Logging is the last thing trapd does with the trap, after he's processed
it in every other way. It does record that a particular trap was
received and processed at a particular time, but that's about all. So
seeing Cisco traps in trapd.log 5 seconds apart means that's how fast trapd
is processing them, not how fast they are arriving. What might you
not see in the log? Any traps configured to "Don't Log or Display"
in xnmtrap. That action sets the trap category in trapd.conf to "Ignore".
So you could go to /usr/OV/conf/C (don't forget the "C"
here) and do "grep Ignore trapd.conf" and see whether you have
any of those. If you do, then you are not seeing those in the log.
For diagnostic purposes you should alter those entries to "Log
Only" so you can get a better idea of the work trapd is actually doing.
(2) To get closer to what is coming
in, you could turn on the trapd.trace. You'll see a message about
each trap being received from address so-and-so every time one is
pulled off the queue for processing. If you want to see the contents
of those incoming traps, then you also need to have trapd running with
the -x option to hex dump incoming packets. Now, I said "closer"
to what is coming in, because obviously trapd cannot trace a trap
until he has started to read it. When won't he read it immediately? When
there is no break between incoming traps. If traps arrive too
quickly, rather than pull them off one at a time and process them, trapd
queues them so that he doesn't lose any. He won't start processing
them again until there's a break in the incoming flow. In that case
you should see a bunch of "trap queued" messages but no intervening processing
in the trace. I suspect that this is really what's going on. You
get a big burst of traps, so all trap processing slows while we queue them,
and then once the burst subsides, processing starts up again. But
now the bottleneck is going to be in nvcorrd and nvserverd, who have been
idle for a while, and now have a lot to do. It's like a snake swallowing
an egg; you see a big lump moving along until it is totally digested.
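From memory, turning that trace on and off looks roughly like the following;
the -T toggle and the lrf re-registration may differ at your level, so verify
against serversetup before relying on it:

    /usr/OV/bin/trapd -T              # toggle tracing on the running trapd;
                                      # output goes to /usr/OV/log/trapd.trace
    # The -x hex dump has to be on trapd's own command line. One way is to
    # add -x to the trapd entry in /usr/OV/lrf/trapd.lrf and then:
    ovaddobj /usr/OV/lrf/trapd.lrf    # re-register the daemon with the new flag
    ovstop trapd
    ovstart trapd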
You have to turn on the nvcorrd trace (nvcdebug -d all) to see what
nvcorrd's doing, and one benefit of that is that you can see how long it
takes him to process just one trap, given the rulesets and event windows
you have going at the time. Look for the eye-catchers "Received
a trap" and "Finished with the trap". From the one
to the other is the transit time through nvcorrd. Not much you can
do if you don't like it, other than to reduce the load.
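A rough way to eyeball those transit times, assuming the trace lands in the
usual /usr/OV/log/nvcorrd.alog (that file name is an assumption -- check
where your nvcdebug output actually goes):

    nvcdebug -d all      # turn on full nvcorrd tracing
    # pull out the entry/exit eye-catchers, then compare the timestamps
    # on each "Received"/"Finished" pair by hand:
    grep -e "Received a trap" -e "Finished with the trap" /usr/OV/log/nvcorrd.alog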
(3) Obviously, if you want to assess
what the real incoming trap rate is, you need an outside analysis tool,
such as an iptrace on port 162. Then you can run ipreport against the
data and see. Those are AIX commands, by the way -- there are similar tools
on Solaris and Linux but I haven't used them much.
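Roughly, on AIX (flags from memory -- check the man pages), with a Linux
tcpdump equivalent for comparison:

    # AIX: capture UDP port 162 (SNMP traps) to a raw file, then format it
    iptrace -a -P udp -p 162 /tmp/traps.raw    # runs in the background
    # ...let it collect for a while, then kill the iptrace process...
    ipreport -ns /tmp/traps.raw > /tmp/traps.txt

    # Linux: something like
    tcpdump -i eth0 -w /tmp/traps.pcap udp port 162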
(4) If you cannot reduce the incoming
rates to keep processing from being overloaded, then you might consider
installing an MLM and using it as a trap filter, tossing out duplicates
and only passing on to trapd what you really want to see.
HTH
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Van Order, Drew \(US
- Hermitage\)" <dvanorder AT deloitte DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
09/17/2004 10:12 AM
|
To
| <nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia still
hanging or falling behind processing TEC_ITS.rs |
|
James,
We finally had missed heartbeats
to track. We can see the heartbeat trap in trapd.log, but no corresponding
entry in nvserverd. This appears to confirm the holdup is on the NV side,
and again, we had an increase in Cisco traps (one every 5 seconds for about
2 hours prior to missing the first heartbeat), but nothing near NV's limit.
Trapd.log shows it is starting to fall behind as well during this period--as
an example, the missed heartbeat TEC event for 6 PM last night did not
show in trapd until 6:48 PM. The 7 PM heartbeat shows in trapd at 7:21
PM and is in nvserverd at 7:45, so it had almost caught up by then.
So the TEC adapter never stopped,
but we've got to figure out why trapd and the processes in between seem
to stumble under load -- and not even a heavy one. We know Cisco devices can send
some traps at rates faster than one per second. Is it possible devices
are machine-gunning traps even though NV shows one every 5 seconds or so?
That's the only thing I can think of that could set trapd behind based
on what we are seeing.
Thanks everyone--Drew
-----Original Message-----
From: owner-nv-l AT lists.us.ibm DOT com [mailto:owner-nv-l AT lists.us.ibm DOT com]
On Behalf Of James Shanks
Sent: Thursday, September 16, 2004 3:40 PM
To: nv-l AT lists.us.ibm DOT com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing
TEC_ITS.rs
So what's different? Is your wpostemsg to @EventServer like your
tecint.conf file?
We are back to this being a TEC issue and not a NetView one. So unless
you want to open a problem to TEC support, you'll have to do some more
detective work yourself.
If both the wpostemsg and the tecint.conf have @EventServer, then I don't
know what to tell you. If not, then reconfigure your tecint.conf
using serversetup to use the non-TME method (which requires that a different
daemon be started than when you use the TME method). For non-TME
forwarding, /usr/OV/bin/nvserverd is started. For TME forwarding,
it is /usr/OV/bin/spmsur, who then starts /usr/OV/bin/tme_nvserverd. Switching
from one to the other requires that you go through serversetup, which
will reconfigure this automatically, or that you manually alter
the /usr/OV/conf/ovsuf file to start the correct daemons. But note
that when you go through serversetup, your special customization to the
nvserverd entries is lost.
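A quick way to see which forwarding daemon is actually registered to start
(ovsuf is plain text, so grep is enough):

    grep -i nvserverd /usr/OV/conf/ovsuf    # look for nvserverd vs. tme_nvserverd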
The fact that events are going to the cache means that nvserverd got the
event, formatted it, did his tec_put_event( ) and all went fine, but then
TEC library code, in trying to send to the TEC server, found that it could
not, that it had lost connection to the TEC server, for some reason known
only to those internal routines. And without a diag (as in "diagnosis")
file configured there so that the internal TEC library code will trace
itself, no one can tell you what it's doing or why. And you have
to get that diag file, called ".ed_diag_config", from TEC Support,
and they are the ones who have to look at the traces. No one on the
NetView side can assist at this point.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Edwards, JT - ESM"
<JEdwards3 AT wm DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
09/16/2004 04:00 PM
|
To
| "'nv-l AT lists.us.ibm DOT com'"
<nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia still
hanging or falling behind processing TEC
_ITS.rs |
|
Yes it does.
-----Original Message-----
From: owner-nv-l AT lists.us.ibm DOT com [mailto:owner-nv-l AT lists.us.ibm DOT com] On
Behalf Of James Shanks
Sent: Thursday, September 16, 2004 2:32 PM
To: nv-l AT lists.us.ibm DOT com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing
TEC_ITS.rs
Wpostemsg does not go through the internal adapter. Does that get
to the TEC server?
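(A hand-posted test looks roughly like this; the class, source, and slot
below are only placeholders -- substitute ones your TEC rulebase actually
knows:)

    wpostemsg -r HARMLESS -m "NetView adapter path test" \
        hostname=`hostname` TEC_ITS_BASE NetView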
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Edwards, JT - ESM"
<JEdwards3 AT wm DOT com>
Sent by: owner-nv-l AT lists.us.ibm DOT com
09/16/2004 03:17 PM
|
To
| "'nv-l AT lists.us.ibm DOT com'"
<nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| RE: [nv-l] nvtecia still
hanging or falling behind processing TEC
_ITS.rs |
|
Well, at this point we are now getting events caching. From there, what
can we do?
A wpostemsg does not clear the cache.
-----Original Message-----
From: owner-nv-l AT lists.us.ibm DOT com [mailto:owner-nv-l AT lists.us.ibm DOT com] On
Behalf Of James Shanks
Sent: Wednesday, September 15, 2004 10:16 PM
To: nv-l AT lists.us.ibm DOT com
Subject: RE: [nv-l] nvtecia still hanging or falling behind processing
TEC_ITS.rs
No. The errno 827 indicates that there is a problem initializing
the JVM -- Java Virtual Machine. In almost every case I have seen this
indicates that the nvserverd daemon does not have the correct library path
for Java or the ZCE_CLASSPATH variable is not set. Since it is only set
in /etc/netnmrc, if you ovstop all the daemons and restart them with just
ovstart, you will lose it. So Mike is right. The usual fix is to ovstop
nvsecd and then restart with /etc/netnmrc (/etc/init.d/netnmrc on Solaris
or Linux). This issue has been fixed in the upcoming FixPack 2 (FP02) by
updating the NVenvironment script so that if you run it before you do
ovstart, it will source the correct environment for you, and then the daemons
will inherit it when you do the ovstart.
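In shell terms, the usual fix amounts to this (AIX path shown; the
NVenvironment location is an assumption -- adjust to wherever your FixPack
installs it):

    ovstop nvsecd                  # takes down nvsecd and its dependents
    /etc/netnmrc                   # restart with ZCE_CLASSPATH etc. in place
                                   # (/etc/init.d/netnmrc on Solaris or Linux)

    # at FP02 and later, the alternative described above:
    . /usr/OV/bin/NVenvironment    # source the environment first (path assumed)
    ovstart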
But I still don't know why you are not getting an nvserverd.log which shows
the same tec_create_handle failure that you see in the formatted nettl.
We do get that here.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group