Thank you for replying. I just finished researching the tracing capability you mentioned, James--turning it on now to see whether the events are getting that far or not. I hope it doesn't impact performance much, because it seems valuable to keep on 100% of the time, much like logging TEC events. This alone should tell us where the holdup is and cut troubleshooting in half.
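For anyone following along, this is what we're enabling--the entry name is the one James gives below; the file location is my assumption about where it lives on a standard UNIX install:

    # In the NetView-to-TEC integration conf file (path is an
    # assumption, commonly /usr/OV/conf/tecint.conf on UNIX):
    NvserverdTraceTecEvents=YES
    # Trace output then shows up in nvserverd.log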
We have two thresholds for reacting to a late hourly HB. If it shows up before the next hourly heartbeat is due, we deem it OK (a slowdown), but the Ops folks know they have to rely more on the Event Browser until things sync back up. If the HB still has not appeared after the full hour, we are pretty sure the integration is hung, my team is paged, and we cycle the daemons. We can't wait any longer than that because by then events have likely been missed.
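Roughly, the paging logic amounts to the following (a minimal sketch in Python; the state file and the paging hook are hypothetical placeholders, not our actual tooling):

    #!/usr/bin/env python
    # Sketch of the two-threshold heartbeat check described above.
    import os, time

    HB_INTERVAL = 3600                     # heartbeat trap sent hourly via cron
    LAST_HB_FILE = "/var/tmp/last_tec_hb"  # hypothetical: touched when HB seen at TEC

    def page_oncall(msg):
        print(msg)  # placeholder for the real paging integration

    def check_heartbeat():
        age = time.time() - os.path.getmtime(LAST_HB_FILE)
        if age < HB_INTERVAL:
            return "OK"
        if age < 2 * HB_INTERVAL:
            # Late, but the next hourly HB isn't missed yet: treat as
            # a slowdown; Ops leans on the Event Browser meanwhile.
            return "SLOWDOWN"
        # Second HB in a row missed: assume the integration is hung,
        # page the team, and cycle the daemons.
        page_oncall("NV-to-TEC heartbeat missed twice; cycling daemons")
        return "HUNG"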
The trap surges aren't anywhere near the sustained 6-8 traps/sec that NV is rated for. What we usually see are surges of traps over an hour, each trap within 4-5 seconds of the last--usually when an ATM VC is flapping between up and down. We think this should be easily handled, especially based on the TEC_ITS load testing we performed a few months ago. That testing showed we could handle the following:
Test Two--trapd and TEC_ITS flood.
Success criteria: no degradation in console or command-line performance during the test; no loss of TEC events; and no more than a 2-minute delay before events are seen at the TEC console.
- 250 traps within 5 seconds, every 10 minutes, for 2 hours--pass
- 1,000 traps within 20 seconds, every 10 minutes, for 2 hours--trapd passed, TEC failed. No events were lost, but they were queued for up to 30 minutes while netview.rls performed correlation
- 300 traps within 5 seconds, every 5 minutes, for 24 hours--pass (close to practical max)
- [Application Maximum] 7 traps per second for 2 hours--halted, with TEC queuing 2,000 events within 10 minutes. No events lost
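For anyone wanting to reproduce a flood along these lines, a small driver can approximate it. This is a sketch only--the target host, community string, and trap OID are placeholders, and it shells out to net-snmp's snmptrap rather than whatever harness we actually used:

    #!/usr/bin/env python
    # Sketch of a trap-flood driver like the tests above.
    # Assumes net-snmp's snmptrap command is on PATH.
    import subprocess, time

    HOST = "nvserver.example.com"      # placeholder target
    COMMUNITY = "public"               # placeholder community
    TRAP_OID = "1.3.6.1.6.3.1.1.5.3"   # linkDown, for illustration

    def send_burst(count):
        for _ in range(count):
            subprocess.call([
                "snmptrap", "-v", "2c", "-c", COMMUNITY,
                HOST, "", TRAP_OID,
            ])

    # e.g. the first test: 250 traps in a burst, every
    # 10 minutes, for 2 hours (12 rounds)
    for _ in range(12):
        send_burst(250)
        time.sleep(600)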
Our stress testing used Interface Down traps, so it was much more strenuous. Cisco traps don't even reach netview.rls correlation; they just use TEC_ITS to get to TEC. It's bizarre that we could hammer it for hours with internal NV traps, yet a smaller volume of external Cisco traps seems to choke it. The NV ruleset node used for Cisco traps appears fine when you look at it.
Thanks again--Drew
I'm not aware of anyone else reporting a similar problem. Historically, however, the adapter has always been load sensitive. But let's clarify the issue a bit, shall we? Are you saying that the adapter slows down, or that it hangs? Does the heartbeat event get there eventually? How slow is it? Do things ever recover without your taking everything down, or not? How long does that take? How big is this trap surge you are talking about?
There is no simple way to diagnose this issue, because there is the ZCE engine in the middle, and because nvserverd has no idea what is going on after it does tec_put_event. As far as NetView is concerned, once that call occurs, the event has been sent. Whether it gets to the server or not is the responsibility of the code in the TEC EEIF library. You can use the conf file entry NvserverdTraceTecEvents=YES, or the corresponding environment variable, to get an nvserverd.log and see whether nvserverd has handed the event to the adapter in a timely fashion. Then you would have to check the adapter's cache file, by default /etc/Tivoli/tec/cache, and see whether it is caching events. It will do that if communications with the server hiccup, but it should recover from that automatically; when communication is lost, it tries again on every subsequent event. If the cache isn't growing, and nvserverd has logged the event, then the problem is internal to the TEC code. To go deeper, you'd have to get the TEC folks involved.
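To illustrate the cache check described above, a rough watcher might look like this (a sketch only; the polling interval is an arbitrary choice, and the cache path is the default quoted above):

    #!/usr/bin/env python
    # Sketch: watch whether the TEC adapter cache file is growing.
    import os, time

    CACHE = "/etc/Tivoli/tec/cache"   # default path quoted above
    INTERVAL = 60                     # seconds between samples (arbitrary)

    def size():
        return os.path.getsize(CACHE) if os.path.exists(CACHE) else 0

    last = size()
    while True:
        time.sleep(INTERVAL)
        now = size()
        if now > last:
            # A growing cache means the adapter can't reach the TEC
            # server and is queuing events locally.
            print("cache growing: %d -> %d bytes" % (last, now))
        last = now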
They might want you to get the java adapter traces mentioned in the conf file, or they might want a trace of the internals of the adapter library. For that, you'd have to obtain a special diagnosis file from them, called ".ed_diag_conf", and hook it in with a special entry in the conf file. But then they'd have to read the traces, and all of that would require that you open a call to Support.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Van Order, Drew \(US -
Hermitage\)" <dvanorder AT deloitte DOT com> Sent by: owner-nv-l AT lists.us.ibm DOT com
09/14/2004 10:22 AM
|
To
| <nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| [nv-l] nvtecia still
hanging or falling behind processing
TEC_ITS.rs |
|
Hi all,
After patching 7.1.4 FP01 with the latest efix to fix nvcorrd/nvtecia hanging or stalling, we find it's still happening. It mainly starts when we get a surge of Cisco syslog traps from devices. The only piece not keeping up is the NV-to-TEC integration; demand polls are fine and events are moving in the Event Browser. TEC_ITS only passes traps on; we do no other processing in the ruleset. TEC events from sources outside NV are not impacted. We send an hourly Interface Down trap via cron to serve as a heartbeat. When it misses the second one in a row (as seen at TEC), we cycle NV and it's OK again. MLM is not an option for our environment. Is anyone else struggling with this? Thanks--Drew
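For anyone borrowing the heartbeat idea, the crontab entry would be something like the sketch below; the script path is a placeholder, since the post doesn't say exactly how the Interface Down trap is generated:

    # Hypothetical crontab entry: send one heartbeat trap at the top
    # of every hour (send_hb_trap.sh is a placeholder script that
    # issues the Interface Down trap to the NetView host)
    0 * * * * /usr/local/bin/send_hb_trap.sh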