Thank you for replying. I just finished researching the tracing capability you mentioned, James--turning it on now to see whether the events are getting that far or not. I hope it doesn't impact performance much, because it seems valuable to keep on 100% of the time, much like logging TEC events. This alone should tell us where the holdup is and cut troubleshooting in half.
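For anyone following along, this is what we're enabling--the entry name is the one James gives below; the file location is my assumption about where it lives on a standard UNIX install:

    # In the NetView-to-TEC integration conf file (path is an
    # assumption, commonly /usr/OV/conf/tecint.conf on UNIX):
    NvserverdTraceTecEvents=YES
    # Trace output then shows up in nvserverd.log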
We have two thresholds for reacting to a late hourly HB. If it shows up before the next hourly heartbeat is due, we deem it OK (a slowdown), but the Ops folks know they have to rely more on the Event Browser until things sync back up. If the HB still has not appeared after the full hour, we are pretty sure the integration is hung, my team is paged, and we cycle the daemons. We can't wait any longer than that because by then events have likely been missed.
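Roughly, the paging logic amounts to the following (a minimal sketch in Python; the state file and the paging hook are hypothetical placeholders, not our actual tooling):

    #!/usr/bin/env python
    # Sketch of the two-threshold heartbeat check described above.
    import os, time

    HB_INTERVAL = 3600                     # heartbeat trap sent hourly via cron
    LAST_HB_FILE = "/var/tmp/last_tec_hb"  # hypothetical: touched when HB seen at TEC

    def page_oncall(msg):
        print(msg)  # placeholder for the real paging integration

    def check_heartbeat():
        age = time.time() - os.path.getmtime(LAST_HB_FILE)
        if age < HB_INTERVAL:
            return "OK"
        if age < 2 * HB_INTERVAL:
            # Late, but the next hourly HB isn't missed yet: treat as
            # a slowdown; Ops leans on the Event Browser meanwhile.
            return "SLOWDOWN"
        # Second HB in a row missed: assume the integration is hung,
        # page the team, and cycle the daemons.
        page_oncall("NV-to-TEC heartbeat missed twice; cycling daemons")
        return "HUNG"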
The trap surges aren't anywhere near the sustained 6-8 traps/sec that NV is rated for. What we usually see are surges of traps over an hour, each trap within 4-5 seconds of the last--usually when an ATM VC is flapping between up and down. We think this should be easily handled, especially based on the TEC_ITS load testing we performed a few months ago. That testing showed we could handle the following:
Test Two--trapd and TEC_ITS flood.
Success criteria: no degradation in console or command-line performance during the test; no loss of TEC events; and no more than a 2-minute delay before events are seen at the TEC console.
- 250 traps within 5 seconds, every 10 minutes, for 2 hours--pass
- 1,000 traps within 20 seconds, every 10 minutes, for 2 hours--trapd passed, TEC failed. No events were lost, but they were queued for up to 30 minutes while netview.rls performed correlation
- 300 traps within 5 seconds, every 5 minutes, for 24 hours--pass (close to practical max)
- [Application Maximum] 7 traps per second for 2 hours--halted, with TEC queuing 2,000 events within 10 minutes. No events lost
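For anyone wanting to reproduce a flood along these lines, a small driver can approximate it. This is a sketch only--the target host, community string, and trap OID are placeholders, and it shells out to net-snmp's snmptrap rather than whatever harness we actually used:

    #!/usr/bin/env python
    # Sketch of a trap-flood driver like the tests above.
    # Assumes net-snmp's snmptrap command is on PATH.
    import subprocess, time

    HOST = "nvserver.example.com"      # placeholder target
    COMMUNITY = "public"               # placeholder community
    TRAP_OID = "1.3.6.1.6.3.1.1.5.3"   # linkDown, for illustration

    def send_burst(count):
        for _ in range(count):
            subprocess.call([
                "snmptrap", "-v", "2c", "-c", COMMUNITY,
                HOST, "", TRAP_OID,
            ])

    # e.g. the first test: 250 traps in a burst, every
    # 10 minutes, for 2 hours (12 rounds)
    for _ in range(12):
        send_burst(250)
        time.sleep(600)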
Our stress testing used Interface Down traps, so it was much more strenuous. Cisco traps don't even reach netview.rls correlation; they just use TEC_ITS to get to TEC. It's bizarre that we could hammer it for hours with internal NV traps, yet a smaller volume of external Cisco traps seems to choke it. The NV ruleset node used for Cisco traps appears fine when you look at it.
Thanks again--Drew
I'm not aware of anyone else reporting a similar problem. Historically, however, the adapter has always been load sensitive. But let's clarify the issue a bit, shall we? Are you saying that the adapter slows down, or that it hangs? Does the heartbeat event get there eventually? How slow is it? Do things ever recover without your taking everything down, or not? How long does that take? How big is this trap surge you are talking about?
There is no simple way to diagnose this issue, because there is the ZCE engine in the middle, and because nvserverd has no idea what is going on after it does tec_put_event. As far as NetView is concerned, once that call occurs, the event has been sent. Whether it gets to the server or not is the responsibility of the code in the TEC EEIF library. You can use the conf file entry NvserverdTraceTecEvents=YES, or the corresponding environment variable, to get an nvserverd.log and see whether nvserverd has handed the event to the adapter in a timely fashion. Then you would have to check the adapter's cache file, by default /etc/Tivoli/tec/cache, and see whether it is caching events. It will do that if communications with the server hiccup, but it should recover from that automatically; when communication is lost, it tries again on every subsequent event. If the cache isn't growing, and nvserverd has logged the event, then the problem is internal to the TEC code. To go deeper, you'd have to get the TEC folks involved.
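To illustrate the cache check described above, a rough watcher might look like this (a sketch only; the polling interval is an arbitrary choice, and the cache path is the default quoted above):

    #!/usr/bin/env python
    # Sketch: watch whether the TEC adapter cache file is growing.
    import os, time

    CACHE = "/etc/Tivoli/tec/cache"   # default path quoted above
    INTERVAL = 60                     # seconds between samples (arbitrary)

    def size():
        return os.path.getsize(CACHE) if os.path.exists(CACHE) else 0

    last = size()
    while True:
        time.sleep(INTERVAL)
        now = size()
        if now > last:
            # A growing cache means the adapter can't reach the TEC
            # server and is queuing events locally.
            print("cache growing: %d -> %d bytes" % (last, now))
        last = now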
They might want you to get the java adapter traces mentioned in the conf file, or they might want a trace of the internals of the adapter library. For that, you'd have to obtain a special diagnosis file from them, called ".ed_diag_conf", and hook it in with a special entry in the conf file. But then they'd have to read the traces, and all of that would require that you open a call to Support.
James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group
"Van Order, Drew \(US -
Hermitage\)" <dvanorder AT deloitte DOT com> Sent by: owner-nv-l AT lists.us.ibm DOT com
09/14/2004 10:22 AM
|
To
| <nv-l AT lists.us.ibm DOT com>
|
cc
|
|
Subject
| [nv-l] nvtecia still
hanging or falling behind processing
TEC_ITS.rs |
|
Hi all,
After patching 7.1.4 FP01 with the latest efix to fix nvcorrd/nvtecia hanging or stalling, we find it's still happening. It mainly starts when we get a surge of Cisco syslog traps from devices. The only piece not keeping up is the NV-to-TEC integration; demand polls are fine and events are moving in the Event Browser. TEC_ITS only passes traps on; we do no other processing in the ruleset. TEC events from sources outside NV are not impacted. We send an hourly Interface Down trap via cron to serve as a heartbeat. When it misses the second one in a row (as seen at TEC), we cycle NV and it's OK again. MLM is not an option for our environment. Is anyone else struggling with this? Thanks--Drew
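For anyone borrowing the heartbeat idea, the crontab entry would be something like the sketch below; the script path is a placeholder, since the post doesn't say exactly how the Interface Down trap is generated:

    # Hypothetical crontab entry: send one heartbeat trap at the top
    # of every hour (send_hb_trap.sh is a placeholder script that
    # issues the Interface Down trap to the NetView host)
    0 * * * * /usr/local/bin/send_hb_trap.sh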