Subject: Re: [nv-l] Loss of traps with MLM
From: "James Shanks" <jshanks AT us.ibm DOT com>
To: nv-l AT lists.tivoli DOT com
Date: Tue, 2 Apr 2002 13:23:26 -0500
I am not the MLM guy, but I do know that there is no way, documented or 
undocumented, to alter any buffer sizes it uses without a code change.  So 
what you are looking for doesn't exist.  Yet I am also not sure what to 
say about your problem, because this is the first time I have ever heard 
of MLM being accused of losing traps.  Perhaps I should also point out 
that NetView and MLM share no code whatsoever.  If they did, we could 
not have an MLM on HP-UX; that would be prohibited by our original 
purchase agreement with HP, just as a NetView for HP-UX would be.  So MLM, 
while it is shipped with NetView these days, remains a completely separate 
product, code-wise.  You cannot assume that a feature on one is the same 
as on the other, nor that you can willy-nilly substitute one for the other 
and achieve the same result.

I am rather curious about your test procedure, since the command 
"send_event" is not shipped with either NetView or MLM.  What does 
it do?  Is it a command to MLM or to NetView?  Did you write it yourself? 
Does it have tracing or error logging associated with it?  The reason I 
ask is that if it opens a TCP connection to MLM to cause the event to be 
sent, it may very well be that under the conditions of your test, MLM was 
often too busy to accept that connection, and thus it may be that it did 
not lose any traps, but rather never received them in the first place. 
There is a BIG difference between the two.  If the traps were never sent, 
then perhaps you just need better error checking in your command.
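To illustrate the distinction between "failed to send" and "lost after delivery", here is a minimal Python sketch of a sender that reports connection failures instead of dropping them silently. The host, port, and the assumption that send_event talks TCP to MLM are all hypothetical; this is not the actual send_event code, which is not shipped with either product.

```python
import socket

def send_with_check(host, port, payload, timeout=2.0):
    """Try to deliver payload over a TCP connection and report the
    outcome, so a connection refused by a busy receiver is logged as
    a send failure rather than mistaken for a trap lost downstream."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(payload)
        return True
    except OSError as exc:
        # The event never left this host; record that fact.
        print(f"send failed: {exc}")
        return False
```

With error checking like this, a flood test would show whether the "missing" traps were ever handed to MLM at all.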

Also, I am curious about your 1-second disarm timer.  Since neither MLM 
nor NetView for UNIX is multi-threaded, each can only do one thing at a 
time.  If you raise your limits, does the problem disappear?  Allowing 
even one node to send 20 traps per second is a sure way to bring your 
NetView processing to a crawl, so throttling such a node is not an 
unreasonable thing to do.  Your NetView events GUI will begin to flicker 
at about 5 events per second if you display them, and without the trap 
pruning added in NetView Version 6 (not sending unnecessary traps to the 
daemons that don't need them) your netmon and ovtopmd will fall far 
behind and may never catch up unless recycled.  They may simply 
disconnect from trapd, and when that happens ovtopmd will stop and wait 
for you to reconnect it with ovstart.

I am not certain what anyone can do for you under the circumstances you 
describe.  The code you have is out of support, so a performance issue 
involving it cannot be officially pursued.  And it seems clear to me that 
unless it is so pursued, with other people trying to duplicate your 
results, there is very little that can be done, except to tell you that 
you will have to live within the limits of the code you have.  Sorry, but 
I see no alternatives.

James Shanks
Level 3 Support  for Tivoli NetView for UNIX and NT
Tivoli Software / IBM Software Group
 





Robin James <robin.james AT thalesatm DOT com>
04/02/2002 09:58 AM

 
        To:     NetView Discussion <nv-l AT lists.tivoli DOT com>
        cc: 
        Subject:        [nv-l] Loss of traps with MLM

 

We have been performing an experiment to determine whether our NetView
computer can lose locally generated traps.

We run NetView 5.1.3 on Compaq Tru64 UNIX, with MLM on the same machine
to use its filtering capability. We have set up a filter to throttle
traps with the following settings:

smMlmFilterName[BlockTrapFlooding] =  "BlockTrapFlooding"
smMlmFilterState =  enabled
smMlmFilterDescription =  "Blocks traps when too many traps come from
the same host in a short time"
smMlmFilterAction =  throttleTraps
smMlmFilterAgentAddrExpression =  "cwps"
smMlmFilterThrottleType =  sendAfterN
smMlmFilterThrottleArmTrapCount =  20
smMlmFilterThrottleArmedCommand =  "/usr/sbin/Mlm_stop_snmpd.sh
$SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleDisarmTimer =  "1s"
smMlmFilterThrottleDisarmTrapCount =  0
smMlmFilterThrottleDisarmedCommand =  "snmptrap -p 1675 localhost omc
.1.3.6.1.4.1.1254.1 `hostname` 6 104 1 .1.3.6.1.2.1.1.5 OctetStringASCII
$SM6K_TRAP_AGENT_ADDRESS"
smMlmFilterThrottleCriteria =  byNode
smMlmAliasName[cwps] =  "cwps"
smMlmAliasList =  "w1161,
w1162,
w2142"

As you can see from the settings, an alias is also set up so that the
traps generated on the NetView node are not subject to the filter.

We set up 3 nodes to send traps repeatedly using the following script:

while true
do
    send_event 803 "swamp test"
done

This produced approximately 2200 traps in the trapd log in one minute,
and vmstat showed that the NetView node had very little idle time. We
then used send_event on the NetView node to send single traps and
observed that 1 out of 4 events was not present in the trapd or midmand
logs.

This seems to confirm that a trap can be lost when the NetView node is
receiving a very heavy load of traps.

We then repeated the test with MLM removed, so traps went directly to
trapd rather than via midmand, and found that no traps were lost.

It appears to me that the buffering between UNIX and trapd does not lose
the locally generated events, but when MLM is filtering it is possible
to lose traps.

Is it possible to find out the UDP receive buffer size with each
configuration?

I realise that the source of the problem is the node flooding our
NetView node with traps; we must stop that node from sending so many
events. We are trying to put a two-part solution in place to ensure we
do not lose locally generated traps. The two parts are:

1. When MLM detects a node flooding NetView with traps, freeze snmpd on
that node so traps do not get sent.
2. Increase the buffer size between MLM and UNIX.

We think we know what to do for part 1 of the solution, but is it
possible to increase the buffer size between midmand and UNIX for
receipt of traps? I know trapd provides an option to specify a UDP
receive buffer size, but I can't see a similar option for midmand.
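For what it's worth, enlarging the buffer is a one-line setsockopt call on the receiving socket, sketched below in Python; the point is that it must be made by the process that owns the socket, which is why it cannot be done for midmand from outside without a code change. The 256 KB figure is an arbitrary example, and the kernel may round or cap the requested size, so the value should be read back.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
before = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
# Request a 256 KB receive buffer; the kernel may round or cap this
# (typically at a sysctl-controlled maximum), so read the result back
# rather than trusting the request.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
after = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("SO_RCVBUF before/after:", before, after)
s.close()
```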

I would appreciate any comments or help on this problem.
Thanks

-- 
Robin
email: robin.james AT thalesatm DOT com
tel:   +44 (0) 1633-862020
fax:   +44 (0) 1633-868313

---------------------------------------------------------------------
To unsubscribe, e-mail: nv-l-unsubscribe AT lists.tivoli DOT com
For additional commands, e-mail: nv-l-help AT lists.tivoli DOT com

*NOTE*
This is not an Offical Tivoli Support forum. If you need immediate
assistance from Tivoli please call the IBM Tivoli Software Group
help line at 1-800-TIVOLI8(848-6548)
