Re: [nv-l] appl queue size question

2005-05-25 11:15:18
Subject: Re: [nv-l] appl queue size question
From: Francois Le Hir <flehir AT ca.ibm DOT com>
To: nv-l AT lists.us.ibm DOT com
Date: Wed, 25 May 2005 11:14:38 -0400



James,

I have seen the question on the list recently but didn't see any answer, and
I think it is related to this thread.
Every day I get several netview traps IBM_NVFERR_EV (specific 58851330)
with messages like this:

servmon probably died, ungracefully disconnected from trapd
netmon probably died, ungracefully disconnected from trapd
netmon-related Application reached maximum number of outstanding events,
disconnecting from trapd.

However, no daemon seems to fail, or else they restart by themselves, since I
never have to restart anything.

A while ago I tried to address this issue with support (to at least
understand the meaning of these traps), and I was told to increase
the "appl queue size". It is now set to 25000 on my system, and although it
has (probably) reduced the number of these traps, I still get some
almost every day.
Does such a high value of 25000 do any good?
Running the same (or a similar) script as Scott's never shows a high value
for the queue usage.

Thanks
Salutations, / Regards,

Francois Le Hir
Network Projects & Consulting Services
IBM Global Services
Phone: (514) 964 2145


                                                                           
             James Shanks <jshanks AT us DOT ibm.com>
             Sent by: [email protected]
             To: nv-l AT lists.us.ibm DOT com
             cc:
             Subject: Re: [nv-l] appl queue size question
             05/25/2005 10:23 AM
             Please respond to nv-l




Scott,

I hesitate to say it, but the phrase, "Luke, you are messing with powers
you cannot possibly understand," comes to mind.
(Guess which movie we saw recently?)   And I'll apologize now for that
feeble attempt at humor, while I attempt to answer your question.  Of
course, you can understand, once someone explains what you are actually
looking at.  So here goes.

Basically, you have an application queue size of 5000 events, period. The
55042 is a process id, and is irrelevant. That's all the trace tells you
at this time, except that the queues are not backed up, since you are
seeing one event being added and then immediately deleted. Running this
script when events are actually falling behind might tell you how close
the appl queues are to being full, but running it now, when you don't have
a problem, tells you very little. By itself, this script is not a
performance analysis tool, only a diagnostic aid.
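When the script is run during an actual backlog, the useful number is the largest queue depth each application reached relative to the maximum. A minimal sketch, assuming each trace line is unwrapped and follows the format in the quoted output (for example `... send_to_all_appls: [55042] appl queue size 1 of maximum 5000 events`); `peak_queue_usage` is a hypothetical helper, not a NetView command.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetView): report the largest
# "appl queue size" seen for each bracketed id in a trapd trace, and how
# full that peak was relative to the configured maximum.
# Assumes one unwrapped line per trace entry, e.g.
#   ... send_to_all_appls: [55042] appl queue size 1 of maximum 5000 events
# Reads the files named as arguments, or standard input if none are given.
peak_queue_usage() {
    awk '/appl queue size/ {
        id = ""; cur = ""; max = ""
        for (i = 1; i <= NF; i++) {
            # The bracketed number (a process id, per the reply above)
            if ($i ~ /^\[[0-9]+]$/) id = substr($i, 2, length($i) - 2)
            # Events currently queued, then the configured maximum
            if ($i == "size") cur = $(i + 1) + 0
            if ($i == "maximum") max = $(i + 1) + 0
        }
        if (id == "" || max <= 0) next
        # Keep the highest queue depth observed for this id
        if (!(id in peak) || cur > peak[id]) peak[id] = cur
        limit[id] = max
    }
    END {
        for (id in peak)
            printf "%s peak=%d max=%d full=%.1f%%\n", id, peak[id], limit[id], 100 * peak[id] / limit[id]
    }' "$@"
}
```

For example, after capturing a trace during a trap storm, `peak_queue_usage /usr/OV/log/trapd.trace | sort` lists one line per id; awk does not guarantee the order of its `for (id in peak)` loop, hence the `sort`.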

Now, since the default application queue size in trapd is 2000 events,
yours has already been changed at least once and is more than double the
usual amount.  Apparently someone has been tuning this before.   So what
problem are you trying to solve, what symptoms are you seeing?

This queue size determines how many events trapd will queue for a connected
application that is not responding (or is responding too slowly) before he
closes that application's socket connection to him. He does so in order to
avoid his own demise from lack of storage. Usually, the only reason to alter
this size is that you have periodic trap storms, so the connected
applications get a whole bunch of traps all at once, and after the storm
subsides they have a lot to do to catch up. So you raise the size of the
queues to hold more events so they can do that. Otherwise, they get forced
off and all the events in the queue for them are discarded. Sometimes that
really is the best thing to do, to let them get forced off, and sometimes
not. It's a trade-off. If they don't get forced off, then they get backed
up, and it may take a while for them to catch up.
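The trade-off can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not trapd's actual logic: traps arrive at a fixed rate during the storm, the application drains a fixed number per second, and it is forced off (its queue discarded) as soon as its backlog exceeds the maximum. `simulate_storm` and its parameters are hypothetical.

```shell
#!/bin/sh
# Toy model of the queue-size trade-off; NOT trapd's real algorithm.
# Hypothetical arguments:
#   $1 maximum queue size   $2 traps per second during the storm
#   $3 storm length (s)     $4 events the application handles per second
simulate_storm() {
    max=$1 rate=$2 storm=$3 drain=$4
    queued=0
    t=0
    # Step one second at a time until the storm is over and the queue is empty
    while [ "$t" -lt "$storm" ] || [ "$queued" -gt 0 ]; do
        t=$((t + 1))
        if [ "$t" -le "$storm" ]; then
            queued=$((queued + rate))   # new traps queued for this application
        fi
        if [ "$queued" -gt "$max" ]; then
            # Over the limit: the application is disconnected and its queue dropped
            echo "forced off at t=${t}s, discarded $queued events"
            return
        fi
        queued=$((queued - drain))      # the application works through its backlog
        if [ "$queued" -lt 0 ]; then
            queued=0
        fi
    done
    echo "caught up after ${t}s"
}
```

With 500 traps per second for 10 seconds and an application handling 300 per second, `simulate_storm 2000 500 10 300` prints `forced off at t=9s, discarded 2100 events`, while `simulate_storm 5000 500 10 300` prints `caught up after 17s`: the larger queue keeps the application connected, but it stays behind for seven seconds after the storm ends.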

Unfortunately, there is no tool I know of which can tell you how big you
should make the application queue size if you don't want the appls forced
off. And I should know; I'm responsible for trapd maintenance. Like
most tuning issues, picking an application queue size other than the
default is a trial-and-error business.

James Shanks
Level 3 Support for Tivoli NetView for UNIX and Windows
Tivoli Software / IBM Software Group



             "Bursik, Scott {PBSG}" <[email protected]>
             Sent by: [email protected]
             To: "'Nv-L (nv-l AT lists.us.ibm DOT com)'" <nv-l AT lists.us.ibm DOT com>
             cc:
             Subject: [nv-l] appl queue size question
             05/25/2005 09:15 AM
             Please respond to nv-l






All,

I am having some performance issues with my production NetView server, and
in an effort to diagnose the issue I ran a script that someone from the
forum contributed a while back. It checks the appl queue.

When I run the script I get the following output:

Turning on trapd tracing
Starting tracing now....
Toggling trace mode of SNMP trap daemon
Waiting for one minute---------------------------|
.................................................
Stopping Tracing....
Toggling trace mode of SNMP trap daemon
Getting trapd status from /usr/OV/log/trapd.trace
Wed May 25 08:04:17 2005 send_to_all_appls: [0] appl queue size 1 of maximum 5000 events
Wed May 25 08:04:17 2005 send_to_all_appls: [55042] appl queue size 1 of maximum 5000 events

Should I be concerned with the last line? If I am reading this correctly, I
am configured for a max of 5000 events and I have a queue size of 55042. I
would say that the appl queue size needs to be changed. We have a very
large environment.


Here is the script so you can see what it is doing:

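#!/usr/bin/ksh
# Capture about 50 seconds of trapd trace and show the "appl queue size" lines.
clear
# Empty the old trace so only this run's entries are examined
: > /usr/OV/log/trapd.trace
echo "Turning on trapd tracing"
echo ""
echo ""
echo "Starting tracing now...."
# trapd -T toggles tracing: on here, off again below
/usr/OV/bin/trapd -T
echo ""
echo ""
# Progress indicator: one dot per second, running in the background
while :; do
        sleep 1
        echo ".\c"
done &
Progress=$!
# On interrupt, stop the progress loop and toggle tracing back off.
# trap syntax is: trap 'command' signals. Signal 9 (KILL) cannot be caught,
# so only INT and TERM are handled.
trap 'kill $Progress; /usr/OV/bin/trapd -T; exit 1' INT TERM
echo "Waiting for 50 seconds---------------------------|"
sleep 50
kill $Progress
echo ""
echo ""
echo "Stopping Tracing...."
/usr/OV/bin/trapd -T
echo ""
echo ""
echo "Getting trapd status from /usr/OV/log/trapd.trace"
# Search the whole trace, then show the last 10 matches
# (tail before grep would only look at the last 10 lines of the file)
grep "appl queue size" /usr/OV/log/trapd.trace | tail
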
Thank You!

Scott Bursik






