Veritas-bu

Re: [Veritas-bu] Long time between "requesting resources" and "grantingresources"

2008-07-10 18:40:27
Subject: Re: [Veritas-bu] Long time between "requesting resources" and "grantingresources"
From: <mikemclain AT northwesternmutual DOT com>
To: <Mark.Donaldson AT cexp DOT com>, <Veritas-bu AT mailman.eng.auburn DOT edu>
Date: Thu, 10 Jul 2008 17:23:07 -0500
Mark,

The reason that I have a little more detail in the logs is that we went
through several nbrb binaries troubleshooting this and engineering
needed some additional stats during the evaluation cycle to help them
refine their caching algorithms.    These expanded stats should be
available in the 6.5.2 nbrb.

When we first started troubleshooting this, the glaring pain points are
when eval cycles take 20+ minutes, but the additional data we would
gather would be based the missed cache hits (numMdsCacheMissed).   Each
missed cache hit causes nbrb to go through nbemm and query the EMM
database for the drive information.   When you have several hundred
misses it would drive up the msec total.

When jobs initially comes into the queue, there will be a "missed cache"
event, but subsequent evaluations by nbrb shouldn't slow it down.    

0,51216,118,118,3339896,1215328612245,22178,10,0:,354:Resource
evaluation completed. Evaluated 1594 requests, evaluation time: 816012
msec, <EVALPERF> numSucceeded=111, numQueued=1483, numPended=0,
numFailed=0, numMultipleCopyJobs=203, numMdsCacheMisses=4,
numMDARReleaseDuringEval=6, restartedEvalCycle=1,
numDriveReleasesBeforeEval=130, numDrivesAvailStart=185,
numDrivesAvailEnd=204,
requestQueueSize=1483,25:ResBroker_i::doEvaluation,1

0,51216,118,118,3342623,1215329883689,22178,9,0:,358:Resource evaluation
completed. Evaluated 2264 requests, evaluation time: 1271441 msec,
<EVALPERF> numSucceeded=276, numQueued=1988, numPended=0, numFailed=0,
numMultipleCopyJobs=200, numMdsCacheMisses=656,
numMDARReleaseDuringEval=31, restartedEvalCycle=1,
numDriveReleasesBeforeEval=123, numDrivesAvailStart=187,
numDrivesAvailEnd=169,
requestQueueSize=1988,25:ResBroker_i::doEvaluation,1  

The above data shows some of our larger runs with 2000+ requests, but
nbrb.conf tweaks can alleviate some of this and make the msec time not
as meaningful.

Along with the caching that was done for ITC jobs and STU groups, the
nbrb.conf tweaks allows nbrb to release and reuse resources during the
whole evaluation cycle so even if the cycle takes 20+ minutes, jobs are
still firing off during that time and reusing resources that were
released.    

I don't believe a technote has been published on this yet describing all
the changes in nbrb, but it has made a huge difference in our
environment.

-Mike


-----Original Message-----
From: Mark.Donaldson AT cexp DOT com [mailto:Mark.Donaldson AT cexp DOT com] 
Sent: Thursday, July 10, 2008 2:54 PM
To: MCLAIN, MIKE; Veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] Long time between "requesting resources" and
"grantingresources"

This is great, Mike.

You said you graphed the pain points?  What numbers in that log record
reflect painful values?

I'm at 6.5.1 on my code - I think you gave me a reason to accelerate
6.5.2.

My "msec" numbers are significantly more than your numbers below,
between 10 and 22 minutes for my most recent log file.  FWIW, My logging
seems to be less verbose than yours, too:

> grep msec /usr/openv/logs/51216-118-2200013915-080710-0000000002.log
0,51216,118,118,2302499,1215708383105,19630,12,0:,124:Resource
evaluation completed. Evaluated 308 requests, sent resource pending for
175 requests, evaluation time: 1167778
msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2303108,1215709606394,19630,11,0:,122:Resource
evaluation completed. Evaluated 312 requests, sent resource pending for
0 requests, evaluation time: 1223288 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2303747,1215710991692,19630,12,0:,122:Resource
evaluation completed. Evaluated 304 requests, sent resource pending for
0 requests, evaluation time: 1385297 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2304323,1215712249423,19630,11,0:,122:Resource
evaluation completed. Evaluated 311 requests, sent resource pending for
0 requests, evaluation time: 1257729 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2304847,1215713417318,19630,12,0:,122:Resource
evaluation completed. Evaluated 300 requests, sent resource pending for
0 requests, evaluation time: 1167894 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2305555,1215714563259,19630,11,0:,122:Resource
evaluation completed. Evaluated 300 requests, sent resource pending for
6 requests, evaluation time: 1145939 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2306130,1215715234879,19630,12,0:,123:Resource
evaluation completed. Evaluated 284 requests, sent resource pending for
157 requests, evaluation time: 671619
msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2306685,1215716103486,19630,11,0:,122:Resource
evaluation completed. Evaluated 258 requests, sent resource pending for
26 requests, evaluation time: 868605 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2307131,1215716699437,19630,12,0:,121:Resource
evaluation completed. Evaluated 225 requests, sent resource pending for
0 requests, evaluation time: 595950 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2307633,1215717694512,19630,11,0:,121:Resource
evaluation completed. Evaluated 207 requests, sent resource pending for
0 requests, evaluation time: 995073 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2308132,1215718427500,19630,12,0:,122:Resource
evaluation completed. Evaluated 197 requests, sent resource pending for
10 requests, evaluation time: 732986 msec,25:ResBroker_i::doEvaluation,1
0,51216,118,118,2308621,1215719097241,19630,11,0:,122:Resource
evaluation completed. Evaluated 175 requests, sent resource pending for
44 requests, evaluation time: 669739 msec,25:ResBroker_i::doEvaluation,1

 

-----Original Message-----
From: mikemclain AT northwesternmutual DOT com
[mailto:mikemclain AT northwesternmutual DOT com] 
Sent: Thursday, July 10, 2008 1:27 PM
To: Donaldson, Mark - Broomfield, CO; Veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] Long time between "requesting resources" and
"grantingresources"

Mark,

We experienced the same issue you are referring to when we upgraded to
NBU 6.0 MP5 last fall.     After working with Symantec support and
engineering they have made very good improvements in the performance of
the resource broker (nbrb).

The problem would manifest itself as more and more jobs (1000+) were
added to the queue and the fact that we were using storage unit groups
and Inline-tape-copy (ITC).    We would see requesting/granting wait
times for disk or tape jobs at 20-40 minutes even though the resources
were available.  Engineering enhanced the nbrb code to cache the
"resource availability" data rather than having to query the EMM
database for each allocation request.

I'm not sure if this log output is in the 6.5.1 nbrb, but you can view
the nbrb evaluation cycle performance data by parsing the raw nbrb (118)
vxlogs for the word "msec":

        grep msec /usr/openv/logs/51216-118-* 

It will show you the runtime stats and how well the caching is doing:

51216-118-1067800690-080705-0000000000.log:0,51216,118,118,3133941,12152
35374476,22178,10,0:,345:Resource evaluation completed. Evaluated 243
requests, evaluation time: 12398 msec, <EVALPERF> numSucceeded=2,
numQueued=241, numPended=0, numFailed=0, numMultipleCopyJobs=20,
numMdsCacheMisses=3, numMDARReleaseDuringEval=4, restartedEvalCycle=0,
numDriveReleasesBeforeEval=0, numDrivesAvailStart=211,
numDrivesAvailEnd=211, requestQueueSize=241,25:ResBroker_i::doEvaluation

# While we were troubleshooting this with Symantec, I ended up writing a
Perl script to parse these stats for graphing in Excel so we could see
the pain points and help narrow down the issue.

I believe the updated nbrb code should be in 6.5.2 & 6.0 MP6 and there
are some tweaks that can be done via a new config file
(/usr/openv/var/global/nbrb.conf) to improve performance.

        RESPECT_REQUEST_PRIORITY=0
        SECONDS_FOR_EVAL_LOOP_RELEASE=30

HTH

-Mike

-----Original Message-----
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of
Mark.Donaldson AT cexp DOT com
Sent: Wednesday, July 09, 2008 5:41 PM
To: Veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] Long time between "requesting resources" and
"grantingresources"

Hi all,

I'm seeing some apparently slow backups.  When I dig into the job
details, I see that often there's a long delay between the "requesting
resources" messages and the "granting resources" messages.  

The master server is an 8 CPU, 16G Solaris 8 v880 box.  It's got
NBv6.5.1.

It's running with an average of 50% idle or so but the load average is
kinda high: 


last pid:  8851;  load averages:  4.01,  3.66,  3.60
11:50:46
170 processes: 164 sleeping, 1 zombie, 1 stopped, 4 on cpu
CPU states: 53.2% idle, 13.0% user, 28.8% kernel,  5.0% iowait,  0.0%
swap
Memory: 16G real, 8702M free, 2372M swap in use, 24G swap free



Here's the details of a job that ran slower than I'd like.  36 minutes
from "request" to "granted".  It's a high-priority backup, highest of
all of them, FWIW.  It, and many of my policies now, are backing up to
storage unit groups.   The AP-01 client below is an AIX media server
with NBv6.

07/09/2008 08:36:45 - requesting resource AP01_Pri
07/09/2008 08:36:45 - requesting resource ug-50.NBU_CLIENT.MAXJOBS.ap-01
07/09/2008 08:36:45 - requesting resource
ug-50.NBU_POLICY.MAXJOBS.AP-01_RM_Arch_iprd
07/09/2008 09:12:21 - granted resource  ug-50.NBU_CLIENT.MAXJOBS.ap-01
07/09/2008 09:12:21 - granted resource
ug-50.NBU_POLICY.MAXJOBS.AP-01_RM_Arch_iprd
07/09/2008 09:12:21 - granted resource  005195
07/09/2008 09:12:21 - granted resource  LTO2-06
07/09/2008 09:12:21 - granted resource  AP01-ADIC2
07/09/2008 09:12:25 - started process bpbrm (pid=33284230)
07/09/2008 09:12:26 - connecting
07/09/2008 09:12:37 - connected; connect time: 0:00:00
07/09/2008 09:12:49 - mounted 005195
07/09/2008 09:12:49 - positioning 005195 to file 79
07/09/2008 09:12:51 - positioned 005195; position time: 0:00:02
07/09/2008 09:12:51 - begin writing
07/09/2008 09:13:51 - end writing; write time: 0:01:00
the requested operation was successfully completed (0)

Anybody seeing this?  Any reasons you know of?

-Mark

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

This e-mail and any attachments may contain confidential information of
Northwestern Mutual. If you are not the intended recipient of this
message, be aware that any disclosure, copying, distribution or use of
this e-mail and any attachments is prohibited. If you have received this
e-mail in error, please notify Northwestern Mutual immediately by
returning it to the sender and delete all copies from your system.
Please be advised that communications received via the Northwestern
Mutual Secure Message Center are secure. Communications that are not
received via the Northwestern Mutual Secure Message Center may not be
secure and could be observed by a third party. Thank you for your
cooperation.


This e-mail and any attachments may contain confidential information of 
Northwestern Mutual. If you are not the intended recipient of this message, be 
aware that any disclosure, copying, distribution or use of this e-mail and any 
attachments is prohibited. If you have received this e-mail in error, please 
notify Northwestern Mutual immediately by returning it to the sender and delete 
all copies from your system. Please be advised that communications received via 
the Northwestern Mutual Secure Message Center are secure. Communications that 
are not received via the Northwestern Mutual Secure Message Center may not be 
secure and could be observed by a third party. Thank you for your cooperation.


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu