Veritas-bu

Re: [Veritas-bu] Long time between "requesting resources" and "grantingresources"

2008-07-10 15:55:26
Subject: Re: [Veritas-bu] Long time between "requesting resources" and "grantingresources"
From: <mikemclain AT northwesternmutual DOT com>
To: <Mark.Donaldson AT cexp DOT com>, <Veritas-bu AT mailman.eng.auburn DOT edu>
Date: Thu, 10 Jul 2008 14:26:41 -0500
Mark,

We experienced the same issue you are referring to when we upgraded to
NBU 6.0 MP5 last fall.     After working with Symantec support and
engineering they have made very good improvements in the performance of
the resource broker (nbrb).

The problem would manifest itself as more and more jobs (1000+) were
added to the queue and the fact that we were using storage unit groups
and Inline-tape-copy (ITC).    We would see requesting/granting wait
times for disk or tape jobs at 20-40 minutes even though the resources
were available.  Engineering enhanced the nbrb code to cache the
"resource availability" data rather than having to query the EMM
database for each allocation request.

I'm not sure if this log output is in the 6.5.1 nbrb, but you can view
the nbrb evaluation cycle performance data by parsing the raw nbrb (118)
vxlogs for the word "msec":

        grep msec /usr/openv/logs/51216-118-* 

It will show you the runtime stats and how well the caching is doing:

51216-118-1067800690-080705-0000000000.log:0,51216,118,118,3133941,12152
35374476,22178,10,0:,345:Resource evaluation completed. Evaluated 243
requests, evaluation time: 12398 msec, <EVALPERF> numSucceeded=2,
numQueued=241, numPended=0, numFailed=0, numMultipleCopyJobs=20,
numMdsCacheMisses=3, numMDARReleaseDuringEval=4, restartedEvalCycle=0,
numDriveReleasesBeforeEval=0, numDrivesAvailStart=211,
numDrivesAvailEnd=211, requestQueueSize=241,25:ResBroker_i::doEvaluation

# While we were troubleshooting this with Symantec, I ended up writing a
Perl script to parse these stats for graphing in Excel so we could see
the pain points and help narrow down the issue.

I believe the updated nbrb code should be in 6.5.2 & 6.0 MP6 and there
are some tweaks that can be done via a new config file
(/usr/openv/var/global/nbrb.conf) to improve performance.

        RESPECT_REQUEST_PRIORITY=0
        SECONDS_FOR_EVAL_LOOP_RELEASE=30

HTH

-Mike

-----Original Message-----
From: veritas-bu-bounces AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-bounces AT mailman.eng.auburn DOT edu] On Behalf Of
Mark.Donaldson AT cexp DOT com
Sent: Wednesday, July 09, 2008 5:41 PM
To: Veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] Long time between "requesting resources" and
"grantingresources"

Hi all,

I'm seeing some apparently slow backups.  When I dig into the job
details, I see that often there's a long delay between the "requesting
resources" messages and the "granting resources" messages.  

The master server is an 8 CPU, 16G Solaris 8 v880 box.  It's got
NBv6.5.1.

It's running with an average of 50% idle or so but the load average is
kinda high: 


last pid:  8851;  load averages:  4.01,  3.66,  3.60
11:50:46
170 processes: 164 sleeping, 1 zombie, 1 stopped, 4 on cpu
CPU states: 53.2% idle, 13.0% user, 28.8% kernel,  5.0% iowait,  0.0%
swap
Memory: 16G real, 8702M free, 2372M swap in use, 24G swap free



Here's the details of a job that ran slower than I'd like.  36 minutes
from "request" to "granted".  It's a high-priority backup, highest of
all of them, FWIW.  It, and many of my policies now, are backing up to
storage unit groups.   The AP-01 client below is an AIX media server
with NBv6.

07/09/2008 08:36:45 - requesting resource AP01_Pri
07/09/2008 08:36:45 - requesting resource ug-50.NBU_CLIENT.MAXJOBS.ap-01
07/09/2008 08:36:45 - requesting resource
ug-50.NBU_POLICY.MAXJOBS.AP-01_RM_Arch_iprd
07/09/2008 09:12:21 - granted resource  ug-50.NBU_CLIENT.MAXJOBS.ap-01
07/09/2008 09:12:21 - granted resource
ug-50.NBU_POLICY.MAXJOBS.AP-01_RM_Arch_iprd
07/09/2008 09:12:21 - granted resource  005195
07/09/2008 09:12:21 - granted resource  LTO2-06
07/09/2008 09:12:21 - granted resource  AP01-ADIC2
07/09/2008 09:12:25 - started process bpbrm (pid=33284230)
07/09/2008 09:12:26 - connecting
07/09/2008 09:12:37 - connected; connect time: 0:00:00
07/09/2008 09:12:49 - mounted 005195
07/09/2008 09:12:49 - positioning 005195 to file 79
07/09/2008 09:12:51 - positioned 005195; position time: 0:00:02
07/09/2008 09:12:51 - begin writing
07/09/2008 09:13:51 - end writing; write time: 0:01:00
the requested operation was successfully completed (0)

Anybody seeing this?  Any reasons you know of?

-Mark

_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

This e-mail and any attachments may contain confidential information of 
Northwestern Mutual. If you are not the intended recipient of this message, be 
aware that any disclosure, copying, distribution or use of this e-mail and any 
attachments is prohibited. If you have received this e-mail in error, please 
notify Northwestern Mutual immediately by returning it to the sender and delete 
all copies from your system. Please be advised that communications received via 
the Northwestern Mutual Secure Message Center are secure. Communications that 
are not received via the Northwestern Mutual Secure Message Center may not be 
secure and could be observed by a third party. Thank you for your cooperation.


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu