ADSM-L

Re: Questions on NT performance on restores

1997-06-07 15:53:34
Subject: Re: Questions on NT performance on restores
From: Andrew Raibeck <storman AT US.IBM DOT COM>
Date: Sat, 7 Jun 1997 15:53:34 -0400
Angel Boles wrote:

> We are not getting good response on doing a large NT restore.
> Specifically, restoring an entire D drive that has about 30 gig with
> 500,000+ files.  Network is equivalent to FDDI, have FDDI connection
> into the ADSM server, NT client code is 2.1.06, ADSM server is AIX
> (4.1.5)  2.1.0.10, using 3590 tape technology in a 3494 automatic tape
> library.  dsm.opt parms have TCPBUFFSIZE   31 and   TCPWINDOWSIZE 24,
> compression is on, slowincr is no, and we are using tcp/ip.  There is
> nothing else going on the NT client when we are doing a restore.

> We can't seem to get more than 1/2 gig hr, sometimes less.  Any ideas how
> to improve this or where else to look?  We are getting bad press because
> of how long it would take to do an ADSM restore in case of a disaster.

> Is there a 'performance tuning'  type of document for NTs?

Unfortunately there's no easy answer to this, as performance problems can run
the gamut from the client hardware/software to the server hardware/software,
and everything in between. To further complicate matters, there may be more
than one bottleneck.

Some things to consider:

Is the NT client waiting for tape resources that another client is using?

Do you have TXNGROUPMAX set to 256 and TXNBYTELIMIT set to 25600?

Do a QUERY DB F=D. Is your ADSM database achieving at least a 98% cache hit
ratio? Is the cache wait % 0.0? If not, you need to increase the server
BUFPOOLSIZE.

Do a QUERY LOG F=D. Is the ADSM recovery log showing a % log wait of 0.0? If
not, you need to increase the server LOGPOOLSIZE.

Is the ADSM server doing anything else at the time that might consume
resources (e.g. other client activity or server processes)?

Depending on the client hardware, decompressing the data during the restore
may slow things down. Try doing an uncompressed backup and restore to see if
that makes any difference.

What is the average file size you are restoring? Smaller files tend to have
worse performance characteristics than large files.
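As a rough back-of-the-envelope check, the figures from the original question (about 30 GB, 500,000+ files, roughly 1/2 GB per hour) can be turned into an average file size and a full-restore estimate. This is only a sketch: the function names are made up for illustration, and it assumes 1 GB = 1024^3 bytes and a constant restore rate.

```python
# Back-of-the-envelope numbers from the figures quoted in the question.
# Assumptions (not from ADSM itself): 1 GB = 1024**3 bytes, constant rate.

GB = 1024 ** 3  # bytes per gigabyte, binary convention assumed


def average_file_size_kb(total_bytes, file_count):
    """Mean file size in KB (1 KB = 1024 bytes)."""
    return total_bytes / file_count / 1024


def restore_hours(total_gb, rate_gb_per_hour):
    """Hours needed to restore total_gb at a steady rate."""
    return total_gb / rate_gb_per_hour


if __name__ == "__main__":
    total_gb = 30        # size of the D drive, per the question
    file_count = 500_000  # "500,000+ files", per the question
    rate = 0.5           # observed "1/2 gig hr"

    avg_kb = average_file_size_kb(total_gb * GB, file_count)
    hours = restore_hours(total_gb, rate)
    print(f"average file size: {avg_kb:.1f} KB")
    print(f"full restore at {rate} GB/hr: {hours:.0f} hours")
```

An average in the tens of KB means the restore is dominated by per-file overhead rather than raw data transfer, which is why the file count matters as much as the 30 GB total.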

How much time is spent waiting for tape mounts?

Do you collocate your storage pools? Starting with the 2.1.x.12 server, you
can now collocate by filespace, in addition to collocating by node.
Uncollocated data may contribute to the problem.

Is the NT system dedicated to the restore, or is it performing other
functions besides the ADSM restore? (In your case, you said that the NT client
is dedicated to the restore.)

Are both the client and server operating systems optimally tuned? How about
the disk subsystems, TCP/IP stacks and other network-related settings?

A place to start would be to obtain an "instr_client_detail" trace of the
restore:

1) Identify a representative amount of data to try to restore (you don't need
to do all 30 GB, but try at least a couple of hundred MB or so).

2) Issue the restore with this command:

DSMC RESTORE filespec -SU=Y -TRACEFILE=TRACE.OUT -TRACEFLAGS=INSTR_CLIENT_DETAIL

(or use -SU=N, depending on what you want to restore)

3) After the restore completes, make a note of the restore statistics,
including the number of objects restored, total bytes transferred, etc.

4) Issue the ADSM admin command:

QUERY NODE nodename F=D

Record the bytes received, bytes sent, duration, % idle wait, % comm wait, and
% media wait.

5) Examine the TRACE.OUT file. It's only a dozen lines or so, showing you (from
the ADSM client's perspective) where ADSM is spending its time. Here is what a
sample trace looks like:

Final Detailed Instrumentation statistics

Elapsed time:    17.430 sec

Section      Total Time(sec)  Average Time(msec)  Frequency used
------------------------------------------------------------------
Client Setup        0.460          460.0              1
Process Dirs        0.290          145.0              2
Solve Tree          0.000            0.0              0
Compute             0.000            0.0              0
Transaction         4.090            4.4            927
BeginTxn Verb       0.000            0.0              5
File I/O            3.700           10.5            351
Compression         0.130          130.0              1
Data Verb           8.610           15.1            569
Confirm Verb        0.000            0.0              0
EndTxn Verb         0.000            0.0              0
Client Cleanup      0.150          150.0              1

I can't go into a complete description of all of this, but I can see that about
half the time was spent in "Data Verb", which refers to time spent in sending
or receiving data to/from the communication layer. If this value looks
abnormally high, you might want to investigate both the ADSM server and the
network. This is where the % idle wait, % comm wait, and % media wait from the
QUERY NODE come into play. For instance, a large % comm wait would point to the
client and the network as the areas to investigate. Since we already know that
the client thinks about half the time was spent in communications, the network
is the most likely suspect. This would include client and server network
settings, and the networking components between. (By the way, my sample trace
here is of such a small amount of data that it's probably statistically
insignificant.)
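To make the "about half the time" reading less of an eyeball estimate, the section totals can be converted to percentages. The Python sketch below is one way to do that. It assumes the trace rows look exactly like the sample above (section name, total seconds, average milliseconds, frequency); the regular expression and function names are my own, not part of ADSM. It also assumes the sections do not overlap, which holds for this sample because the section totals add up to the 17.430 sec elapsed time.

```python
import re

# Rows copied from the sample trace above.
SAMPLE_TRACE = """\
Client Setup        0.460          460.0              1
Process Dirs        0.290          145.0              2
Solve Tree          0.000            0.0              0
Compute             0.000            0.0              0
Transaction         4.090            4.4            927
BeginTxn Verb       0.000            0.0              5
File I/O            3.700           10.5            351
Compression         0.130          130.0              1
Data Verb           8.610           15.1            569
Confirm Verb        0.000            0.0              0
EndTxn Verb         0.000            0.0              0
Client Cleanup      0.150          150.0              1
"""

# Assumed row layout: name, two or more spaces, total sec, avg msec, count.
ROW = re.compile(
    r"^(?P<section>.+?)\s{2,}(?P<total>\d+\.\d+)\s+(?P<avg>\d+\.\d+)\s+(?P<freq>\d+)$"
)


def parse_trace(text):
    """Return {section name: total seconds} for every row that matches."""
    totals = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            totals[match.group("section")] = float(match.group("total"))
    return totals


def time_shares(totals):
    """Return {section name: percent of the summed section time}."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: 100.0 * secs / grand_total for name, secs in totals.items()}


if __name__ == "__main__":
    shares = time_shares(parse_trace(SAMPLE_TRACE))
    # Print the largest consumers first.
    for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name:<16}{pct:6.1f}%")
```

Header, separator, and "Elapsed time" lines are skipped because they do not match the row pattern, so the whole TRACE.OUT contents can be passed to parse_trace without trimming.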

Hopefully this information will help provide a starting point.

Andy Raibeck
ADSM Level 2 Support