ADSM-L

Slow restore for large NT client outcome.. appeal to Tivoli Development/Support

2000-09-20 12:27:51
From: Jeff Connor <connorj AT NIAGARAMOHAWK DOT COM>
Date: Wed, 20 Sep 2000 12:21:43 -0400
I posted the memo below to this listserv last week when we were
having trouble with the performance restoring a large NT drive.
This memo is for the people who wanted to know how we made out in
the end.  I am also writing to bring to the attention of TSM
development what I feel is a pretty big problem in the area of
performance for clients with lots of small files.  I am pursuing
this issue with Tivoli through other channels but thought others
on this listserv might have the same concern. For a summary of
our TSM config see my first memo below.

First, let's get a couple of things out of the way.  I have been
working with TSM/ADSM for approximately five years, since
version 1.  I am a HUGE fan of the product and have fought very
hard to get our company to standardize on TSM and leave Arcserve,
Backup Exec, Legato, and the like.  I am pleased with the
improvements in TSM functionality over the years.  The second
thing is that we run TSM on OS/390.  Over time I've seen many posts
on the listserv from users who have achieved better performance
with UNIX-based TSM servers.  We are currently piloting TSM on
AIX to test the performance.

Now that we've established my loyalties, back to my concern about
backup, and more importantly restore, performance for TSM clients
with lots of small files.  Most of our UNIX servers are database
servers so my concerns about small files really pertain mostly to
Windows NT server clients.  Others may have issues with other
platforms.  The NT clients I have restore issues with are big
file and print servers.  The data partition is typically the D:
drive and can be anywhere from 20GB to 160GB in size.   The best
restore time we can achieve for the file and print servers is
somewhere between 1.5GB and 3.5GB per hour, generally on the lower
side.  Now we could go through a lot of the common questions (is
your network performing, is your database cache hit ratio high
enough, TCP window sizes, txn sizes, and the usual things), but
assume for a moment that we are optimally configured and have done
all "the right stuff".  To make a performance comparison, we have a
couple of NT clients that contain a small number of large
files.  We restored 20GB of data on one of those servers recently
in 1hr 45mins.  The restore of the one directory on the D:
partition for the client mentioned in my first memo below, with an
average file size of 64K, ran for 6hrs 5mins and transferred
4.8GB.  The whole drive took 45hrs.
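
To put the two restores above on the same footing, here is a rough
Python sketch that converts the quoted figures into GB per hour and
estimates files per second for the small-file restore.  It assumes
1GB = 1024 x 1024 KB and that the 64K average file size applies to
all 4.8GB; both are assumptions for illustration, not measurements.

```python
def gb_per_hour(gigabytes, hours, minutes):
    """Average restore rate in GB per hour."""
    return gigabytes / (hours + minutes / 60)

# Figures quoted in the memo.
large_rate = gb_per_hour(20, 1, 45)    # few large files
small_rate = gb_per_hour(4.8, 6, 5)    # many ~64K files

# Assumption: 1GB = 1024 * 1024 KB, and every file is the 64K average.
approx_files = 4.8 * 1024 * 1024 / 64
elapsed_seconds = 6 * 3600 + 5 * 60
files_per_sec = approx_files / elapsed_seconds

print(f"large-file restore: {large_rate:.2f} GB/hour")
print(f"small-file restore: {small_rate:.2f} GB/hour")
print(f"slowdown factor:    {large_rate / small_rate:.1f}x")
print(f"approx. files/sec:  {files_per_sec:.1f}")
```

Under those assumptions the small-file restore handles only a
handful of files per second, which suggests that per-file overhead
rather than raw bandwidth dominates the elapsed time.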

Our NT group was a hard sell for replacing Arcserve with TSM.
Since the switch, I have taken quite a beating about TSM restore
performance.  Our NT admins take the position, "we'll try TSM but
if the performance doesn't improve we are going with a tried and
true solution like Compaq Enterprise Backup.  TSM seems to us
like a UNIX product trying to make it in the NT space.  It is not
typically selected by companies for NT backup and recovery".
Not a word-for-word quote, but it generally sums up their position.
The Compaq solution would use Arcserve from what I've been told.

I know Tivoli/IBM have tried to address the small files issue
with things like small file aggregation but I haven't noticed
much improvement from version to version for big restores of
servers with small files.  I've heard different reasons for slow
performance with small files over the years, like the number of
TSM database lookups, NT file system processing/inefficiencies,
etc.  When looking at future directions for SAN backups I can
understand the argument that the SAN pipes will be faster and
TCP/IP overhead will be eliminated, leading to faster
restores/backups.  But if the poor performance for small files
has a lot to do with TSM database lookups/overhead, then how will
performance be different when the data travels over the SAN
versus the LAN/WAN?  The database processing of file
information will be pretty much the same, won't it?

I have suggested to our NT admins that we break that big D:
partition into multiple smaller partitions so I can collocate by
filespace and restore multiple drives concurrently.  Frankly, they
are not
interested in changing the way they configure their servers to
accommodate the backup software.  They feel they would not have
to do this with Arcserve or other more common NT backup products.
I've tried tests using share names for folders and performing
backups/restores using the UNC name, collocating the data by
filespace and running concurrent restores.  My tests showed
improved elapsed time but this scheme would be tough to maintain.
In a full server restore scenario, I'd need to create the folders
and shares for the target restore, which means we'd need to keep
track of that info somewhere.  I'd constantly have to monitor
growth in all the folders to make sure I've carved up the drive
in fairly equal parts to optimize for restore, etc.  Not a good
solution either.
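
To illustrate the bookkeeping that "carving up the drive in fairly
equal parts" would involve, here is a minimal Python sketch that
assigns top-level folders to a fixed number of concurrent restore
streams using a greedy largest-first heuristic.  The folder names,
sizes, and stream count are hypothetical, and this is only one
possible way to balance the groups, not anything TSM provides.

```python
import heapq

def balance_folders(folder_sizes_gb, streams):
    """Greedy largest-first assignment of folders to restore streams.

    Returns (groups, totals): the folder names assigned to each
    stream and the total GB assigned to each stream.
    """
    # Min-heap of (GB assigned so far, stream index).
    heap = [(0.0, i) for i in range(streams)]
    heapq.heapify(heap)
    groups = [[] for _ in range(streams)]
    by_size = sorted(folder_sizes_gb.items(),
                     key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        total, i = heapq.heappop(heap)   # least-loaded stream so far
        groups[i].append(name)
        heapq.heappush(heap, (total + size, i))
    totals = [0.0] * streams
    for total, i in heap:
        totals[i] = total
    return groups, totals

# Hypothetical folder sizes in GB for a 160GB data drive.
sizes = {"groups": 60, "users": 45, "apps": 30,
         "shared": 15, "archive": 10}
groups, totals = balance_folders(sizes, 3)
for i, (names, total) in enumerate(zip(groups, totals)):
    print(f"stream {i}: {names} ({total:.0f} GB)")
```

The assignment would have to be recomputed as folders grow, and
the matching shares recreated on the restore target, which is the
maintenance burden described above.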

Does anyone else see the poor performance for restoring clients
with lots of small files and feel that this is a problem Tivoli
needs to address?  I do.  If this issue is not resolved then I
won't be able to keep using TSM to back up our NT servers.

Thanks,
Jeff Connor
Niagara Mohawk Power Corp.


---------------------- Forwarded by Jeffrey P Connor/IT/NMPC on
09/20/2000 10:32 AM ---------------------------


Jeffrey P Connor
09/13/2000 01:20 PM

To:   ADSM-L AT VM.MARIST DOT EDU
cc:

Subject:  Slow restore for large NT client.. help!


     We are in the process of restoring a subdirectory of a very
large NT client file space (D:) and it is running really slow.  I
thought I'd see if any of you have some ideas as to where we can
look for bottlenecks.
The client config is:
     Compaq ProLiant 5500
     400MB RAM
     two 400MHz Xeon processors
     ~160GB of disk in a Compaq disk array made up of 18.2GB drives
     Windows NT 4.0 SP6a
     TSM client for NT 3.7.2.01
     Applicable TSM client options:
          tcpwindowsize  63
          tcpbuffsize    31
          tcpnodelay     yes
          txnbytelimit   25600

TSM server config:
     TSM for OS/390 V3.7.1.0
     OS/390 2.6
     9672-R55
     TSM server DB cache hit ratio 98.5%
     Applicable TSM server options:
          TXNGROUPMAX         256
          Databufferpoolsize  262144



Network path:
     NT Client --100Mbit Ethernet--> Switch --100Mbit Ethernet-->
     Cisco 7513 rtr --155Mbit ATM--> Cisco 5500 ATM switch -->
     IBM 2216 --ESCON--> S/390 TSM Server


Now that you have the background, here's what we are seeing.

Only 4.7GB has been transferred in 4 hours.  We are attempting to
restore one subdirectory on the D: drive first.
TSM command line client command entered was:
     RES -subd=y \\filecluster2\d$\groups\ugitoper\*
The D: drive has approximately 2,000,000 files.  Lots of small
files.  NT client is a file and print server.
A network sniffer trace shows mostly large chunks of data sent,
no retransmits, then the NT client appears to throttle back,
decreasing the TCP window size as if it could not accept the data
as fast as TSM was sending it.  Window sizes go to zero at times,
then bounce back to a large window size (64512).
NT perfmon shows plenty of memory and CPU with minimal disk
queueing.
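
As a rough sanity check on whether the TCP window itself could be
the limit, a single TCP connection cannot move more than about one
window of data per round trip.  The Python sketch below applies
that rule to the 64512-byte window seen in the trace.  The
round-trip times are hypothetical (none were measured here), and
it assumes 1GB = 1024**3 bytes.

```python
GIB = 1024 ** 3  # assumption: "GB" in the memo means 2**30 bytes

def window_limited_rate(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput, bytes/sec."""
    return window_bytes / rtt_seconds

def implied_rtt(window_bytes, observed_bytes_per_sec):
    """RTT at which the window alone would explain the observed rate."""
    return window_bytes / observed_bytes_per_sec

WINDOW_BYTES = 64512                      # largest window in the trace
observed_rate = 4.7 * GIB / (4 * 3600)    # 4.7GB in 4 hours, bytes/sec

# Hypothetical round-trip times; substitute a measured value.
for rtt_ms in (1, 5, 20):
    ceiling = window_limited_rate(WINDOW_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>2} ms -> ceiling "
          f"{ceiling * 3600 / GIB:6.1f} GB/hour")

rtt_needed = implied_rtt(WINDOW_BYTES, observed_rate)
print(f"observed rate: {observed_rate / 1024:.0f} KB/sec")
print(f"RTT needed for the window to be the limit: "
      f"{rtt_needed * 1000:.0f} ms")
```

If the measured round-trip time is far below the figure printed on
the last line, the window size alone does not explain the rate,
and the zero-window periods would point at the receiving side not
draining data from the socket fast enough.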



This brings me to my question.  What tools can I use, or what
metrics in perfmon can I check, to see "under the covers" and
determine what is slowing us down?  The network support staff
feel the network bandwidth is there and that the NT client is
throttling things back.  The NT support staff say the NT
client machine is not overwhelmed in terms of CPU, memory, disk,
etc.; they feel TSM is the problem.

What could be the bottleneck on the NT client and what tool can I
use to find it?

Thanks in advance for your assistance,
Jeff Connor
Niagara Mohawk Power Corp