Subject: Re: Restore performance problem
From: Rainer Wolf <rainer.wolf AT KIZ.UNI-ULM DOT DE>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 1 Apr 2005 11:36:05 +0200
Hi,
I have done quite similar restores on our mail server.
You may also look at what happens to the restore process on the
client. It may happen that the CPU is at 100 % for the
'dsmc restore ...' process. Another thing is the file system on the
client: check the file-system/disk activity and service times for any
'weakness' that may result from creating that many inodes.
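As a minimal sketch (the interval values are just examples), on a
Solaris or Linux client you could watch both at once:

   pgrep dsmc                 # find the PID of the restore process
   prstat -p <PID> 5          # Solaris: per-process CPU every 5 seconds
   iostat -xn 5               # Solaris: asvc_t = disk service time
   # on Linux: 'top -p <PID>' and 'iostat -x 5' give the same picture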

I have recently done a lot of mail-server restores (always 3.5 million
files / 140 GB) using an old TSM server (v5.1.9.5 with 3490 K tapes and
the same config as you ... 10 tapes) and observed that especially this
old TSM server was at its limit. In particular, the I/O configuration
of that old TSM server was very bad: DB, log and disk cache are mixed
on the same disks. This decreases restore performance, especially when
other activity (backups at night) happens.
So we used

   dsmc restore -quiet /mail/ /data2/mail/

(with tcpwindowsize 64, tcpbuffsize 32, largecommbuffers no,
txnbytelimit 25600, resourceutilization 3) and finally received the
3.5 million files / 140 GB in 09:53:34. For me that was OK because I
know about the bad constitution of the server.
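For reference, in the client options file those settings would look
roughly like this (a sketch; the server stanza name is just a
placeholder):

   * dsm.sys (Unix) client stanza -- placeholder servername
   SErvername          TSMSRV1
   TCPWindowsize       64
   TCPBuffsize         32
   LARGECOMmbuffers    no
   TXNBytelimit        25600
   RESOURceutilization 3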
The restore time would be much worse if the restore fell into a time
when the TSM DB gets a lot of other transactions, like nightly backups.
... restoring the same with only one drive results in 51 hours.


Running the same mail-restore test on new hardware (new DB, TSM 5.3,
with 3592 drives), using the same restore client, we finally got
3.5 million files / 150 GB restored in 04:52:00
... using just one drive, because the data fits on one 3599 tape.
But here I have experienced a reproducible bug/behaviour (the issue is
at the moment 'closed' because Solaris 10 is not yet supported): when
starting the restore, everything runs fine and fast (with a restore
performance of about 1 million files/hour) ... after some time, maybe
40 % of the total restore time, the CPU of the client rises to 100 %
and the restore performance (data/files) thus slows down; no reason
for this was found on the server or on the client.
... maybe it happens when a very big directory with a lot of
directories in it is in progress ...
In the end I found a workaround: I cancelled this slowed-down restore
process running at 100 % CPU ('dsmc restore -quiet /mail/ /data2/mail/')
with Control-C and let it shut down ... and then I just restarted the
restore with 'dsmc restart restore -quiet'. This restarted restore runs
fast again and finally ends with the 04:52:00 total time.
If I do not stop/restart the client restore session, the restore
finishes in 06:49:09.
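As a sketch, the whole sequence looks like this (assuming the
restartable restore has not expired on the server; see the
RESTOREINTERVAL server option):

   dsmc restore -quiet /mail/ /data2/mail/
   # ... CPU of dsmc climbs to 100 %, throughput drops:
   # press Control-C and let the client shut down, then
   dsmc query restore          # lists the restartable restore session
   dsmc restart restore -quiet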
That is reproducible, and it is quite a big difference (30 % faster
with interrupting and restarting), but maybe it's because of our
unsupported TSM version ... or has someone else seen this
"CPU-crunching" behaviour?

Greetings 
Rainer



Thomas Denier wrote:
> 
> We recently restored a large mail server. We restored about nine million
> files with a total size of about ninety gigabytes. These were read from
> nine 3490 K tapes. The node we were restoring is the only node using the
> storage pool involved. We ran three parallel streams. The restore took
> just over 24 hours.
> 
> The client is Intel Linux with 5.2.3.0 client code. The server is mainframe
> Linux with 5.2.2.0 server code.
> 
> 'Query session' commands run during the restore showed the sessions in 'Run'
> status most of the time. Accounting records reported the sessions in media
> wait most of the time. We think most of this time was spent waiting for
> movement of tape within a drive, not waiting for tape mounts.
> 
> Our analysis has so far turned up only two obvious problems: the
> movebatchsize and movesizethreshold options were smaller than IBM
> recommends. On the face of it, these options affect server housekeeping
> operations rather than restores. Could these options have any sort of
> indirect impact on restore performance? For example, one of my co-workers
> speculated that the option values might be forcing migration to write
> smaller blocks on tape, and that the restore performance might be
> degraded by reading a larger number of blocks.
> 
> We are thinking of running a test restore with tracing enabled on the
> client, the server, or both. Which trace classes are likely to be
> informative without adding too much overhead? We are particularly
> interested in information on the server side. The IBM documentation for
> most of the server trace classes seems to be limited to the names of the
> trace classes.

-- 
------------------------------------------------------------------------
Rainer Wolf                          eMail:       rainer.wolf AT uni-ulm DOT de
kiz - Abt. Infrastruktur           Tel/Fax:      ++49 731 50-22482/22471
Universität Ulm                      wwweb:        http://kiz.uni-ulm.de
