ADSM-L

Re: Server Performance Problems When Restoring Large Filesystems

1998-05-13 10:04:45
Subject: Re: Server Performance Problems When Restoring Large Filesystems
From: "Purdon, James" <james_purdon AT MERCK DOT COM>
Date: Wed, 13 May 1998 10:04:45 -0400
Hi,
   Naturally, when the clients' attempts to do a single massive restore
failed, I suggested that they do multiple partial restores.  Unfortunately,
they had no idea what was on the filesystem, forcing them to run "dsmc
query backup" (which, as I mentioned before, does not complete in under 24
hours) in order to get a listing of what was on the filesystem.

   I have yet to experience the joys of the version 3 server.  However, I
think the changed behavior you describe is encountered during backups, not
restores.  During restores, the server must produce a list of files and
their associated tapes.  No matter whether sorting is done on the client or
server, I do not believe ADSM will restore even a single file until it has
a) completed the list and b) sorted the files by tape.  Without the sort, a
tape mount could potentially be required for every single file.  At 45
seconds per tape mount in an IBM 3494, that comes to roughly five years for
3.5 million files.
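
   To put rough numbers on the mount arithmetic, here is a minimal sketch in
Python.  It is only an illustration, not how ADSM works internally: the
tape labels and the helper names (worst_case_mount_years, count_mounts) are
made up, and a single drive is assumed.  It shows the mount time if every
file needs its own mount, and how sorting the restore list by tape cuts the
number of mounts.

```python
from itertools import groupby

SECONDS_PER_MOUNT = 45      # per-mount figure quoted in this thread
FILE_COUNT = 3_500_000      # approximate file count from this thread


def worst_case_mount_years(files, seconds_per_mount):
    """Years spent on mounts alone if every file needs its own mount."""
    return files * seconds_per_mount / (365 * 24 * 3600)


def count_mounts(tape_sequence):
    """Mounts needed when files are read in this order on a single drive.

    Each run of consecutive files on the same tape costs one mount.
    """
    return sum(1 for _ in groupby(tape_sequence))


# Hypothetical restore list: the tape holding each file, in file-name order.
unsorted_tapes = ["T1", "T2", "T1", "T3", "T2", "T1"]

print(round(worst_case_mount_years(FILE_COUNT, SECONDS_PER_MOUNT), 2))
print(count_mounts(unsorted_tapes))          # one mount per change of tape
print(count_mounts(sorted(unsorted_tapes)))  # one mount per distinct tape
```

   With the list sorted by tape, the mount count drops to the number of
distinct tapes, which is why the sort has to finish before the first file
can come back.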

   Regardless of the naivete of my clients, I feel the larger issues are:

*       Why should the server hang when multiple sorts are requested by the
client?
*       Why is there no correlation between dsmserv pid and ADSM session and
process ids?
*       Why should killing a single dsmserv process crash the whole server?
*       What is the best way to tune my server so that database seek times
are minimized?
*       How does ADSM performance relate to the number of files being
restored?

   As disk space gets cheaper and journaled file systems and large RAID
arrays proliferate in the UNIX environment, filesystems with millions of
files will become common.  ADSM needs to be able to handle these
environments gracefully.

Jim

> ----------
> From:         Kelly J. Lipp[SMTP:lipp AT STORSOL DOT COM]
> Sent:         Tuesday, May 12, 1998 5:39 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: Server Performance Problems When Restoring Large
> Filesystems
>
> This is in the category of "Dr., Dr., it hurts when I do this!  So don't
> do this!"
>
> Is it possible to begin the restore at a lower level in the directory tree
> and bite this off in smaller chunks?  The timeout is coming while the
> client is retrieving a list of files to restore, 3.5 M in your case.  With
> ADSM V3 a no-query restore is used.  That is, the client no longer needs
> to have the list of files before a restore can start.  This is my
> understanding anyway.
>
> Kelly Lipp
> Storage Solutions Specialists, Inc.
> lipp AT storsol DOT com
> www.storsol.com
> (719) 531-5926
>
> -----Original Message-----
> From:   Purdon, James [SMTP:james_purdon AT MERCK DOT COM]
> Sent:   Tuesday, May 12, 1998 3:05 PM
> To:     ADSM-L AT VM.MARIST DOT EDU
> Subject:        Server Performance Problems When Restoring Large
> Filesystems
>
> Hi,
>
>   I have an ADSM client who has a system (SGI running IRIX) with
> approximately 3.5 million files in a single file system (give or take a
> few).  Three weeks ago, the file system crashed.  Since that time, we have
> made a few discoveries:
>
> *       A "dsmc q backup" of the file system takes more than 24 hours to
> complete (in fact, we have never seen it complete).
> *       A "dsmc restore -tapeprompt=no -subdir=yes /filesystem" runs until
> the connection times out.  We have tried setting COMMTimeout and
> IDLETimeout to 72000 and 1200 respectively, but to no avail.  The restore
> just takes longer to time out.
> *       If the client starts too many "dsmc restore" and/or "dsmc query
> backup" sessions (too many being more than one), the server becomes
> unavailable to all other client sessions, whether they be backup or admin
> clients.  On the server we can see one dsmserv process consuming 99% of
> the CPU cycles.
> *       There is no way to associate the ADSM session and/or process
> number with the pid of a dsmserv process (so we can't tell which
> operation is causing the problem).
> *       Suspending the dsmserv process (with a kill SIGSTOP) does not
> allow client connections to resume.
> *       Killing the errant dsmserv process causes all dsmserv processes to
> die (actually we knew this before, it just wasn't as annoying).
> *       Estimates suggest that it will take more than 60 days to restore
> all the files.  We once restored twice the data (but in only 250,000
> files) in 8 days.  It looks like ADSM performance is dependent on the
> number of files, rather than the size of the data.  Estimates that you may
> have formed based on device bandwidth may be misleading.
> *       Our cache hit rate is 99.54% and our cache wait percent is 0, but
> still ...
> *       IBM is aware of the situation but has no plans to address or
> improve it.
>
> Here are the results of "query occupancy" on the problematic file system:
>
> Node Name  Type  Filespace    Storage Pool  Number of  Space Occupied
>                  Name         Name          Files      (MB)
> ---------  ----  -----------  ------------  ---------  --------------
> IRIX1234   Bkup  /filesystem  IBMBACKUP     3,686,817  112,714.61
> IRIX1234   Bkup  /filesystem  IBMPOOL       3,686,817  112,714.61
> IRIX1234   Bkup  /filesystem  OFFSITE01     3,686,817  112,714.61
>
> Our software is: AIX 3.2.5, DSMSERV 2.1.0.13, clients 2.1.0.6 and 2.1.0.8.
>
> At this point it would probably be helpful if the AIX/ADSM tuning document
> (occasionally mentioned in this mailing list) were publicly accessible.
>
> Jim
>