ADSM-L

Re: Server Performance Problems When Restoring Large Filesystems

1998-05-12 17:46:02
Subject: Re: Server Performance Problems When Restoring Large Filesystems
From: "Lynch, Rich" <Lynch.Rich AT MBCO DOT COM>
Date: Tue, 12 May 1998 16:46:02 -0500
Why can't you split up those files into multiple filesystems, with mount
points under the current filesystem? This would make UNIX a lot happier
in terms of filesystem cleanup time, backup/restore time, etc. I would
hate to do an ls -l * on that filesystem!
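
To pick the split points, it helps to know which top-level directories
hold most of the files. Here is a minimal sketch in Python, assuming the
top-level subdirectories are the natural candidates for their own mount
points. The helper name count_files_per_subdir is made up for
illustration and is not part of ADSM, AIX, or IRIX.

```python
import os
from collections import Counter


def count_files_per_subdir(root):
    """Count non-directory entries under each top-level entry of root.

    Files sitting directly in root are counted under the key ".".
    Symbolic links to directories are not followed (os.walk default).
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # First path component below root identifies the candidate subtree.
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(filenames)
    return counts


def report(root):
    """Print the subtrees under root, largest file count first."""
    for name, n in count_files_per_subdir(root).most_common():
        print(f"{n:>10}  {name}")
```

Calling report("/filesystem") would list the heaviest subtrees first.
Walking a few million files takes a while in its own right, so this is
something to run once while planning, not on every backup.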

Richard Lynch
AIX SYSTEMS ADMINISTRATOR
MILLER BREWING COMPANY
MILWAUKEE WI
414 931 2060
Lynch.Rich AT mbco DOT com

On the keyboard of life, keep one finger on the escape character

> ----------
> From:         Purdon, James[SMTP:james_purdon AT merck DOT com]
> Sent:         Tuesday, May 12, 1998 4:04 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Server Performance Problems When Restoring Large Filesystems
>
> Hi,
>
>   I have an ADSM client who has a system (SGI running IRIX) with
> approximately 3.5 million files in a single file system (give or take a
> few).  Three weeks ago, the file system crashed.  Since that time, we have
> made a few discoveries:
>
> *       A "dsmc q backup" of the file system takes more than 24 hours to
>         complete (in fact, we have never seen it complete).
> *       A "dsmc restore -tapeprompt=no -subdir=yes /filesystem" runs until
>         the connection times out.  We have tried setting COMMTimeout and
>         IDLETimeout to 72000 and 1200 respectively, but to no avail.  The
>         restore just takes longer to time out.
> *       If the client starts too many "dsmc restore" and/or "dsmc query
>         backup" sessions (too many being more than one), the server becomes
>         unavailable to all other client sessions, whether they be backup or
>         admin clients.  On the server we can see one dsmserv process
>         consuming 99% of the CPU cycles.
> *       There is no way to associate the ADSM session and/or process number
>         with the pid of a dsmserv process (so we can't tell which operation
>         is causing the problem).
> *       Suspending the dsmserv process (with a kill SIGSTOP) does not allow
>         client connections to resume.
> *       Killing the errant dsmserv process causes all dsmserv processes to
>         die (actually we knew this before, it just wasn't as annoying).
> *       Estimates suggest that it will take more than 60 days to restore all
>         the files.  We once restored twice the data (but in only 250,000
>         files) in 8 days.  It looks like ADSM performance is dependent on
>         the number of files, rather than the size of the data.  Estimates
>         that you may have formed based on device bandwidth may be
>         misleading.
> *       Our cache hit rate is 99.54% and our cache wait percent is 0, but
>         still ...
> *       IBM is aware of the situation but has no plans to address or improve
>         it.
>
> Here are the results of "query occupancy" on the problematic file system:
>
> Node Name  Type  Filespace Name  Storage Pool Name  Number of Files  Space Occupied (MB)
> ---------  ----  --------------  -----------------  ---------------  -------------------
> IRIX1234   Bkup  /filesystem     IBMBACKUP                3,686,817           112,714.61
> IRIX1234   Bkup  /filesystem     IBMPOOL                  3,686,817           112,714.61
> IRIX1234   Bkup  /filesystem     OFFSITE01                3,686,817           112,714.61
>
> Our software is: AIX 3.2.5, DSMSERV 2.1.0.13, clients 2.1.0.6 and 2.1.0.8.
>
> At this point it would probably be helpful if the AIX/ADSM tuning document
> (occasionally mentioned in this mailing list) were publicly accessible.
>
> Jim
>
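
The numbers quoted above can be turned into rough restore rates. The
sketch below is back-of-the-envelope arithmetic only. It assumes that
"twice the data" means twice the 112,714.61 MB shown by "query
occupancy", and it treats "more than 60 days" as exactly 60, so the
large-filesystem rates are upper bounds. The function name
restore_rates is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def restore_rates(files, megabytes, days):
    """Return (files per second, MB per second) for a restore of given size."""
    seconds = days * SECONDS_PER_DAY
    return files / seconds, megabytes / seconds


# Figures from the quoted message; the MB value for the earlier restore is
# an assumption (twice the occupancy reported for /filesystem).
occupied_mb = 112_714.61
big_files_s, big_mb_s = restore_rates(3_686_817, occupied_mb, days=60)
old_files_s, old_mb_s = restore_rates(250_000, 2 * occupied_mb, days=8)

# How far apart the two restores are on each measure.
mb_ratio = old_mb_s / big_mb_s          # data rate, earlier vs. estimated
file_ratio = big_files_s / old_files_s  # file rate, estimated vs. earlier

print(f"files/s: {big_files_s:.2f} (estimated) vs {old_files_s:.2f} (earlier)")
print(f"MB/s:    {big_mb_s:.3f} (estimated) vs {old_mb_s:.3f} (earlier)")
print(f"data rate differs by {mb_ratio:.1f}x, file rate by {file_ratio:.1f}x")
```

Under those assumptions the data rate differs by a factor of 15 between
the two restores, while the files-per-second rate differs by only about
a factor of 2. That fits the observation that restore time tracks the
number of files much more closely than the amount of data.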