ADSM-L

Server Performance Problems When Restoring Large Filesystems

2015-10-04 17:58:04
Subject: Server Performance Problems When Restoring Large Filesystems
From: Purdon, James [SMTP:james_purdon AT MERCK DOT COM]
To: ADSM-L AT VM.MARIST DOT EDU
Hi,

  I have an ADSM client who has a system (SGI running IRIX) with
approximately 3.5 million files in a single file system (give or take a
few).  Three weeks ago, the file system crashed.  Since that time, we have
made a few discoveries:

*       A "dsmc q backup" of the file system takes more than 24 hours to
complete (in fact, we have never seen it complete).
*       A "dsmc restore -tapeprompt=no -subdir=yes /filesystem" runs until
the connection times out.  We have tried setting COMMTimeout and
IDLETimeout to 72000 and 1200 respectively, but to no avail.  The restore
just takes longer to time out.
*       If the client starts too many "dsmc restore" and/or "dsmc query
backup" sessions (too many being more than one) the server becomes
unavailable to all other client sessions, whether they be backup or admin
clients.  On the server we can see one dsmserv process consuming 99% of
the CPU cycles.
*       There is no way to associate the ADSM session and/or process number
with the pid of a dsmserv process (so we can't tell which operation is
causing the problem).
*       Suspending (with a kill SIGSTOP) the dsmserv process does not allow
client connections to resume.
*       Killing the errant dsmserv process causes all dsmserv processes to
die (actually we knew this before, it just wasn't as annoying).
*       Estimates suggest that it will take more than 60 days to restore
all the files.  We once restored twice the data (but in only 250,000
files) in 8 days.  It looks like ADSM performance is dependent on the
number of files, rather than the size of the data.  Estimates that you may
have formed based on device bandwidth may be misleading.
*       Our cache hit rate is 99.54% and our cache wait percent is 0, but
still ...
*       IBM is aware of the situation but has no plans to address or
improve it.

Here are the results of "query occupancy" for the problematic file system:

Node Name  Type  Filespace    Storage    Number of  Space
                 Name         Pool Name  Files      Occupied
                                                    (MB)
---------  ----  -----------  ---------  ---------  ----------
IRIX1234   Bkup  /filesystem  IBMBACKUP  3,686,817  112,714.61
IRIX1234   Bkup  /filesystem  IBMPOOL    3,686,817  112,714.61
IRIX1234   Bkup  /filesystem  OFFSITE01  3,686,817  112,714.61
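
The file-count effect described above can be checked with some quick
arithmetic.  The sketch below is only illustrative: it takes the file count
and occupancy from the "query occupancy" output and the earlier restore
(250,000 files in 8 days) from the list above, and it assumes that "twice
the data" means exactly twice the 112,714.61 MB reported here.  The
function names are made up for this example.

```python
# Figures quoted in this message.
CURRENT_FILES = 3_686_817   # "Number of Files" from query occupancy
CURRENT_MB = 112_714.61     # "Space Occupied (MB)" from query occupancy
EARLIER_FILES = 250_000     # files in the earlier restore
EARLIER_DAYS = 8            # duration of the earlier restore

# Assumption: "twice the data" is taken literally as 2 x CURRENT_MB.
EARLIER_MB = 2 * CURRENT_MB


def days_at_byte_rate() -> float:
    """Restore time if throughput scaled with the amount of data."""
    mb_per_day = EARLIER_MB / EARLIER_DAYS
    return CURRENT_MB / mb_per_day


def days_at_file_rate() -> float:
    """Restore time if throughput scaled with the number of files."""
    files_per_day = EARLIER_FILES / EARLIER_DAYS
    return CURRENT_FILES / files_per_day


if __name__ == "__main__":
    print(f"scaled by data volume: {days_at_byte_rate():.1f} days")
    print(f"scaled by file count:  {days_at_file_rate():.1f} days")
```

Scaling by data volume gives about 4 days, while scaling by file count
gives roughly 118 days.  The 60-day estimate sits well beyond anything the
data volume alone would predict.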

Our software is: AIX 3.2.5, DSMSERV 2.1.0.13, clients 2.1.0.6 and 2.1.0.8.

At this point it would probably be helpful if the AIX/ADSM tuning document
(occasionally mentioned in this mailing list) were publicly accessible.

Jim