ADSM-L

Re: Server Performance Problems When Restoring Large Filesystems

1998-05-13 10:45:50
Subject: Re: Server Performance Problems When Restoring Large Filesystems
From: "Purdon, James" <james_purdon AT MERCK DOT COM>
Date: Wed, 13 May 1998 10:45:50 -0400
Hi,
  The short answer is "ADSM supports an administratively heterogeneous
environment", which is management speak for "I don't manage the client
system".

  As I understand it, the filesystem in question is a journaled file system
and takes very little time for cleanup during system boots.  In addition, it
takes about two hours to complete a "dsmc incr", which does not seem
unreasonable to the clients.  Of course, "dsmc restore" is another issue, but
only when trying to restore the whole kit and caboodle.  A "dsmc restore" of
an individual file is fairly quick as well.  And thanks to the directory
structure, an "ls -l *" is pretty quick too (since it only shows part of the
directory structure).  An "ls -lR" is probably a different story, but no one
has ever needed to do this for the whole file system.
The system is a 64-bit SGI/Cray Origin 2000, so performance doesn't seem to
be an issue.

  The nature of the data in this file system dictated the large number of
files. While the client has plans to place the particular application which
requires these files in its own filesystem, there are no plans to break it
down any further, though ADSM's virtual filespace feature may be deployed to
logically divide it.
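
  Since a restore of a small piece is quick while a restore of the whole
file system times out, one thing that could be tried is to issue one restore
per top-level directory.  The following is only a sketch, not something that
has been run against this server: the directory names "projA" and "projB" are
hypothetical, the helper restore_commands() is made up for illustration, and
the only dsmc options used are the ones already quoted in this thread.  It
merely prints the command lines; whether smaller restores actually avoid the
timeout is an open question.

```python
import shlex


def restore_commands(filesystem, subdirs):
    """Build one 'dsmc restore' command line per top-level subdirectory.

    'filesystem' is the mount point; 'subdirs' are directory names directly
    beneath it (hypothetical names in the example below).
    """
    root = filesystem.rstrip("/")
    commands = []
    for name in subdirs:
        # Trailing slash: restore the directory's contents, with -subdir=yes
        # descending into everything below it.
        target = f"{root}/{name}/"
        args = ["dsmc", "restore", "-tapeprompt=no", "-subdir=yes", target]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # "projA" and "projB" are placeholders, not real directories.
    for line in restore_commands("/filesystem", ["projA", "projB"]):
        print(line)
```

Running the restores one at a time would also respect the observation that
more than one concurrent session makes the server unavailable.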

Jim

> ----------
> From:         Lynch, Rich[SMTP:Lynch.Rich AT MBCO DOT COM]
> Sent:         Tuesday, May 12, 1998 5:46 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: Server Performance Problems When Restoring Large Filesystems
>
> Why can't you split up those files into multiple filesystems, with mount
> points under the current filesystem? This would make UNIX a lot happier
> in terms of filesystem cleanup time, backup/restore time, etc. I would
> hate to do an ls -l * on that filesystem!
>
> Richard Lynch
> AIX SYSTEMS ADMINISTRATOR
> MILLER BREWING COMPANY
> MILWAUKEE WI
> 414 931 2060
> Lynch.Rich AT mbco DOT com
>
> On the keyboard of life, keep one finger on the escape character
>
> > ----------
> > From:         Purdon, James[SMTP:james_purdon AT merck DOT com]
> > Sent:         Tuesday, May 12, 1998 4:04 PM
> > To:   ADSM-L AT VM.MARIST DOT EDU
> > Subject:      Server Performance Problems When Restoring Large Filesystems
> >
> > Hi,
> >
> >   I have an ADSM client who has a system (SGI running IRIX) with
> > approximately 3.5 million files in a single file system (give or take a
> > few).  Three weeks ago, the file system crashed.  Since that time, we have
> > made a few discoveries:
> >
> > *       A "dsmc q backup" of the file system takes more than 24 hours to
> > complete (in fact, we have never seen it complete).
> > *       A "dsmc restore -tapeprompt=no -subdir=yes /filesystem" runs until
> > the connection times out.  We have tried setting COMMTimeout and
> > IDLETimeout to 72000 and 1200 respectively, but to no avail.  The restore
> > just takes longer to time out.
> > *       If the client starts too many "dsmc restore" and/or "dsmc query
> > backup" sessions (too many being more than one), the server becomes
> > unavailable to all other client sessions, whether they be backup or admin
> > clients.  On the server we can see one dsmserv process consuming 99% of
> > the CPU cycles.
> > *       There is no way to associate the ADSM session and/or process number
> > with the pid of a dsmserv process (so we can't tell which operation is
> > causing the problem).
> > *       Suspending the dsmserv process (with a kill SIGSTOP) does not allow
> > client connections to resume.
> > *       Killing the errant dsmserv process causes all dsmserv processes to
> > die (actually we knew this before, it just wasn't as annoying).
> > *       Estimates suggest that it will take more than 60 days to restore
> > all the files.  We once restored twice the data (but in only 250,000
> > files) in 8 days.  It looks like ADSM performance is dependent on the
> > number of files, rather than the size of the data.  Estimates that you
> > may have formed based on device bandwidth may be misleading.
> > *       Our cache hit rate is 99.54% and our cache wait percent is 0, but
> > still ...
> > *       IBM is aware of the situation but has no plans to address or
> > improve it.
> >
> > Here are the results of "query occupancy" on the problematic file system:
> >
> > Node Name  Type  Filespace    Storage    Number of  Space Occupied
> >                  Name         Pool Name  Files      (MB)
> > ---------  ----  -----------  ---------  ---------  --------------
> > IRIX1234   Bkup  /filesystem  IBMBACKUP  3,686,817  112,714.61
> > IRIX1234   Bkup  /filesystem  IBMPOOL    3,686,817  112,714.61
> > IRIX1234   Bkup  /filesystem  OFFSITE01  3,686,817  112,714.61
> >
> > Our software is: AIX 3.2.5, DSMSERV 2.1.0.13, clients 2.1.0.6 and 2.1.0.8.
> >
> > At this point it would probably be helpful if the AIX/ADSM tuning document
> > (occasionally mentioned in this mailing list) were publicly accessible.
> >
> > Jim
> >
>
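
For what it's worth, the claim in the original post that restore time tracks
the number of files rather than the amount of data can be checked with the
figures quoted above.  This is a rough sketch under two assumptions: "twice
the data" is taken to mean twice the 112,714.61 MB shown by "query
occupancy", and "more than 60 days" is treated as exactly 60 days (so the
current rates are upper bounds).  The rates() helper is just for
illustration.

```python
# Figures from the quoted post.
FILES_NOW = 3_686_817
MB_NOW = 112_714.61
DAYS_NOW = 60              # assumption: lower bound from "more than 60 days"

FILES_BEFORE = 250_000
MB_BEFORE = 2 * MB_NOW     # assumption: "twice the data" = 2 x current occupancy
DAYS_BEFORE = 8


def rates(files, megabytes, days):
    """Return average restore throughput in files/day and MB/day."""
    return {"files_per_day": files / days, "mb_per_day": megabytes / days}


now = rates(FILES_NOW, MB_NOW, DAYS_NOW)
before = rates(FILES_BEFORE, MB_BEFORE, DAYS_BEFORE)

# How much faster (or slower) the earlier restore was, by each measure.
file_ratio = before["files_per_day"] / now["files_per_day"]
mb_ratio = before["mb_per_day"] / now["mb_per_day"]

if __name__ == "__main__":
    print(f"now:    {now['files_per_day']:.0f} files/day, {now['mb_per_day']:.0f} MB/day")
    print(f"before: {before['files_per_day']:.0f} files/day, {before['mb_per_day']:.0f} MB/day")
    print(f"ratio before/now: files {file_ratio:.2f}x, MB {mb_ratio:.2f}x")
```

Under those assumptions the two restores differ by a factor of 15 in MB/day
(2 x 60 / 8) but only by about a factor of two in files/day, which fits
the observation that bandwidth-based estimates are misleading here.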