ADSM-L

Re: restore stg strangeness

2003-08-06 09:31:10
Subject: Re: restore stg strangeness
From: Richard Sims <rbs AT BU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 6 Aug 2003 09:30:50 -0400
>I've seen some stuff I don't understand with a reclaim stg command and I'd
>like anyone who does understand to enlighten me.
>
>TSM server is 4.2.3.3 on AIX 5.1 RML03 running in a p690 LPAR.  All disk is
>FC attached to IBM ESS arrays, disk stgpools are not mirrored but DB and log
>are using AIX mirroring. All TSM files are on filesystems; most are JFS, but
>newer stgpool data is on JFS2.
>
>We are very cautious here.  The TSM LPAR has two HBAs connected to separate SAN
>fabrics connected via multiple paths to two ESSes. Despite that, yesterday all
>four paths to one ESS dropped out together.  Nothing much was happening in TSM
>at the time, so we resynched the disks and continued.  However, because of
>unrelated issues hung over from the weekend, one of our disk pools was 99%
>full, so we decided to migrate all our data early.  During the migration the
>same disk dropped out again.  Some of the stgpool being migrated was on this
>disk and not mirrored, and gave repeated error messages about errors reading
>the disk until it was brought back online to AIX, at which point the errors
>stopped and the migration continued, finishing with a FAILURE notification.
>
>Afterward there was no data to be migrated, but the diskpool and some of its
>volumes weren't empty.  Accordingly I ran an AUDIT VOL FIX=yes against one of
>the affected volumes.  This went OK, but on the second volume the TSM server
>died with an error attempting rollback and would not restart.
...

Steve - When you encounter problems at the hardware or OS level, you need to
stop and address those - not continue with dependent application work or
attempt to deal with the resulting problems at the application level
(TSM).  Your AIX and ESS specialists should be involved in resolving this
storage problem, pursuing AIX Error Log problem indications and ESS-logged
errors, to uncover the cause of the instabilities that show up whenever
something tries to use the storage.  Volumes go offline due to serious
problems with the storage subsystem.  TSM is the victim, and only gets beat up
more if you attempt to continue operations without resolving the instability
issues.  Your company's data is being jeopardized by these problems: there
should be a priority effort to get them resolved.
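
One way to start on the AIX Error Log is to tally hardware-class entries per
device, so the disks or adapters that keep failing stand out.  What follows is
only a minimal Python sketch.  It assumes the usual errpt summary columns
(IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION), with class H marking
hardware errors, and the helper name count_hardware_errors is hypothetical.

```python
from collections import Counter


def count_hardware_errors(errpt_lines):
    """Count hardware-class entries per resource in errpt summary output.

    Assumes each data line looks like:
    IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
    where column C is the error class (H = hardware).
    """
    counts = Counter()
    for line in errpt_lines:
        # Split off the first five columns; the description may contain spaces.
        fields = line.split(None, 5)
        # Skip blank or short lines and the column-header line.
        if len(fields) < 5 or fields[0] == "IDENTIFIER":
            continue
        err_class, resource = fields[3], fields[4]
        if err_class == "H":
            counts[resource] += 1
    return counts
```

The function takes the lines of saved errpt output (for example, a file
produced by redirecting errpt and read with readlines()), and
counts.most_common() then lists the worst-affected resources first.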

  Richard Sims, BU
