ADSM-L

MIRRORWRITE option

1998-03-26 22:32:18
Subject: MIRRORWRITE option
From: Trevor Foley <Trevor.Foley AT BANKERSTRUST.COM DOT AU>
Date: Fri, 27 Mar 1998 14:32:18 +1100
Hi,

This week we had two incidents of an extremely serious nature with
our ADSM server. We are now up and running again, so this message, for a
change, isn't asking for help. Rather it is to raise awareness of the
problems that we had, in the hope that some of you may be saved from
some of the discomfort that we have been through.

First, a little about our environment. We have 3 ADSM servers, but this
problem has only affected one of them:

*       ADSM server runs under AIX 4.2.1
*       ADSM server version 3.1.0.2
*       AIX mirroring used for operating system
*       ADSM mirror used for ADSM database and log
*       ADSM storage pool on raid-5 SSA disks
*       Operating system and database logs use the same physical pair of
disks (SSA)

Last Sunday night, the disk that holds the primary member of the rootvg
mirror set, plus the primary copy of the ADSM logs, went off-line. We
were unable to log into the system and, from the console log, we were
able to determine that the ADSM server had died soon after the disk
failure. Unfortunately we weren't notified of the failure because all of
the monitoring processes that were running on the system failed, yet the
system itself was still running, so external monitoring using ping still
showed the system as up.
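
A ping only proves that the machine answers on the network. A rough
sketch of a check that instead tests whether the server process is
still accepting connections is shown here in Python. The function name
is hypothetical, and the host and port you pass in are assumptions
about your own setup (1500 is commonly the ADSM TCP port, but check the
TCPPORT setting in your options file):

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A dead server process normally leaves its port closed, so the connect
    fails even though the machine itself still answers ping.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Example call (hypothetical host name, assumed port):
#   service_is_up("adsm-dev.example.com", 1500)
```

Run from a machine other than the ADSM server, a check like this does
not depend on the monitoring processes on the failed system itself.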

Given that we were unable to log into the machine we had no choice other
than to restart the server. By resetting the failed disk, we were able
to bring it back on-line. AIX started normally then, but ADSM would not
start. We were getting the infamous ANR9999 error, in this case saying
that a mutex acquisition had failed. The ADSM server, on attempting to
start, also reported that one volume, the primary copy of the database
log, had been varied off-line. This shouldn't have been a problem, as we
should have had a valid second copy. However, the server refused to
start. We tried lots of things, including renaming log files, etc. to
try to get the server to start. After many hours of trying, we decided
that we had no choice but to restore to the previous full backup. This
of course invalidated the log (and naturally we had forgotten to take a
backup of it first). And the restore still failed. This was finally
resolved along the lines that I mentioned here a few days back, i.e.
there is a known problem with the V3 server allowing client connections
while the database restore is running. We stopped that by removing the
TCPIP and HTTP commmethod lines and replacing them with a COMMMETHOD
SHAREDMEM line. After that the database restored OK, although our data
was now as of 10am Sunday, and it was by then 6pm Monday.
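
A sketch of what that change to dsmserv.opt might look like follows. It
is an illustration, not a copy of the actual file, and it assumes that
lines beginning with an asterisk are treated as comments in the options
file:

```
* Network access disabled while the database restore runs
* COMMMETHOD TCPIP
* COMMMETHOD HTTP
COMMMETHOD SHAREDMEM
```

Once the restore has completed, the original COMMMETHOD lines would
need to be put back so that clients can connect again.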

The same disk drive failed in the same way 2 days later. Again, by
resetting the drive, we managed to get AIX up OK, but again ADSM would
not start. This time though there was no indication that any log volume
had been varied off-line, and the error message was different ('run-time
assertion failed' rather than 'mutex acquisition failed').

However, ADSM would again not start. We tried a restore with
roll-forward, which completed successfully (after 4.5 hours), but ADSM
still would not start. IBM support at this stage recommended that we do
a dump/load/auditdb. But from what I have experienced before, and read
on this list, my guess was that this would have taken a minimum of 36
hours, and possibly more (our database is 14.5GB). The only other choice
we had was to again do a point-in-time restore to the previous day at
10am. We chose the latter option, and had the server back on the air
around 2 hours later.

IBM support have recommended that we change the MIRRORWRITE LOG option
in dsmserv.opt to sequential, rather than the default of parallel, as
there have been reported instances of log file corruption where, by
setting MIRRORWRITE LOG to sequential, the corruption could have been
avoided. If this is true, why have IBM not told their customer base
about this? Instead, they have created an extremely peeved customer.

We have not yet enabled this option, but will do so very soon. I am a
little wary about doing it as the performance impact is unknown (IBM
were unable to give us any idea).
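
For reference, the recommended change would presumably be a single line
in dsmserv.opt along these lines. The exact keyword spelling should be
checked against the server documentation for your level, and the
comment syntax (a leading asterisk) is an assumption:

```
* Write the mirrored copies of the log one after the other
* instead of at the same time (the default is PARALLEL)
MIRRORWRITE LOG SEQUENTIAL
```

The server would most likely need to be restarted for the new setting
to take effect.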

So, for those of you who use ADSM mirroring for your database log, I
would give consideration to changing this setting, and to asking some
questions of IBM. We have effectively been without this ADSM server for
more than 3
days out of 5. Luckily for us it was our development server, rather than
one of our two production servers.


Trevor