ADSM-L

ADSM multiple server crash

1999-06-19 16:48:35
Subject: ADSM multiple server crash
From: Lauer Edouard <Edouard.Lauer AT BIL-DEXIA DOT COM>
Date: Sat, 19 Jun 1999 22:48:35 +0200
Hi,


On Friday night at 2:35am we had an ADSM server crash. What I mostly saw
in the activity log was a seek and write error on database volume
/dev/radsm_db2.
The ADSM server is 3.1.2.20 running on AIX 4.2.1. Please see adsm_crash.txt
for more details of the error.
Afterwards I restarted the server and everything worked well until 6:45am.
The server then crashed again, and this time there was no way to restart it.
Because we had the server in roll-forward mode, we then began to restore it
from the last database backup available.
That sounds good, but because the restore re-applied the logs to the
database, the same error as in the two server crashes came back. On the
second try we restored the database without reapplying the logs, and
afterwards we succeeded in starting the database. Things went well, and
backups and restores could be done until 5pm, when the server crashed again
with the same errors.
At this point we began to think that there could be a problem with thread
management in version 3.1.2.20 of the ADSM server. We came to this
conclusion because, when we analyzed the core dump produced by the ADSM
server crash, we found the following line:
IOT/Abort trap in pthread_kill at 0xd03c1c6c ($t1769234249)
At this point we decided to install the oxford version (3.1.2.24), although
it is not officially supported by IBM. That hardly mattered given where we
were... It was Friday 9pm and no backups or restores had been done.
After installing the new version we restarted the ADSM server and ran an
auditdb on it with fix=yes. At that time we disabled sessions so that nobody
else could get onto the ADSM server. On Friday at 11pm I decided to go home
because I was really dead.
Today I came in and what I saw was terrifying. The server had crashed again.
I restarted it again so that some backups could be done, but at this point
I'm really out of explanations...
For the moment I'm trying the following points:

        1. Increasing the size of bufpoolsize & logpoolsize     - Status: Not working -> New crash
        2. Scratching all the db,log devices & recreating them  - Status: Open
        3. Downgrading to version 3.1.0.5                       - Status: Open

All I can say is that the last point is the horror scenario, because I'm
working in a bank and you can easily imagine how important it is to back up
our data. Currently we're only backing up NT and UNIX systems (over 80
servers).
The second point is that memory and database handling in version 3.1.2.20
have deteriorated a lot since version 3.1.0.5.
For comparison: we never had a problem with version 3.1.0.5. Since we
upgraded to 3.1.2.20 the problems have been accumulating...

Have a nice weekend  everyone,
_________________ Lauer Edouard ____________________
______ Prod. informatique ____ Systèmes Ouverts ________
__ * +352 4590 3889 __ * Edouard.Lauer AT bil-dexia DOT com __

