ADSM-L

Re: ADSM multiple server crash

1999-06-21 06:48:36
Subject: Re: ADSM multiple server crash
From: Lauer Edouard <Edouard.Lauer AT BIL-DEXIA DOT COM>
Date: Mon, 21 Jun 1999 12:48:36 +0200
Hello,

In the errpt of the machine we don't see any errors on the disk
where the ADSM db resides. We also tried to reinstall the db
on another disk, and the same problem occurred again today.
All I can say now is that I'm very pleased with the product :-((((

Regards,
_________________ Lauer Edouard ____________________
______ Prod. informatique ____ Systèmes Ouverts ________
__ * +352 4590 3889 __ * Edouard.Lauer AT bil-dexia DOT com __


> -----Original Message-----
> From: Kirsten Gloeer [SMTP:Kirsten.Gloeer AT RZ.UNI-KARLSRUHE DOT DE]
> Sent: Monday, June 21, 1999 12:13 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: ADSM multiple server crash
> 
> Hi,
> 
> It looks like a disk error on the disk where /dev/radsm_db2 resides.
> Is there an error message in the errorlog of your ADSM server machine?
> 
> Best regards, Kirsten
> 
> 
> According to Lauer Edouard:
> > Date:         Sat, 19 Jun 1999 22:48:35 +0200
> > From:         Lauer Edouard <Edouard.Lauer AT BIL-DEXIA DOT COM>
> > Subject:      ADSM multiple server crash
> > To:           ADSM-L AT VM.MARIST DOT EDU
> 
> > Hi,
> > 
> > 
> > On Friday night at 2:35am we had an ADSM server crash. What I mostly see
> > in the activity log is a seek and write error on database volume
> > /dev/radsm_db2.
> > The ADSM server is 3.1.2.20 running on AIX 4.2.1. Please see adsm_crash.txt
> > for more details of the error.
> > Afterwards I restarted the server and everything worked well until 6:45am.
> > The server then crashed again, and this time there was no way to restart it.
> > Because we had the server in roll-forward mode, we then began to restore it
> > from the last database backup available.
> > That sounds good, but because the restore re-applied the logs to the
> > database, the same error as in the two server crashes came back. On the
> > second try we restored the database without reapplying the logs, and
> > afterwards we succeeded in starting the database. Things went well, and
> > backups and restores could be done until 5pm, when the server crashed again
> > with the same errors.
> > At this point we began to think that there could be a problem with thread
> > management in version 3.1.2.20 of the ADSM server. We came to this conclusion
> > because, when we analyzed the core dump produced by the ADSM server crash,
> > we found the following line:
> > IOT/Abort trap in pthread_kill at 0xd03c1c6c ($t1769234249)
> > At this point we decided to install the oxford version (3.1.2.24), although
> > it is not officially supported by IBM. What did it matter, given where we
> > were... It was Friday 9pm and no backups or restores had been done.
> > After installing the new version we restarted the ADSM server and ran an
> > auditdb on it with fix=yes. At this time we disabled sessions so that nobody
> > else could connect to the ADSM server. Friday at 11pm I decided to go home
> > because I was really dead.
> > Today I came in and what I saw was terrifying. The server had crashed again.
> > I restarted it again so that some backups could be done, but at this point
> > I'm really out of explanations...
> > For the moment I'm trying the following points:
> > 
> >         1. Increasing the size of bufpoolsize & logpoolsize     - Status: Not working -> New crash
> >         2. Scratching all the db,log devices & recreating them  - Status: Open
> >         3. Downgrading to version 3.1.0.5                       - Status: Open
> > 
> > All I can say is that the last point is the horror scenario, because I work
> > in a bank and you can easily imagine how important it is to back up our data.
> > Currently we're only backing up NT and UNIX systems (over 80 servers).
> > The second point is that memory and database handling in version 3.1.2.20
> > have deteriorated a lot since version 3.1.0.5.
> > For comparison: we never had a problem with version 3.1.0.5. Since we
> > upgraded to 3.1.2.20 the problems have been accumulating...
> > 
> > Have a nice weekend  everyone,
> > _________________ Lauer Edouard ____________________
> > ______ Prod. informatique ____ Systèmes Ouverts ________
> > __ * +352 4590 3889 __ * Edouard.Lauer AT bil-dexia DOT com __
> > 
> > 
> > =============================================================
> > An electronic message is not binding on its sender.
> > Any message referring to a binding engagement must be confirmed in
> > writing and duly signed.
> > 
> > =============================================================
> 
