ADSM-L

Re: ADSM multiple server crash

1999-06-21 06:13:08
Subject: Re: ADSM multiple server crash
From: Kirsten Gloeer <Kirsten.Gloeer AT RZ.UNI-KARLSRUHE DOT DE>
Date: Mon, 21 Jun 1999 12:13:08 +0200
Hi,

This looks like an error on the disk where /dev/radsm_db2 resides.
Is there an error message in the error log of your ADSM server machine?
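
One way to check this is to scan the AIX error report for hardware-class entries. The following is only a minimal sketch: it assumes the usual summary layout printed by `errpt` (columns IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION). The function name, the sample entries, and the disk name hdisk3 are all invented for illustration, since the thread does not say which physical disk holds /dev/radsm_db2.

```python
# Hypothetical sample in the shape of `errpt` summary output.
# Identifiers, timestamps and resource names are made up for this sketch.
SAMPLE_ERRPT = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A1B2C3D4   0619023599 P H hdisk3         DISK OPERATION ERROR
E5F6A7B8   0619010099 T S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
"""


def hardware_errors(errpt_summary, resource=None):
    """Return (timestamp, resource, description) for hardware-class entries.

    errpt_summary: text in the summary layout shown above.
    resource: optional resource name (e.g. a disk) to restrict the result to.
    """
    hits = []
    for line in errpt_summary.splitlines():
        # Split into at most six fields so the description keeps its spaces.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip blank, short, and header lines
        _ident, stamp, _err_type, err_class, res_name, desc = fields
        if err_class != "H":
            continue  # keep only class H (hardware) entries
        if resource is not None and res_name != resource:
            continue
        hits.append((stamp, res_name, desc.strip()))
    return hits


if __name__ == "__main__":
    for stamp, res_name, desc in hardware_errors(SAMPLE_ERRPT):
        print(stamp, res_name, desc)
```

On a real system the text would come from the saved output of `errpt` rather than from the sample string, and the timestamps (MMDDhhmmYY) could then be compared with the crash times in the activity log.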

Best regards, Kirsten


According to Lauer Edouard:
> From owner-adsm-l AT VM.MARIST DOT EDU Sat Jun 19 23:37:39 1999
> Envelope-to: Kirsten.Gloeer AT RZ.UNI-KARLSRUHE DOT DE
> Delivery-date: Sat, 19 Jun 1999 23:37:39 +0200
> X-Server-Uuid: 67dfceb6-1339-11d2-9e77-00a0c9a3c45a
> X-Mailer: Internet Mail Service (5.5.2448.0)
> X-WSS-ID: 1B72DEC435693-01-02
> Message-ID:  <6967F2B02313D211B3900000F87A853E0261610B AT exchang1.bil DOT lu>
> Date:         Sat, 19 Jun 1999 22:48:35 +0200
> Reply-To:     "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> Sender:       "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
> From:         Lauer Edouard <Edouard.Lauer AT BIL-DEXIA DOT COM>
> Subject:      ADSM multiple server crash
> To:           ADSM-L AT VM.MARIST DOT EDU

> Hi,
> 
> 
> On Friday night at 2:35am we had an ADSM server crash. Mostly what I've seen
> in the activity log is a seek and write error on database volume
> /dev/radsm_db2.
> The ADSM server is 3.1.2.20 running on AIX 4.2.1. Please see adsm_crash.txt
> for more details of the error.
> Afterwards I restarted the server and everything worked well until 6:45am.
> The server then crashed again and this time there was no way to restart it.
> Because we had the server in roll-forward mode, we then began to restore it
> from the last database backup available.
> That sounds good, but the problem is that because the restore re-applied the
> logs to the database, the same error as in the two server crashes came back.
> On the second try we restored the database without reapplying the logs, and
> afterwards we succeeded in starting the database.
> The situation was fine and backups and restores could be done until 5pm,
> when the server crashed again with the same errors.
> At this point we began to think that there could be a problem with thread
> management in version 3.1.2.20 of the ADSM server. We came to this
> conclusion because, when we analyzed the core dump from the ADSM server
> crash, we found the following line:
> IOT/Abort trap in pthread_kill at 0xd03c1c6c ($t1769234249)
> At this point we decided to install the Oxford version (3.1.2.24), although
> it is not officially supported by IBM. What did it matter, at the point we
> were at... It was Friday 9pm and no backups or restores had been done.
> After installing the new version we restarted the ADSM server and ran an
> auditdb on it with fix=yes. At that time we had disabled sessions so that
> nobody else could connect to the ADSM server. At 11pm on Friday I decided
> to go home because I was really dead tired.
> Today I came in and what I saw was terrifying. The server had crashed again.
> I restarted it again so that some backups could be done, but at this point
> I'm really out of explanations...
> For the moment I'm trying the following points:
> 
>         1. Increasing the size of bufpoolsize & logpoolsize     - Status: Not working -> New crash
>         2. Scratching all the db,log devices & recreating them  - Status: Open
>         3. Downgrading to version 3.1.0.5                       - Status: Open
> 
> All I can say is that the last point is the horror scenario, because I work
> in a bank and you can easily imagine how important it is to back up our data.
> Currently we're only backing up NT and UNIX systems (over 80 servers).
> The second point is that memory and database handling in version 3.1.2.20
> have deteriorated a lot since version 3.1.0.5.
> For comparison: we never had a problem with version 3.1.0.5. Since we
> upgraded to 3.1.2.20 the problems have been accumulating...
> 
> Have a nice weekend  everyone,
> _________________ Lauer Edouard ____________________
> ______ Prod. informatique ____ Systèmes Ouverts ________
> __ * +352 4590 3889 __ * Edouard.Lauer AT bil-dexia DOT com __
> 
> 
> An electronic message is not binding on its sender.
> Any message referring to a binding engagement must be confirmed in writing 
> and duly signed.
> 