ADSM-L

Problems with COMMMETHOD SHMEM and hanging processes, etc.

1999-07-20 11:54:14
Subject: Problems with COMMMETHOD SHMEM and hanging processes, etc.
From: John Schneider <jdschn AT IBM DOT NET>
Date: Tue, 20 Jul 1999 10:54:14 -0500
Greetings,
    We are running ADSM V3.1.2.20 under AIX 4.3.2 with
fairly recent AIX maintenance.
    We have had three different weird symptoms all related to use of
shared memory.  The first one was a 'backup stgpool' command that
would always hang on the same file on a certain tape.  We had two
tapes that had this same symptom, so I don't think it was just a bad
tape.  There was no error message, and it was a relatively small file
(around 52MB).  The 'backup stgpool' would just hang and never
complete.
    The second symptom was related, in that you could not cancel
the process once it got into that state.  The 'cancel process' command
would be accepted, but the process would live forever.
    This has happened several times over a period of days, and each
time a restart of the ADSM server was necessary to get us back in
business.
    OK, fine, the only solution is to restart the ADSM server.  So one
day we took down the ADSM server and attempted to restart it, but the
server would not start properly.  Once it was started you could not
connect to it.  When we started the server in the foreground so we
could see the messages, the server was getting:

ANR8208W TCP/IP driver unable to initialize due to
error in binding to Port 1500, reason code 67.
ANR8191W HTTP driver unable to initialize due to
error in binding to Port 1580, reason code 67.
ANR8295W Shared memory driver unable to initialize due to
error in binding to Port 1510, reason code 67.

For some reason none of the communications drivers were initializing
properly, so the server couldn't talk to anything.  We went to Level I,
whose solution was to reboot the machine.  NO CAN DO, not on
this production server!  We made sure all server processes and all
client processes on the system were killed, so nothing could be
holding that port open, but nothing worked.  We escalated to Level II,
who was not much help either.  They could not even come up with
anything we should do to gather information to solve it.
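
For anyone who hits the same ANR8208W/ANR8191W/ANR8295W messages, here is
a minimal sketch of one way to gather some information while the server
is down: try to bind each port yourself and see what the operating system
says.  It rests on assumptions not confirmed in this post: that "reason
code 67" is the AIX errno value (which, as far as I know, is EADDRINUSE,
"address already in use"), and that all three drivers bind ordinary TCP
ports.  The names probe_port and report are made up for this
illustration.  errno numbers differ between platforms, so the code
compares against the symbolic name rather than the number 67.

```python
import errno
import socket

# Ports taken from the ANR8208W / ANR8191W / ANR8295W messages above.
ADSM_PORTS = {"TCP/IP": 1500, "HTTP": 1580, "shared memory": 1510}


def probe_port(port, host=""):
    """Try to bind a TCP socket to (host, port).

    Returns None if the bind succeeds, otherwise the errno of the failure.
    An empty host means "all interfaces" (assumed to match the server).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


def report(ports=ADSM_PORTS, host=""):
    """Return one human-readable status line per driver."""
    lines = []
    for name, port in ports.items():
        err = probe_port(port, host)
        if err is None:
            status = "free"
        elif err == errno.EADDRINUSE:
            status = "in use (EADDRINUSE)"
        else:
            status = "bind failed: " + errno.errorcode.get(err, str(err))
        lines.append(f"{name} port {port}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run it with the server stopped.  If a port still shows as in use after
every server and client process has been killed, something else is
holding it.  If all three show as free, the bind failure is probably
not a plain port conflict.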

But while I was waiting for another callback from Level II, just on a
hunch I took Commmethod shmem out of the dsmserv.opt file and
corresponding lines out of the dsm.sys file, and restarted the server.
The reason I tried this is because we just started using shared memory
a couple months ago, so it was a fairly recent change, and because
it was mentioned in the error messages.
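
The change described above can be sketched as a small script that
comments out the shared memory lines instead of deleting them, so the
change is easy to back out.  This is an illustration only.  The function
name disable_shared_memory is hypothetical, and the sample option lines
(including the SHAREDMEM spelling, TCPPORT, and SHMPORT) and the
leading "*" as the comment marker are assumptions; check them against
your own dsmserv.opt and dsm.sys before relying on it.

```python
def disable_shared_memory(options_text):
    """Comment out shared-memory options in the text of an options file.

    Assumes a line starting with '*' is a comment and that the shared
    memory method is spelled SHMEM or SHAREDMEM (both are assumptions).
    """
    out = []
    for line in options_text.splitlines():
        words = line.split()
        option = words[0].upper() if words else ""
        value = words[1].upper() if len(words) > 1 else ""
        is_shmem_method = (
            option == "COMMMETHOD" and value in ("SHMEM", "SHAREDMEM")
        )
        if is_shmem_method or option == "SHMPORT":
            out.append("* " + line)   # keep the old line, but disabled
        else:
            out.append(line)          # everything else is untouched
    return "\n".join(out)


# Hypothetical sample content, not copied from a real server.
sample = """COMMMETHOD TCPIP
TCPPORT 1500
COMMMETHOD SHMEM
SHMPORT 1510"""

print(disable_shared_memory(sample))
```

The server has to be restarted afterwards for the change to take
effect, as it was in the post.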

The server came right up!  For some reason shared memory was keeping
the TCP/IP and HTTP drivers, as well as the shared memory one,
from initializing.  What is going on with that?

Not only that, it cured the other problems too!  The 'backup
stgpool' worked fine, and copied the offending tapes without
a hitch.

I spoke to Level II about this, and they said yes, they could see how
shared memory could cause these symptoms, but their solution was
not to use shared memory if we have problems with it!  I am not
happy with this solution.  I think we have a defect.

Has anyone else seen this?  Any suggestions?

John Schneider

***********************************************************************
* John D. Schneider       Email: jdschn AT ibm DOT net
* Lowery Systems, Inc.    Phone: 314-349-4556
* 1329 Horan              Disclaimer: Opinions expressed here are
* Fenton, MO 63026                    mine and mine alone.
***********************************************************************