ADSM-L

Re: AIX client dumping core during backup

2001-03-15 13:50:25
Subject: Re: AIX client dumping core during backup
From: bbullock <bbullock AT MICRON DOT COM>
Date: Thu, 15 Mar 2001 11:50:56 -0700
        Alright,
        One of my intrepid coworkers opened a case with Tivoli about this
error. I'll attach the e-mail from them, but in a nutshell:
This is an error only being seen on AIX clients.
"This problem has been observed at 4.1.1 and 3.7.2.X levels including
patch level 3.7.2.15."
There is an open APAR IC28528, since Oct 25th.
Their solution is to "Run backup manually or use cron to schedule the
backup."
They are pointing the finger at "NIS+" & indeed, the hosts we see this
error on are NIS+ clients.
They recommend that certain NIS+ filesets be at certain levels. We are a
little behind on "bos.net.nisplus 4.3.3.27" and "bos.rte.libc 4.3.3.27", so
we will update those.
If this does not resolve the issue, we are to call them back and get on a
list of customers that are still up the creek.
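
For anyone wanting to check several hosts against the fileset levels listed in the attached message, here is a rough sketch of the comparison. It assumes the listed levels are minimums, which the message does not state outright, and it assumes you have already collected the installed levels into a dictionary (on AIX they would normally come from something like `lslpp -L`). The function names and the sample data are made up for illustration.

```python
# Levels quoted in the support message for APAR IC28528.
# Assumption: these are treated as minimum levels.
RECOMMENDED = {
    "bos.net.nisplus": "4.3.3.27",
    "bos.net.nis.client": "4.3.3.25",
    "bos.net.nis.server": "4.3.3.25",
    "bos.rte.libc": "4.3.3.27",
}


def parse_level(level):
    """Turn a dotted level such as '4.3.3.27' into a tuple of ints."""
    return tuple(int(part) for part in level.split("."))


def filesets_behind(installed, recommended=RECOMMENDED):
    """Return {fileset: (installed, recommended)} for filesets below the level.

    Filesets that are not installed at all are skipped.
    """
    behind = {}
    for name, wanted in recommended.items():
        have = installed.get(name)
        if have is not None and parse_level(have) < parse_level(wanted):
            behind[name] = (have, wanted)
    return behind


if __name__ == "__main__":
    # Hypothetical installed levels, for illustration only.
    sample = {
        "bos.net.nisplus": "4.3.3.25",
        "bos.net.nis.client": "4.3.3.25",
        "bos.rte.libc": "4.3.3.26",
    }
    for name, (have, wanted) in sorted(filesets_behind(sample).items()):
        print(f"{name}: installed {have}, recommended {wanted}")
```

Comparing the levels as tuples of integers rather than as strings avoids ranking 4.3.3.9 above 4.3.3.27.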

Here is the text of the message if you desire to read it in full.

-----Original Message-----
APAR= IC28528  SER=                            IN INCORROUT
SCHEDULED CLIENT BACKUPS FAIL WITH B/A TXN CONSUMER THREAD,
FATAL ERROR, SIGNAL 11
STAT= OPEN         FESN0907344-     CTID= SJ0291 ISEV= 2
SB00/10/25  RC00/10/25  CL          PD           SEV= 2

ERROR DESCRIPTION:
Scheduled backups fail with B/A Txn Consumer thread, fatal error,
signal 11 - manual backups work. Dsmerror.log may have entries
similar to the following:
08/15/00   01:58:04 B/A Txn Consumer thread, fatal error, signal
08/15/00   01:58:04   0xD01D04E8 shadow_pass_r
08/15/00   01:58:04   0xD01D0210 shadow_chk_r
08/15/00   01:58:04   0xD01D1B60 _getpwuid_shadow_r
08/15/00   01:58:04   0xD01D4A7C getpwuid
08/15/00   01:58:04   0x1003277C *UNKNOWN*
Trace may show entries similar to:
signal.pthread_kill(??, ??) at 0xd0088f9c
signal._p_raise(??) at 0xd008846c
raise.raise(??) at 0xd01785ec
abort.abort() at 0xd0171d30
psunxthr.psAbort() at 0x100162a8
psunxthr.psTrapHandler() at 0x100163b4
getpwent.shadow_pass_r(??, ??) at 0xd01d04e8
getpwent.shadow_chk_r(??, ??) at 0xd01d020c
getpwent._getpwuid_shadow_r(??, ??, ??, ??, ??) at 0xd01d1b5c
getpwent.getpwuid(??) at 0xd01d4a78
pssec.GetIDFromOS() at 0x10032778
pssec.GetId() at 0x100329a8
pssec.idObjGetName() at 0x10032a78
senddata.sdSendObj() at 0x100dc8e0
txncon.sendIt() at 0x100d79fc
txncon.PFtxnListLoop() at 0x100d7fa8
txncon.PrivFlush2() at 0x100d8968
txncon.PrivFlush() at 0x100d958c
txncon.tlSend() at 0x100d9c64
bacontrl.HandleQueue__14DccTxnConsumerFv() at 0x100d66fc
bacontrl.Run__14DccTxnConsumerFPv() at 0x100d62a0
bacontrl.DoThread__14DccTxnConsumerFPv() at 0x100d5d24
thrdmgr.startThread() at 0x10018af4
pthread._pthread_body(??) at 0xd007c358
This problem has been observed at 4.1.1 and 3.7.2.X levels
including patch level 3.7.2.15.
LOCAL FIX:
Run backup manually or use cron to schedule the backup.

We need to inform Jason that APAR IC28528 is an AIX problem and the patches
for this APAR are as follows:
                bos.net.nisplus 4.3.3.27
                bos.net.nis.client 4.3.3.25
                bos.net.nis.server 4.3.3.25
                bos.rte.libc 4.3.3.27

After initial investigation, it could be due to the use of TSM in conjunction
 with NIS or could be related to APAR IC26906.
 First try to apply the fix IP22085 (-> install TSM client
 code 4.1.1), which corrects this issue.
 This code can be downloaded at
 ftp://index.storsys.ibm.com/tivoli-storage-management/maintenance/client/v4r1/AIX/v411/
 Here is the APAR abstract:
  ERROR DESCRIPTION:
  When using TSM 3.7.0, 3.7.1 or 3.7.2 client on a UNIX platform
  and backing up a large amount of data and directories, the
  client may terminate processing and produce an application
  core dump.
 .
  Screen output may include a (Signal/ 6) error.
 The dsmerror.log will show:
 B/A Txn Consumer thread, fatal error, signal 11
 and subsequent hexadecimal error locations.
 Dsmsched.log will not provide any additional info. Problem
 only occurs on backups and not archives.
 Problem occurs more frequently and earlier in processing when
 higher RESOURCEUTILIZATION values are used and/or
 MEMORYEFFICIENT is turned on.
 .
 If it doesn't resolve your problem, this could be related to NIS.
 I guess that NIS is implemented in your environment?
 From the provided info, a signal 11 (segmentation violation / core dump)
 is probably triggered when TSM tries to read invalid memory addresses. It
 starts when TSM makes an operating system call, getpwuid().
 On a normal (non-NIS) AIX host, the user names and passwords are stored
 in the /etc/passwd file on the local machine. On an NIS client this is
 not the case: the user information is stored on the NIS master / slave.
 In any case, getpwuid() is an AIX operating system function that
 accesses the basic user information in the user database and returns a
 "passwd" structure. The AIX documentation for this function lists the
 following fields for the "passwd" structure and includes the following
 notes, both of which seem relevant in our case:
 pw_name Contains the name of the user name.
 pw_passwd Contains the user's encrypted password.
 Note: If the password is not stored in the /etc/passwd file and the
 invoker does not have access to the shadow file that contains passwords,
 this field contains an undecryptable string, usually an * (asterisk).
 pw_uid Contains the user's ID.
 pw_gid Identifies the user's principal group ID.
 pw_gecos Contains general user information.
 pw_dir Identifies the user's home directory.
 pw_shell Identifies the user's login shell.
  Note: If Network Information Services (NIS) is enabled on the
     system, these subroutines attempt to retrieve the information from
     the NIS authentication server (master or slave)
     before attempting to retrieve the information locally.
 .
 Based on this information, I think the problem lies in some
 NIS/AIX configuration or fine tuning.  From the above documentation, how
 the OS returns the user information to TSM should be transparent to TSM.
 However, since the OS is not returning the appropriate information to
 TSM, you get a core dump. My suggestion to you would be
 to involve AIX support to find out why this function is failing / not
 returning the appropriate information.
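
 To make the lookup described above concrete, here is a minimal sketch of
 a UID-to-name lookup with a fallback for the case where no entry is
 found. It is not TSM's code. It uses Python's standard `pwd` module,
 which calls the C library's getpwuid() underneath; the function name
 `lookup_user` and the `lookup` parameter are invented for illustration.
 A guard like this only covers a lookup that reports "not found"; it
 would not stop a signal 11 raised inside the C library itself, which is
 what the traces above show.

```python
import os
import pwd


def lookup_user(uid, lookup=pwd.getpwuid):
    """Return basic passwd fields for uid, or a numeric fallback.

    `lookup` defaults to pwd.getpwuid; it is a parameter only so the
    "not found" path can be exercised without a real missing user.
    """
    try:
        entry = lookup(uid)
    except KeyError:
        # No entry in any configured source (local files, NIS, ...).
        return {"name": str(uid), "uid": uid, "found": False}
    return {
        "name": entry.pw_name,
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "home": entry.pw_dir,
        "shell": entry.pw_shell,
        "found": True,
    }


if __name__ == "__main__":
    # Look up the user running this script.
    print(lookup_user(os.getuid()))
```

 Where the entry comes from (local /etc/passwd or an NIS server) is
 decided by the operating system, so the caller sees the same fields
 either way.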
Michael Donkor
 IBM/ Adsm Support
Roanoke, TX 76262
email:donkorm AT us.ibm DOT com

"Style is a prisoner of neither vocation nor location" Unknown
____________________________________

Ben Bullock
UNIX Systems Manager
Micron Technology Inc.