ADSM-L

Re: dsmserv process hung.

2006-03-03 17:53:16
From: Larry Peifer <peiferlt AT SONGS.SCE DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 3 Mar 2006 14:51:52 -0800
We too started seeing this problem within the last 4 days.  In our case
the symptoms and solutions seem to match what's described in
IBM Document Ref #: PK00196.  However, that was supposed to have been
fixed in the 5.3.1 release, which we are using.  Can anyone shed more
light on what might be triggering this situation?
AIX 5.2 ML5
TSM 5.3.1.0

Here's a series of errors that cropped up this week for the first time.
Any insights would be helpful.

02/27/06   21:59:00      ANR9999D imgroup.c(1180): ThreadId<90> Error 8 retrieving
                          Backup Objects row for object 0.101495737
                          (SESSION: 2838)
02/27/06   21:59:00      ANR9999D ThreadId<90> issued message 9999 from:
                          <-0x000000010001bf74 outDiagf
                          <-0x00000001003fb114 imIsGroupLeader
                          <-0x0000000100396b9c SmNodeSession
                          <-0x000000010047f854 HandleNodeSession
                          <-0x0000000100485760 smExecuteSession
                          <-0x000000010051c3e4 SessionThread
                          <-0x000000010000e958 StartThread
                          <-0x0900000000286460 _pthread_body
                          (SESSION: 2838)
02/27/06   21:59:00      ANR9999D smnode.c(7353): ThreadId<90> Session 2838:
                          Invalid Group Id 0,101495737 for ADD function
                          (SESSION: 2838)
02/27/06   21:59:00      ANR9999D ThreadId<90> issued message 9999 from:
                          <-0x000000010001bf74 outDiagf
                          <-0x0000000100396bc4 SmNodeSession
                          <-0x000000010047f854 HandleNodeSession
                          <-0x0000000100485760 smExecuteSession
                          <-0x000000010051c3e4 SessionThread
                          <-0x000000010000e958 StartThread
                          <-0x0900000000286460 _pthread_body
                          (SESSION: 2838)
02/28/06   23:24:55      ANR9999D lmlcaud.c(506): ThreadId<75> Error 17 checking
                          filespace data for license audit. (PROCESS: 72)
02/28/06   23:24:55      ANR9999D ThreadId<75> issued message 9999 from:
                          <-0x000000010001bf74 outDiagf
                          <-0x00000001006d8e70 LmLcAuditThread
                          <-0x000000010000e958 StartThread
                          <-0x0900000000286460 _pthread_body
                          (PROCESS: 72)
03/01/06   11:20:55      ANR9999D lmlcaud.c(506): ThreadId<43> Error 17 checking
                          filespace data for license audit. (PROCESS: 79)
03/01/06   11:20:55      ANR9999D ThreadId<43> issued message 9999 from:
                          <-0x000000010001bf74 outDiagf
                          <-0x00000001006d8e70 LmLcAuditThread
                          <-0x000000010000e958 StartThread
                          <-0x0900000000286460 _pthread_body
                          (PROCESS: 79)
03/03/06   03:41:10      ANR9999D lmlcaud.c(506): ThreadId<51> Error 17 checking
                          filespace data for license audit. (PROCESS: 29)
03/03/06   03:41:10      ANR9999D ThreadId<51> issued message 9999 from:
                          <-0x000000010001bf74 outDiagf
                          <-0x00000001006d8e70 LmLcAuditThread
                          <-0x000000010000e958 StartThread
                          <-0x0900000000286460 _pthread_body
                          (PROCESS: 29)

In each case we need to halt and restart the TSM server to free up the
locks.  Finding slack time to do that is not always easy.





"Ochs, Duane" <Duane.Ochs AT QG DOT COM>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
01/30/2006 12:44 PM
Please respond to: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
cc:
Subject: [ADSM-L] dsmserv process hung.

AIX 5.3
TSM 5.3.1.2
This weekend one of my three TSM servers had the DSMSERV process hang.
The machine was accessible and the DSMSERV process still existed. It was
still accepting connections but not responding to them. As a result, our
cross-server backups and volume reconciliation from the other 2 TSM
servers hung. One server ended up crashing due to a full recovery log.
The other was near that same point. It looks like the root cause was a
full recovery log on the hung server.

I monitor whether DSMSERV exists, and I monitor for backup and archive
failures. I use operational reporting to give me additional information
for clients. I even monitor to make sure the client scheduler is running
and communicating.

Does anybody have a method in place or an idea to monitor if the TSM
server is actually capable of communication?
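
One possible approach, sketched here only as an illustration and not as
a method anyone in this thread has confirmed: since the hung server still
accepted connections, a plain port check would not catch it, so the probe
has to run a real administrative query and treat a missing answer as a
hang. The sketch below wraps a command in a timeout. The function name
probe_server, the 60 second timeout, and the admin ID and password are
all assumptions; the command shown assumes dsmadmc is on the PATH and
that "query status" is an acceptable lightweight query for your site.

```python
import subprocess


def probe_server(cmd, timeout_s=60):
    """Run a trivial admin query and classify the outcome.

    Returns "ok" if the command exits with status 0 in time, "hung" if it
    does not finish within timeout_s seconds, and "failed" otherwise
    (non-zero exit status, or the command could not be started).
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait on an interactive prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,          # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical monitoring account; substitute your own ID and password.
    cmd = ["dsmadmc", "-id=monitor", "-password=secret", "query", "status"]
    print(probe_server(cmd, timeout_s=60))
```

Run from cron on a separate machine, a "hung" result would be the signal
to alert, because it separates "process exists and accepts connections"
from "server actually answers".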

Duane Ochs
Information Systems - Enterprise Computing
Quad/Graphics Inc.
Sussex, Wisconsin
414-566-2375 phone
414-566-4010 pin# 2375 beeper
Duane.Ochs AT qg DOT com
www.QG.com
