ADSM-L

Re: dsmserv process hung.

2006-03-04 00:09:21
From: Josh-Daniel Davis <xaminmo AT OMNITECH DOT NET>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 3 Mar 2006 22:59:02 -0600
This happens when 2 threads start to back up the system object, and the
second one starts sending data before the first one is able to create the
group leader, which is the anchor for management and expiration of the
entire system object as a single entity even though it's made of multiple
objects.

As a workaround, you can set resourceutil to 2 on all of your Windows
clients, do another backup of the system objects, and expire the old ones
(through policy changes or just by waiting).

The hang is related to the defect involving RESTORE STGVOL.  We had the
same problem; however, the RESTORE STGVOL process never actually made its
way into the process table.  I would initially be able to get in and HALT
dsmserv.  Officially, the defect indicated that if left to its own
devices, the lock condition would degrade to unreachability.

The fix is in 5.3.2.3.

HOWEVER, we upgraded to 5.3.2.3 and have had SERIOUS lock issues.

SHOW DEADLOCK doesn't show anything.  Actlog will periodically show a
swarm of errors about operations failing due to lock issues, similar to:

2006-02-26 13:00:18.000000      ANR2033E UPDATE STGPOOL: Command failed - lock conflict. (SESSION: 124639)
2006-02-26 13:00:18.000000      ANR2033E QUERY STGPOOL: Command failed - lock conflict. (SESSION: 124664)
2006-02-26 13:00:18.000000      ANR2033E QUERY DRMEDIA: Command failed - lock conflict. (SESSION: 124670)

and similar.
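
To gauge how big each swarm is, the ANR2033E lines can be tallied from a saved copy of the actlog. The following is only a rough sketch, assuming the timestamp layout shown above; the function name and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Matches the first physical line of an ANR2033E activity-log entry, e.g.
#   2006-02-26 13:00:18.000000      ANR2033E UPDATE STGPOOL: Command failed - ...
# Wrapped continuation lines do not match and are simply skipped.
LOCK_CONFLICT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S*\s+ANR2033E\s+(?P<cmd>[^:]+):"
)


def count_lock_conflicts(lines):
    """Return (counts per command, counts per second) for ANR2033E messages."""
    by_command = Counter()
    by_second = Counter()
    for line in lines:
        match = LOCK_CONFLICT.match(line.strip())
        if match:
            by_command[match.group("cmd").strip()] += 1
            by_second[match.group("ts")] += 1
    return by_command, by_second


# Sample taken from the excerpt above (line wrapping left as it was mailed).
SAMPLE = """\
2006-02-26 13:00:18.000000      ANR2033E UPDATE STGPOOL: Command failed -
lock conflict. (SESSION: 124639)
2006-02-26 13:00:18.000000      ANR2033E QUERY STGPOOL: Command failed -
lock conflict. (SESSION: 124664)
2006-02-26 13:00:18.000000      ANR2033E QUERY DRMEDIA: Command failed -
lock conflict. (SESSION: 124670)
""".splitlines()

by_command, by_second = count_lock_conflicts(SAMPLE)
for second, hits in sorted(by_second.items()):
    print(second, hits)
```

The per-second counts make it easy to see whether the conflicts cluster around a particular scheduled process.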

ALSO

MIGRATE STG will lock tables in such a way that Q STG will hang, but Q
PROC and Q SES work.  Client sessions will continue writing to whatever
volume they have; however, most new sessions will also hang.  Once the
offending process is killed, everything resumes.

ALSO

I've found that REPAIR STGVOL (a subprocess of RECLAIM STG) has been
showing up very often.

ALSO

Tonight, REPAIR STGVOL, 2 RECLAIM STG and one AUDIT LIC were all running
and had hung.  Unfortunately, I didn't capture the dbtxn, txn, lock, etc.
info before issuing HALT.
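
For next time, something like the following could grab that information before the HALT. This is a sketch only: the SHOW command names in the list are my assumption of what to collect (SHOW commands are undocumented and vary by server level), the admin id and password are placeholders, and each command gets a timeout because some queries hang while others still answer.

```python
import subprocess

# Assumed set of diagnostics to collect; adjust to what support asks for.
DIAG_COMMANDS = [
    "SHOW DEADLOCK",
    "SHOW LOCKS",
    "SHOW TXNT",
    "SHOW DBTXNT",
    "SHOW THREADS",
    "QUERY PROCESS",
    "QUERY SESSION",
]


def build_dsmadmc(command, admin_id, password):
    """Build the argument list for one administrative command-line call."""
    return ["dsmadmc", f"-id={admin_id}", f"-password={password}",
            "-dataonly=yes", command]


def collect_diagnostics(admin_id, password, timeout=60, runner=subprocess.run):
    """Run each diagnostic command and return {command: output or error note}.

    `runner` is injectable so the function can be exercised without a server.
    """
    results = {}
    for command in DIAG_COMMANDS:
        argv = build_dsmadmc(command, admin_id, password)
        try:
            done = runner(argv, capture_output=True, text=True, timeout=timeout)
            results[command] = done.stdout
        except subprocess.TimeoutExpired:
            # A hung query is itself useful information; note it and move on.
            results[command] = "<timed out>"
        except OSError as exc:
            results[command] = f"<could not run: {exc}>"
    return results
```

Writing the returned dictionary to a timestamped file before halting would leave something to hand to support afterwards.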

ALSO

dsmserv seems to chew up more CPU now than at 5.3.1.6 and 5.3.2.1;
however, I don't have quantitative measurements of the previous levels.

I'm not sure if this progression of locking issues is limited to us or is
a 5.3.2.3 problem; however, I'm very worried about the safety and
stability of TSM.


-Josh

On 06.03.03 at 14:51 peiferlt AT SONGS.SCE DOT COM wrote:

Date: Fri, 3 Mar 2006 14:51:52 -0800
From: Larry Peifer <peiferlt AT SONGS.SCE DOT COM>
Reply-To: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: dsmserv process hung.

We too have just started to have this problem in the last 4 days.  In our
case the symptoms and solutions seem to fit what's described in
IBM Document Ref #: PK00196.  However, that was supposed to have been fixed
in the 5.3.1 release, which we are using.  Can anyone shed more light on
what might be triggering this situation?
AIX 5.2 ML5
TSM 5.3.1.0

Here's a series of errors that cropped up this week for the first time.
Any insights would be helpful.

02/27/06   21:59:00      ANR9999D imgroup.c(1180): ThreadId<90> Error 8 retrieving
                         Backup Objects row for object 0.101495737 (SESSION: 2838)
02/27/06   21:59:00      ANR9999D ThreadId<90> issued message 9999 from:
                         <-0x000000010001bf74 outDiagf
                         <-0x00000001003fb114 imIsGroupLeader
                         <-0x0000000100396b9c SmNodeSession
                         <-0x000000010047f854 HandleNodeSession
                         <-0x0000000100485760 smExecuteSession
                         <-0x000000010051c3e4 SessionThread
                         <-0x000000010000e958 StartThread
                         <-0x0900000000286460 _pthread_body (SESSION: 2838)
02/27/06   21:59:00      ANR9999D smnode.c(7353): ThreadId<90> Session 2838:
                         Invalid Group Id 0,101495737 for ADD function (SESSION: 2838)
02/27/06   21:59:00      ANR9999D ThreadId<90> issued message 9999 from:
                         <-0x000000010001bf74 outDiagf
                         <-0x0000000100396bc4 SmNodeSession
                         <-0x000000010047f854 HandleNodeSession
                         <-0x0000000100485760 smExecuteSession
                         <-0x000000010051c3e4 SessionThread
                         <-0x000000010000e958 StartThread
                         <-0x0900000000286460 _pthread_body (SESSION: 2838)
02/28/06   23:24:55      ANR9999D lmlcaud.c(506): ThreadId<75> Error 17 checking
                         filespace data for license audit. (PROCESS: 72)
02/28/06   23:24:55      ANR9999D ThreadId<75> issued message 9999 from:
                         <-0x000000010001bf74 outDiagf
                         <-0x00000001006d8e70 LmLcAuditThread
                         <-0x000000010000e958 StartThread
                         <-0x0900000000286460 _pthread_body (PROCESS: 72)
03/01/06   11:20:55      ANR9999D lmlcaud.c(506): ThreadId<43> Error 17 checking
                         filespace data for license audit. (PROCESS: 79)
03/01/06   11:20:55      ANR9999D ThreadId<43> issued message 9999 from:
                         <-0x000000010001bf74 outDiagf
                         <-0x00000001006d8e70 LmLcAuditThread
                         <-0x000000010000e958 StartThread
                         <-0x0900000000286460 _pthread_body (PROCESS: 79)
03/03/06   03:41:10      ANR9999D lmlcaud.c(506): ThreadId<51> Error 17 checking
                         filespace data for license audit. (PROCESS: 29)
03/03/06   03:41:10      ANR9999D ThreadId<51> issued message 9999 from:
                         <-0x000000010001bf74 outDiagf
                         <-0x00000001006d8e70 LmLcAuditThread
                         <-0x000000010000e958 StartThread
                         <-0x0900000000286460 _pthread_body (PROCESS: 29)

In each case we need to halt and restart the TSM server to free up the
locks.  Finding slack time to do that is not always easy.





"Ochs, Duane" <Duane.Ochs AT QG DOT COM>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
01/30/2006 12:44 PM
Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>

To: ADSM-L AT VM.MARIST DOT EDU
cc:
Subject: [ADSM-L] dsmserv process hung.

AIX 5.3
TSM 5.3.1.2
This weekend one of my three TSM servers had the DSMSERV process hang.
The machine was accessible and the DSMSERV process still existed. It was
still accepting connections but not talking to them. In turn, our
cross-server backups and volume reconciliation from the other 2 TSM
servers hung. One server ended up crashing due to a full recovery log. The
other was near that same point. Looks like the root cause was a full
recovery log on the hung server.

I monitor to see if DSMSERV exists, and I monitor for backup and archive
failures. I use operational reporting to give me additional information
for clients. I even monitor to make sure the client scheduler is running
and communicating.

Does anybody have a method in place or an idea to monitor if the TSM
server is actually capable of communication?
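
One idea, sketched here under assumptions: have the monitor run a trivial administrative query with a hard timeout, and treat either a timeout or a nonzero exit as "not communicating". A server that accepts the connection but never answers would then show up as a timeout. The admin id "monitor" and the password below are placeholders, and the function name is made up.

```python
import subprocess


def server_responds(argv, timeout=30):
    """Return True only if `argv` exits with status 0 within `timeout` seconds."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out (server accepted the session but went silent) or the
        # command could not be started at all.
        return False
    return done.returncode == 0


# Hypothetical probe: a cheap query through the administrative client.
PROBE = ["dsmadmc", "-id=monitor", "-password=CHANGEME",
         "-dataonly=yes", "query status"]

# Example use from a cron-driven check (not executed here):
#     if not server_responds(PROBE, timeout=60):
#         raise SystemExit("TSM server is not answering")
```

Since a hung query may still leave other queries working, it might be worth probing with more than one command, for example a storage pool query as well.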

Duane Ochs
Information Systems - Enterprise Computing
Quad/Graphics Inc.
Sussex, Wisconsin
414-566-2375 phone
414-566-4010 pin# 2375 beeper
Duane.Ochs AT qg DOT com
www.QG.com

