ANR9999D errors and TSM server crashing during file restore

tom.s

Folks,



any help on the following would be appreciated...



I am running TSM 5.2.3.0 on Win2003. Recently TSM has been swamped by ANR9999D messages:

********************************************************

ANR9999D ssalloc.c(1378): ThreadId<62> Error locating storage pool -6. Callchain: 104EAD79 outTextf()+1529 <- 10479E94 tsmInitializeServer()+472664 <- (SESSION: 17421, PROCESS: 343)

********************************************************



There would be around 7000 of them in the log every time expiration was run. Incidentally, the storage pool mentioned in the error does not exist, and I assume from the negative number that it was a copy pool. A couple of days after this message appeared, the server began to crash during certain client restore operations: particular nodes, and particular parts of the directory structure. If I restore the files without the directories they come back OK. I suspect other things have caused it to crash as well, but I haven't managed to isolate what.
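With thousands of these in the log, it can help to see which server source locations they come from. The following is a minimal sketch, assuming the activity log has been saved as text and any wrapped lines rejoined. The function name and the regular expression are illustrative and based only on the message format quoted in this thread, not on any TSM tooling.

```python
import re
from collections import Counter

# Pattern for the "ANR9999D <source file>(<line>): ThreadId<n>" prefix,
# inferred only from the message shape quoted in this thread (an assumption).
ANR9999D_RE = re.compile(
    r"ANR9999D\s+(?P<file>[\w.]+)\((?P<line>\d+)\):\s+ThreadId<(?P<thread>\d+)>"
)


def tally_anr9999d(lines):
    """Count ANR9999D messages per (source file, line number)."""
    counts = Counter()
    for text in lines:
        match = ANR9999D_RE.search(text)
        if match:
            counts[(match.group("file"), int(match.group("line")))] += 1
    return counts


# Sample input: the message from this post with its wrapped lines rejoined,
# plus one line that should be ignored.
sample = [
    "ANR9999D ssalloc.c(1378): ThreadId<62> Error locating storage pool -6. "
    "Callchain: 104EAD79 outTextf()+1529 <- 10479E94 "
    "tsmInitializeServer()+472664 <- (SESSION: 17421, PROCESS: 343)",
    "an unrelated line that should be ignored",
]

for (source_file, line_no), count in tally_anr9999d(sample).most_common():
    print(f"{source_file}({line_no}): {count}")
```

Grouping by source file and line makes it easier to tell whether the flood is one repeated condition or several distinct ones.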



A DSMSERV AUDIT DB FIX=YES DETAIL=YES found a lot of stuff wrong and fixed it; however, the server still crashes on the same client restore.



The Windows event log contains the following:

********************************************************

Event Type: Error

Event Source: ADSMServer

Event Category: None

Event ID: 27

Date: 27/04/2005

Time: 13:34:48

User: N/A

Computer: XIGBRS83

Description:

TSM Server Diagnostic: ANR9999D: ADSM Exception Information: file = pkthread.c, line = 2253,Code = c0000005, Address = 7C34FEDC

Attempt to read data at address 30~

********************************************************
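For reference, the exception code c0000005 in the event above is the Windows NTSTATUS value for an access violation, and a read at an address as low as 30 is commonly what a dereference through a null pointer looks like, though that is only a general pattern and not a diagnosis of this crash. Here is a minimal sketch for pulling the fields out of the description text, assuming the format shown above. The function name and the small lookup table are illustrative.

```python
import re

# A few well-known NTSTATUS exception codes (not an exhaustive list).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

# Field layout inferred from the event description quoted in this thread
# (an assumption, not a documented format).
EXCEPTION_RE = re.compile(
    r"file\s*=\s*(?P<file>[\w.]+),\s*line\s*=\s*(?P<line>\d+),"
    r"\s*Code\s*=\s*(?P<code>[0-9A-Fa-f]+),\s*Address\s*=\s*(?P<address>[0-9A-Fa-f]+)"
)


def parse_exception_info(description):
    """Return the parsed exception fields, or None if the text does not match."""
    # Collapse line wraps and repeated spaces before matching.
    normalized = " ".join(description.split())
    match = EXCEPTION_RE.search(normalized)
    if match is None:
        return None
    code = int(match.group("code"), 16)
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "code": code,
        "code_name": NTSTATUS_NAMES.get(code, "unknown"),
        "address": int(match.group("address"), 16),
    }


description = (
    "TSM Server Diagnostic: ANR9999D: ADSM Exception Information: "
    "file = pkthread.c, line = 2253,Code = c0000005, Address = 7C34FEDC"
)
info = parse_exception_info(description)
print(info["file"], info["line"], info["code_name"], hex(info["address"]))
```

Comparing the parsed file, line and address across crashes is a quick way to confirm whether each crash hits the same spot in the server.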



Finally, though it doesn't seem to be causing the crash, the following ANR9999 message has appeared in the actlog since the audit:

********************************************************

ANR9999D smtrans.c(1397): ThreadId<53> Object header size mismatch, replacing 377 with 375. Callchain: 104EAD79 outTextf()+1529 <- 103F293C tsmInitializeServer()+3EB10C <- 43534544 Unknown <- (SESSION: 4)

********************************************************



I have raised this with support... but so far so frustrating!



If anybody has seen anything like this, I'd be grateful to hear from you.



Also, am I right in thinking that an UNLOAD/LOAD can solve database corruption issues that an AUDIT can't?



Thanks in advance!



Tom
 
OK. Has anybody encountered TSM DB errors that a dump, load, audit didn't fix? The server keeps nose diving with an ANR9999 during particular client restores.



If so, any thoughts on a resolution other than exporting ALL of the nodes to another server?



:sad:
 
When did this start?



Did you call support?



ANR9999D is a generic error TSM throws when it encounters an error that it can't identify. :sad:



Do all your stg pools look ok?



Are there any other errors in the actlog surrounding this error that can give you clues?



What processes are running when this happens? Is it totally random?



We need more info.
 
Hi,



the ANR9999 no longer appears in the TSM actlog, only in the Windows event log. There is one client where restoring files alone works fine, but restoring files together with the directory structure makes the TSM server appear to suffer a memory leak and go down, leaving the following in the event log:



Event Type: Error

Event Source: ADSMServer

Event Category: None

Event ID: 27

Date: 03/05/2005

Time: 09:37:18

User: N/A

Computer: XIGBRS83

Description:

TSM Server Diagnostic: ANR9999D: ADSM Exception Information: file = pkthread.c, line = 2253,Code = c0000005, Address = 7C34FEDC

Attempt to read data at address 30~



The restore is into a different path, so no problem with permissions... when the TSM server comes back up the client reopens the session and the restore completes!



There have been several server crashes in the early hours of the morning while reclamation, backup stg etc. have been running. It looks like the same kind of crash, but I can't pin down what is triggering it as precisely as the restore case above.



This originally started with an issue on the 5.2.3.0 server level. If you delete a storage pool in the wrong way, you are left with pointers in the TSM DB still referring to it. A dump/load/audit seems to have fixed that, but the server still crashes.



The problem has been going on a little while now; with all the repeated audits it has dragged out to a couple of weeks. I've been in touch with IBM support and so far they've been great, but I still have this crashing server. The storage pools now look OK. Expiration doesn't complain, and there aren't any suspicious looking errors in the actlog, not even at the time of the crash.



The feeling with IBM is that if a dump/load didn't fix it then it probably isn't the database, hence I'm upgrading all the firmware on the box and thinking about reinstalling TSM. But the way it crashes every time on the same restore makes me lean towards the database.



Thanks for the interest by the way.



:)
 
Tom,

I ran into a problem a few months ago where the client could cause a server to crash under certain client conditions, like terminating the client session abnormally. I forget the specifics but it was with a 5.1 or 5.2 client on Unix. I don't recall if the Windows client caused the same effect.

You may want to just check your client versions and ensure they are updated, or at least on the same level as known good clients. I have found that sometimes a fix introduces new, sometimes unrelated, problems.



cheers,

neil
 
Neil,



thanks for the suggestion. The client is running 5.3.0.0. I hadn't had any problems with that version up until now. Moving it up a patch level or two may well be worth doing.



Regards,



Tom
 
All,



An EXPORT/IMPORT NODE seems to have fixed things.



IBM support ran a trace and found that the ANR9999D errors were reporting trouble locating a stgpool or trouble with a header size mismatch. An audit of the storage pool volumes, though, came back clean. For the sake of completeness: the storage pool being used by the client causing the crashes is cached.



Hopefully, all is well with TSM once again.



Thanks to Neil and ITDrew for taking the time to post on this.



:)



Tom
 