TSM Server abruptly stopping

Here is the db2diag log from the above time period (17:42:00).
 

Attachments

  • db2diag_2017-06-08-17-42-11.txt
    1.1 MB · Views: 3
Ok. Stepping out of my area of expertise. It recognises that it needs to do a database backup and attempts to do one.


An error occurs. Something to do with database history? -
2017-06-08-17.42.27.345756-300 E24201089E365 LEVEL: Error
PID : 11448 TID : 140737160476448PROC : db2vend (DB2 Prune Agent - 154 (
INSTANCE: tsminst1 NODE : 000
FUNCTION: DB2 UDB, database utilities, sqluvdel, probe:1406
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x00000089=137 -- see TSM API Reference for meaning.

Database backup fails - error accessing media
2017-06-08-17.42.27.351358-300 E24201455E948 LEVEL: Error
PID : 4550 TID : 140657230341888PROC : db2sysc 0
INSTANCE: tsminst1 NODE : 000 DB : TSMDB1
APPHDL : 0-1357 APPID: *LOCAL.tsminst1.170608223008
AUTHID : TSMINST1
EDUID : 154 EDUNAME: db2agent (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluhVendorDelete, probe:1304
MESSAGE : SQL2062N An error occurred while accessing media "". Reason code:
"".
DATA #1 : String, 32 bytes
Error returned by vendor delete.
DATA #2 : Vendor RC, PD_DB2_TYPE_VENDOR_RC, 4 bytes
Vendor RC=0x0000001A=26 -- see DB2 API Guide for meaning.
DATA #3 : Hexdump, 48 bytes
0x00007FE8AEFD24B0 : 8900 0000 3134 3036 2031 3337 0000 0000 ....1406 137....
0x00007FE8AEFD24C0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x00007FE8AEFD24D0 : 0000 0000 0000 0000 0000 0000 0000 0000 ................

Unable to delete a backup image from August 2016
2017-06-08-17.42.27.364295-300 E24202404E522 LEVEL: Error
PID : 4550 TID : 140657230341888PROC : db2sysc 0
INSTANCE: tsminst1 NODE : 000 DB : TSMDB1
APPHDL : 0-1357 APPID: *LOCAL.tsminst1.170608223008
AUTHID : TSMINST1
EDUID : 154 EDUNAME: db2agent (TSMDB1) 0
FUNCTION: DB2 UDB, database utilities, sqluhPruneHistoryDeleteFile, probe:1570
MESSAGE : ADM8507N Unable to delete the backup image with timestamp
"20160830113109".

Then these messages cycle around.

It may be a support call is required but then there are some really clever people on this forum. First place I would look is space for the database backup. If that is ok, then pruning the DB2 history may be beneficial. A DB2 DBA would be helpful now...
 
I also noticed those errors. These logs I have posted from the 17:42 timeframe are when the db backup was running. This evening at about 20:07 it failed again, but I didn't find any "Unable to delete the backup image" error messages.
I did run a "db2 prune logfile prior to S0006490.LOG" about a week ago. Maybe I should have done a "
db2 prune history" instead?

I think we just got support, so will log a support call tomorrow if I can't get it resolved.

I have uploaded the db2diag from about 20:07 today for reference.
 

Attachments

  • db2diag_2017-06-08-20-07-11.txt
    417 KB · Views: 2
Ok, this last extract looks worse...

I presume you didn't run db2dart manually? This is a database consistency/integrity tool.

With the writing to db2dump and the db2dart program running, I think your DB is in bad shape. You need the support call. Still have the database backup from when it was working well?

2017-06-08-20.07.11.780863-300 I26460648E755 LEVEL: Severe
PID : 13903 TID : 140657180010240PROC : db2sysc 0
INSTANCE: tsminst1 NODE : 000 DB : TSMDB1
APPHDL : 0-1139 APPID: *LOCAL.tsminst1.170609010329
AUTHID : TSMINST1
EDUID : 198 EDUNAME: db2agent (TSMDB1) 0
FUNCTION: DB2 UDB, trace services, sqlt_logerr_string (secondary logging fu, probe:0
DATA #1 : String, 173 bytes
Running the following command before and after restart database
may provide useful problem determination information.
Replace DBNAME with the real DB name if not provided.
DATA #2 : String, 95 bytes
db2dart TSMDB1 /di /tsi 5 /oi 5 /ps 131082 /np 100 /v y /scr n /rptn iTsi5oi5pg131082o+99.rpt
 
So it stopped again. dsmserv.err has nothing added except for one line this morning, nothing for this evenings stop.
The callstack from the core you posted earlier, is it from this latest crash or from before you updated the ulimit?
 
The "getcoreinfo.tx" was from before the ulimit change. The two db2diag text files I uploaded were from after the ulimit change.
 
The "getcoreinfo.tx" was from before the ulimit change. The two db2diag text files I uploaded were from after the ulimit change.
I'd like to see the callstack from the latest core. Check /var/log/messages for the name and date/time of the last core:
grep "Saved core dump" /var/log/messages
 
This was the last core:
Jun 6 22:53:29 tsm abrt[5780]: Saved core dump of pid 2359 (/opt/tivoli/tsm/server/bin/dsmserv) to /var/spool/abrt/ccpp-2017-06-06-22:53:28-2359 (168144896 bytes)

The getcoreinfo.txt that I posted earlier was from June 4th, so I will post the getcoreinfo.txt from this last core. The getcoreinfo.txt is the callstack that you are referring to correct?
 
Here is the getcoreinfo.txt from the last core.
 

Attachments

  • getcoreinfo.txt
    63.6 KB · Views: 2
Anyone see an issue with the log location or permissions:

I show that /home/tsminst1/dsmserv.opt has the values of:

ACTIVELOGDirectory /home/tsminst1/activelog
ACTIVELOGSize 8192
ARCHLOGDirectory /home/tsminst1/archlog


I show that “db2 get db cfg for TSMDB1 | grep -i log” has values of:

Path to log files = /home/tsminst1/activelog/NODE0000/

Overflow log path (OVERFLOWLOGPATH) = /home/tsminst1/archlog/RstDbLog/NODE0000/

First log archive method (LOGARCHMETH1) = DISK:/home/tsminst1/archlog/archmeth1/

Failover log archive path (FAILARCHPATH) = /home/tsminst1/archlog/failarch/


[tsminst1@tsm ~]$ ls -alh

total 1.5G

drwxr-xr-x 16 tsminst1 tsminst1 4.0K Jun 9 14:54 .
drwxr-xr-x. 4 root root 4.0K Aug 4 2015 ..
drwxrwxr-x 3 tsminst1 tsminst1 4.0K Jul 2 2015 activelog
-rw-rw-r-- 1 tsminst1 tsminst1 1.3M Jun 9 09:34 actlog.out
drwxrwxr-x 2 tsminst1 tsminst1 4.0K Jun 9 12:12 archivelog
drwxrwxr-x 5 tsminst1 tsminst1 4.0K Jul 2 2015 archlog


[tsminst1@tsm ~]$ ls /home/tsminst1/activelog/NODE0000/ -alh
total 8.1G
drwxr-x--- 2 tsminst1 tsminst1 4.0K Jun 9 15:17 .
drwxrwxr-x 3 tsminst1 tsminst1 4.0K Jul 2 2015 ..
-rw------- 1 tsminst1 tsminst1 513M Jun 9 15:51 S0006549.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 07:20 S0006550.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 10:12 S0006551.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 13:27 S0006552.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 13:41 S0006553.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 17:56 S0006554.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 18:20 S0006555.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 18:21 S0006556.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 8 21:43 S0006557.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 09:21 S0006558.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 11:16 S0006559.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 13:17 S0006560.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 14:33 S0006561.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 14:53 S0006562.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 15:02 S0006563.LOG
-rw------- 1 tsminst1 tsminst1 513M Jun 9 15:17 S0006564.LOG
-rw------- 1 tsminst1 tsminst1 512 Jun 3 20:05 SQLLPATH.TAG


[tsminst1@tsm ~]$ ls /home/tsminst1/archlog/archmeth1/ -alh
total 12K
drwxr-x--- 3 tsminst1 tsminst1 4.0K Jul 2 2015 .
drwxrwxr-x 5 tsminst1 tsminst1 4.0K Jul 2 2015 ..
drwxr-x--- 3 tsminst1 tsminst1 4.0K Jul 2 2015 tsminst1


[tsminst1@tsm ~]$ ls /home/tsminst1/archlog/archmeth1/tsminst1/TSMDB1/NODE0000/C0000000/ -alh
total 2.1G
drwxr-x--- 2 tsminst1 tsminst1 4.0K Jun 9 14:49 .
drwxr-x--- 3 tsminst1 tsminst1 4.0K Jul 2 2015 ..
-rw-r----- 1 tsminst1 tsminst1 513M Aug 29 2016 CanDeleteThisToMakeFreeSpace1.log (FYI – last year the file system filled up, so created these to be deleted if it happened again.)
-rw-r----- 1 tsminst1 tsminst1 513M Aug 29 2016 CanDeleteThisToMakeFreeSpace2.log
-rw-r----- 1 tsminst1 tsminst1 513M Aug 29 2016 CanDeleteThisToMakeFreeSpace3.log
-rw-r----- 1 tsminst1 tsminst1 513M Aug 29 2016 CanDeleteThisToMakeFreeSpace4.log
-rw-r----- 1 tsminst1 tsminst1 11M Jun 8 17:42 S0006538.LOG
-rw-r----- 1 tsminst1 tsminst1 552K Jun 8 17:54 S0006539.LOG
-rw-r----- 1 tsminst1 tsminst1 16K Jun 8 18:20 S0006540.LOG
-rw-r----- 1 tsminst1 tsminst1 17M Jun 8 21:42 S0006541.LOG
-rw-r----- 1 tsminst1 tsminst1 18M Jun 9 09:21 S0006542.LOG
-rw-r----- 1 tsminst1 tsminst1 18M Jun 9 11:15 S0006543.LOG
-rw-r----- 1 tsminst1 tsminst1 200K Jun 9 13:17 S0006544.LOG
-rw-r----- 1 tsminst1 tsminst1 7.8M Jun 9 13:25 S0006545.LOG
-rw-r----- 1 tsminst1 tsminst1 9.3M Jun 9 14:49 S0006546.LOG
[tsminst1@tsm ~]$
 
Update: I am doing a db restore back to several days ago due to db issues. Did the restore about 6 hours ago, and now the audit volume (fix=yes) is running.
 
Back
Top