TSM Server abruptly stopping

crowder

Hello,

I have a TSM server that was running fine, but it has suddenly started stopping abruptly. I can reboot the server, start the dsmserv.rc service, and it will start processing backups. Then, about 1.5 hours into the backups, it stops. There are 2 backup schedules for this TSM. Two things I have noticed: 1. If I reboot the server in the morning, dsmserv stops quickly once the evening (6pm) backups start. 2. If I reboot the server right before backups start, it runs for about 1.5-3 hours before failing, which happens to be after backup2 starts.


Backup1 inspects about 120,000 files (about 1.7 TB) and backs up about 100 files, about 75 GB total.
Backup2 inspects about 700,000 files (about 2.8 TB) and backs up about 200 files, about 90 GB total.

TSM Info:
IBM Tivoli Storage Manager
Command Line Administrative Interface - Version 6, Release 3, Level 0.0
(c) Copyright by IBM Corporation and other(s) 1990, 2011. All Rights Reserved.

Server Version 6, Release 3, Level 4.0

Linux Info:
CentOS release 6.5 (Final)
Linux tsm 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

/var/log/messages info:
Jun 6 21:08:13 tsm auditd[1783]: Audit daemon rotating log files
Jun 6 22:53:28 tsm kernel: dsmserv[2901]: segfault at 198 ip 0000000000abb78b sp 00007fffc27c05d0 error 4 in dsmserv[400000+e95000]
Jun 6 22:53:29 tsm abrtd: Directory 'ccpp-2017-06-06-22:53:28-2359' creation detected
Jun 6 22:53:29 tsm abrt[5780]: Saved core dump of pid 2359 (/opt/tivoli/tsm/server/bin/dsmserv) to /var/spool/abrt/ccpp-2017-06-06-22:53:28-2359 (168144896 bytes)
Jun 6 22:53:29 tsm abrtd: Package 'TIVsm-server' isn't signed with proper key
Jun 6 22:53:29 tsm abrtd: 'post-create' on '/var/spool/abrt/ccpp-2017-06-06-22:53:28-2359' exited with 1
Jun 6 22:53:29 tsm abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2017-06-06-22:53:28-2359'
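
For anyone reading the kernel segfault line above, here is a minimal Python sketch that pulls the fields apart. It is only an illustration: the helper name `decode_segfault` is made up, and the bit meanings are the standard x86 page-fault error-code bits (bit 0 = page was present, bit 1 = write access, bit 2 = user mode).

```python
import re

# Kernel line copied from the /var/log/messages excerpt above.
LINE = ("Jun 6 22:53:28 tsm kernel: dsmserv[2901]: segfault at 198 "
        "ip 0000000000abb78b sp 00007fffc27c05d0 error 4 "
        "in dsmserv[400000+e95000]")

PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<task_id>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Split a kernel 'segfault at ...' line into its numeric fields."""
    m = PATTERN.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)          # the kernel prints this value in hex
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_addr": int(m.group("addr"), 16),
        "offset_in_object": ip - base,     # instruction offset inside the mapping
        "page_present": bool(err & 1),     # False: no page mapped at that address
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }

if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

For this line the sketch reports a user-mode read of an unmapped page at address 0x198. A fault address that small usually points to a NULL (or near-NULL) pointer being dereferenced inside dsmserv, but the call stack from the core is what would confirm that.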

Let me know what other information I can provide to help.

Thank you,
Michael
 
DB info:

DB21085I Instance "abc" uses "64" bits and DB2 code release "SQL09076"
with level identifier "08070107".
Informational tokens are "DB2 v9.7.0.6", "special_29869", "IP23328_29869", and
Fix Pack "6".
Product is installed at "/opt/tivoli/tsm/db2".
 
How do I view open files?

This TSM only backs up Linux file systems, no Windows file systems.

TSM user: ulimit -Hu = 773682
Root Linux user: 773682

/etc/security/limits.d/90-nproc.conf had:
* soft nproc 1024
root soft nproc unlimited

I just added to 90-nproc.conf:
tsminst1 - nproc 16384
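
On the "how do I view open files" question: `lsof -p <pid>` is the usual tool, but the same information is in `/proc` on Linux. The sketch below is a minimal illustration with made-up helper names (`count_open_fds`, `read_nofile_limits`). It counts the descriptors a process has open and reads the limits that process is actually running with, which can differ from what `ulimit` shows in a fresh login shell.

```python
import os

def count_open_fds(pid):
    """Count file descriptors currently open in process `pid` (Linux /proc)."""
    return len(os.listdir("/proc/%d/fd" % pid))

def read_nofile_limits(pid):
    """Return (soft, hard) 'Max open files' strings from /proc/<pid>/limits."""
    with open("/proc/%d/limits" % pid) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                fields = line.split()
                # Layout: Max open files <soft> <hard> files
                return fields[3], fields[4]
    return None

if __name__ == "__main__":
    # os.getpid() is used only so the sketch runs anywhere;
    # substitute the PID of the running dsmserv process.
    pid = os.getpid()
    print("open fds:", count_open_fds(pid))
    print("nofile (soft, hard):", read_nofile_limits(pid))
```

Reading another user's process this way needs to be done as that user or as root.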


Backup01:
Node Name: Backup01
Platform: Linux x86-64
Client OS Level: 2.6.32-358.el6.x86_
Client Version: Version 6, release 3, level 0.0
Policy Domain Name: FILESERVER
Last Access Date/Time: 06/07/2017 11:13:25
Days Since Last Access: <1
Password Set Date/Time: 04/26/2017 21:51:03
Days Since Password Set: 42
Invalid Sign-on Count: 0
Locked?: No
Contact: Linux Group
Compression: Client
Archive Delete Allowed?: Yes
Backup Delete Allowed?: Yes
Registration Date/Time: 08/04/2015 15:27:13
Node Type: Client
Keep Mount Point?: No
Maximum Mount Points Allowed: 4
Auto Filespace Rename : No
Validate Protocol: No
Transaction Group Max: 0
Data Write Path: ANY
Data Read Path: ANY
Session Initiation: ClientOrServer



q status:
Server Name: TSM
Server host name or IP address:
Server TCP/IP port number: 1500
Crossdefine: Off
Server Password Set: No
Server Installation Date/Time: 07/02/2015 15:28:21
Server Restart Date/Time: 06/07/2017 12:18:23
Authentication: On
Password Expiration Period: 90 Day(s)
Invalid Sign-on Attempt Limit: 0
Minimum Password Length: 0
Registration: Closed
Subfile Backup: No
Availability: Enabled
Inbound Sessions Disabled:
Outbound Sessions Disabled:
Accounting: Off
Activity Log Retention: 30 Day(s)
Activity Log Number of Records: 48961
Activity Log Size: 1 M
Activity Summary Retention Period: 30 Day(s)
License Audit Period: 30 Day(s)
Last License Audit: 05/12/2017 09:05:21
Server License Compliance: Valid
Central Scheduler: Disabled
Maximum Sessions: 25
Maximum Scheduled Sessions: 12
Event Record Retention Period: 10 Day(s)
Client Action Duration: 5 Day(s)
Schedule Randomization Percentage: 25
Query Schedule Period: Client
Maximum Command Retries: Client
Retry Period: Client
Client-side Deduplication Verification Level: 0 %
Scheduling Modes: Any
Active Receivers: CONSOLE ACTLOG
Configuration manager?: Off
Refresh interval: 60
Last refresh date/time:
Context Messaging: Off
Table of Contents (TOC) Load Retention: 120 Minute(s)
Machine Globally Unique ID: (I removed for post)
Archive Retention Protection: Off
Database Reporting Mode: Partial
Database Directories: /tsmdb/tsmdb001,/tsmdb/tsmdb002,/tsmdb/tsmdb003,/tsmdb/tsmdb004
Total Size of File System (MB): 90,713.99
Space Used on File System (MB): 67,958.01
Free Space Available (MB): 22,755.98
Encryption Strength: AES
Client CPU Information Refresh Interval: 180
Outbound Replication: Enabled
Target Replication Server:
Default Replication Rule for Archive: ALL_DATA
Default Replication Rule for Backup: ALL_DATA
Default Replication Rule for Space Management: ALL_DATA
Replication Record Retention Period: 30 Day(s)
LDAP User:
LDAP Password Set: No
Default Authentication: Local
 
What does the actlog show right before it stops? Do you have enough free active/archive log? Any other processes trying to run during that time frame?
 
No other processes that I am aware of. This server is only used for TSM. Backup of the db is done before anything else starts.

I believe there is enough free active/archive log. There is plenty of space if it goes over the 8GB.
ACTIVELOGSize 8192

Actlog just stops logging at the time of stop. When/if it stops tonight, I will post the tail end of the actlog.
 
Sounds like a core. Check dsmffdc.log and/or dsmserv.err in the instance directory.

Get the call stack from the core dump using:
# cd /opt/tivoli/tsm/server/bin
# getcoreinfo ./dsmserv /path/to/core

Note:
Replace /path/to/core with the full path and file name of the generated core file (the path was in the /var/log/messages excerpt in one of your earlier posts).
The getcoreinfo command will generate the getcoreinfo.txt and getcoreinfo-shlibs.tar.gz files in the current directory.
source: http://www-01.ibm.com/support/docview.wss?uid=swg21232317

Look at getcoreinfo.txt for the callstack. Take one or two of the functions and do a search on the Spectrum Protect Support Page to see if there are matching APARs.

As an example, getcoreinfo.txt will look like this (functions and buffers will be different), so you could pick the functions in lines 1 and 2 to do your search:
#0 0x0000000000a660af in icGetDBBackupVolList (isFile=False, dbBackupVolListP=0x7fffc1659db8) at icutil.c:2254
#1 0x0000000000a3ea99 in ShareAudit (auditP=0x7fffa00275ce) at mmsshr.c:1930
#2 0x00000000009d245a in AuditThread (argP=<value optimized out>) at mmslib.c:15857
#3 0x0000000000dfc3e6 in StartThread (startInfoP=0x7fff7c061008) at pkthread.c:3369
#4 0x000000392a0079d1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000039298e88fd in clone () from /lib64/libc.so.6
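
If the stack is long, a small script can list the function names to search for. This is a rough sketch, not an IBM tool: `frame_functions` is a made-up helper, and it assumes the gdb-style `#N 0xADDR in func (...)` frame layout shown above. Frames that come from shared libraries (`from /lib64/...`) are skipped, since they are unlikely to match an APAR.

```python
import re

# One gdb-style frame: "#N 0xADDR in function (args) ..."
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?(\S+)\s*\(")

def frame_functions(text, skip_libraries=True):
    """Return function names from a backtrace, in frame order."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        m = FRAME.match(line)
        if not m:
            continue
        if skip_libraries and " from /" in line:
            continue  # libc/libpthread frames are not server functions
        names.append(m.group(2))
    return names

# Three frames copied from the example stack above.
SAMPLE = """\
#0 0x0000000000a660af in icGetDBBackupVolList (isFile=False, dbBackupVolListP=0x7fffc1659db8) at icutil.c:2254
#1 0x0000000000a3ea99 in ShareAudit (auditP=0x7fffa00275ce) at mmsshr.c:1930
#4 0x000000392a0079d1 in start_thread () from /lib64/libpthread.so.0
"""

if __name__ == "__main__":
    print(frame_functions(SAMPLE))
```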
 
moon-buddy:

[tsminst1@tsm bin]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 773682
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16384
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

[tsminst1@tsm bin]$ date
Thu Jun 8 09:20:26 CDT 2017
 
RecoveryOne: This is what the actlog said when TSM started having problems.

06/08/2017 07:48:29 ANR0407I Session 9 started for administrator MIKE (Linux
x86-64) (Tcp/Ip *tsm.*.com(35170)). (SESSION: 9)
06/08/2017 07:48:29 ANR2017I Administrator MIKE issued command: QUERY SESSION
(SESSION: 9)
06/08/2017 07:49:57 ANR0229W Server is unable to add entries to the Activity
Log. Console messages will not be logged until database
access is available.
06/08/2017 07:50:04 ANR0106E admattrm.c(396): Unexpected error 4522 fetching
row in table "Global.Attributes".
06/08/2017 07:50:04 ANR0106E cscmdsch.c(555): Unexpected error 4522 fetching
row in table "Schedule.Pending".
06/08/2017 07:50:04 ANR9999D_4073936223 CsCmdSchedulerThread(cscmdsch.c:317)
Thread<68>: Invalid recovery criteria (9999) in the
central scheduler - the task is terminating.
06/08/2017 07:50:04 ANR9999D Thread<68> issued message 9999 from:
06/08/2017 07:50:04 ANR9999D Thread<68> 0x00000000dc15a3 OutDiagToCons
06/08/2017 07:50:04 ANR9999D Thread<68> 0x00000000dc43a5 outDiagfExt
06/08/2017 07:50:04 ANR9999D Thread<68> 0x000000007a5a3a CsCmdSchedulerThread
06/08/2017 07:50:04 ANR9999D Thread<68> 0x00000000e5905a StartThread
06/08/2017 07:50:04 ANR9999D Thread<68> 0x00003bb44079d1 *UNKNOWN*
06/08/2017 07:50:04 ANR9999D Thread<68> 0x00003bb40e8b6d *UNKNOWN*
06/08/2017 07:50:04 ANR9999D_1185457118 GetGlobalReorgVal(tbreorg.c:6690)
Thread<71>: Unexpected rc 2427 from admFetchAttr for
REORG_STARTDELAY.
06/08/2017 07:50:04 ANR9999D Thread<71> issued message 9999 from:
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000dc15a3 OutDiagToCons
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000dc43a5 outDiagfExt
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000afa6a8 GetGlobalReorgVal
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000aff9f0 RdbReorg
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000ae4418
RdbMonitorStatsThread
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00000000e5905a StartThread
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00003bb44079d1 *UNKNOWN*
06/08/2017 07:50:04 ANR9999D Thread<71> 0x00003bb40e8b6d *UNKNOWN*
06/08/2017 07:50:38 ANR2104I Activity log processing is now restarted.
06/08/2017 07:50:44 ANR0101E bfdedup.c(9813): Error 4522 opening table
"BF.Dereferenced.Chunks".
06/08/2017 07:50:44 ANR0171I dbitxn.c(731): Error detected on 0:2, database in
evaluation mode.
06/08/2017 07:50:44 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
06/08/2017 07:56:24 ANR0482W Session 7 for node BACKUP02_VNX (Linux
x86-64) terminated - idle for more than 15 minutes.
(SESSION: 7)
06/08/2017 07:56:24 ANR0171I dbitxn.c(731): Error detected on 0:33, database
in evaluation mode. (SESSION: 7)
06/08/2017 07:56:24 ANR0171I dbitxn.c(731): Error detected on 0:34, database
in evaluation mode. (SESSION: 7)
06/08/2017 08:02:01 ANR0407I Session 18 started for administrator TSMINST1
(Linux x86-64) (Tcp/Ip *tsm.*.com(35181)). (SESSION:
18)
06/08/2017 08:12:48 ANR0171I dsalloc.c(3685): Error detected on 36:7, database
in evaluation mode. (SESSION: 8)
06/08/2017 08:12:48 ANR0157W Database operation INSERT for table DS.Overflow
failed with result code 4522 and tracking ID: 0x1a5f278.
(SESSION: 8)
06/08/2017 08:12:48 ANR0158W Database operation INSERT for table DS.Overflow
failed with operation code 4522 and tracking id
0x1a5f278. The data for column 0 is: (int32)3. (SESSION:
8)
06/08/2017 08:12:48 ANR0158W Database operation INSERT for table DS.Overflow
failed with operation code 4522 and tracking id
0x1a5f278. The data for column 1 is: (int32)31. (SESSION:
8)
06/08/2017 08:12:48 ANR0158W Database operation INSERT for table DS.Overflow
failed with operation code 4522 and tracking id
0x1a5f278. The data for column 2 is: (int32)0. (SESSION:
8)
06/08/2017 08:12:48 ANR0102E dsalloc.c(3698): Error 4522 inserting row in
table "DS.Overflow". (SESSION: 8)
06/08/2017 08:12:48 ANR0530W Transaction failed for session 8 for node
BACKUP02_VNX (Linux x86-64) - internal server
error detected. (SESSION: 8)
06/08/2017 08:12:48 ANR0403I Session 8 ended for node BACKUP02_VNX
(Linux x86-64). (SESSION: 8)
06/08/2017 08:12:48 ANR0171I lmutil.c(1256): Error detected on 35:10, database
in evaluation mode. (SESSION: 8)
06/08/2017 08:12:48 ANR0157W Database operation FETCH for table
License.Details failed with result code 4522 and tracking
ID: 0x7fffe00491d8. (SESSION: 8)
06/08/2017 08:12:48 ANR0158W Database operation FETCH for table
License.Details failed with operation code 4522 and
tracking id 0x7fffe00491d8. The data for column 0 is:
(int16)27. (SESSION: 8)
06/08/2017 08:12:48 ANR0158W Database operation FETCH for table
License.Details failed with operation code 4522 and
tracking id 0x7fffe00491d8. The data for column 1 is:
(string, len=20)0x44434241434B555030325F564E585F41524348-
31. (SESSION: 8)
06/08/2017 08:12:48 ANR0157W Database operation DELETE for table
License.Details failed with result code 4522 and tracking
ID: 0x7fffe00491d8. (SESSION: 8)
06/08/2017 08:12:48 ANR0158W Database operation DELETE for table
License.Details failed with operation code 4522 and
tracking id 0x7fffe00491d8. The data for column 0 is:
(int16)27. (SESSION: 8)
06/08/2017 08:12:48 ANR0158W Database operation DELETE for table
License.Details failed with operation code 4522 and
tracking id 0x7fffe00491d8. The data for column 1 is:
(string, len=20)0x44434241434B555030325F564E585F41524348-
31. (SESSION: 8)
06/08/2017 08:12:48 ANR0106E lmutil.c(1283): Unexpected error 4522 fetching
row in table "License.Details". (SESSION: 8)
06/08/2017 08:21:02 ANR0171I dbitxn.c(731): Error detected on 0:24, database
in evaluation mode.
06/08/2017 08:21:02 ANR3619W The user limit for open files is below the
recommended minimum value of 8192.
06/08/2017 08:21:05 ANR0171I bfshred.c(3316): Error detected on 31:2, database
in evaluation mode.
06/08/2017 08:21:05 ANR0106E bfshred.c(3855): Unexpected error 4522 fetching
row in table "BF.Shred.Bitfiles".
06/08/2017 08:22:04 ANR0171I tbrsql.c(2798): Error detected on 9:2, database
in evaluation mode.
06/08/2017 08:22:04 ANR0101E bfdedup.c(10499): Error 4522 opening table
"BF.Queued.Chunks".
06/08/2017 08:22:04 ANR0171I dbitxn.c(731): Error detected on 0:9, database in
evaluation mode.
06/08/2017 08:22:04 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
06/08/2017 08:22:04 ANR0171I tbrsql.c(2798): Error detected on 14:2, database
in evaluation mode.
06/08/2017 08:22:04 ANR0101E bfdedup.c(10499): Error 4522 opening table
"BF.Queued.Chunks".
06/08/2017 08:22:04 ANR0171I dbitxn.c(731): Error detected on 0:14, database
in evaluation mode.
06/08/2017 08:22:04 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
06/08/2017 08:36:39 ANR0171I dbitxn.c(731): Error detected on 0:32, database
in evaluation mode. (SESSION: 9)
06/08/2017 08:36:39 ANR0171I dbitxn.c(731): Error detected on 0:30, database
in evaluation mode. (SESSION: 9)
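
Because the actlog output wraps each message over several lines, it can help to rejoin the records before looking for the first sign of trouble. The sketch below is only an illustration with made-up helper names (`join_records`, `first_matching`). It assumes every message starts with a `MM/DD/YYYY HH:MM:SS ANRnnnnX` prefix, as in the excerpt above, and treats every other line as a continuation.

```python
import re
from collections import Counter

# Start of a message: date, time, then the ANR message ID.
START = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (ANR\d{4}[A-Z])")

def join_records(lines):
    """Rejoin wrapped actlog lines into (timestamp, msg_id, text) tuples."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = START.match(line)
        if m:
            records.append([m.group(1), m.group(2), line[m.end():].strip()])
        elif records:
            records[-1][2] += " " + line   # continuation of the previous message
    return [tuple(r) for r in records]

def first_matching(records, needle):
    """Return the first record whose text contains `needle`, or None."""
    for rec in records:
        if needle in rec[2]:
            return rec
    return None

# A few messages copied from the excerpt above.
SAMPLE = """\
06/08/2017 07:49:57 ANR0229W Server is unable to add entries to the Activity
Log. Console messages will not be logged until database
access is available.
06/08/2017 07:50:04 ANR0106E admattrm.c(396): Unexpected error 4522 fetching
row in table "Global.Attributes".
06/08/2017 07:50:44 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
"""

if __name__ == "__main__":
    recs = join_records(SAMPLE.splitlines())
    print(Counter(msg_id for _, msg_id, _ in recs))
    print(first_matching(recs, "SQLSTATE=08003"))
```

In this excerpt the earliest message is ANR0229W at 07:49:57, and the CLI0106E "Connection is closed" detail follows, so the time to line up against the DB2 side is shortly before 07:50.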
 
Increase open files to 16384
 
/etc/security/limits.conf has:
tsminst1 soft nofile 4096
tsminst1 hard nofile 10240

/etc/sysctl.conf has:
fs.file-max = 16384


So, update
tsminst1 hard nofile 10240
to be
tsminst1 hard nofile 16384
 
[tsminst1@tsm ~]$ ulimit -Hn
16384
[tsminst1@tsm ~]$ ulimit -Sn
4096
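
The output above shows the hard limit at 16384 while the soft limit is still 4096, and the soft limit is what a newly started process runs with. The ANR3619W message in the actlog asks for at least 8192. As a minimal sketch of how the two values relate, the Python below uses the standard `resource` module. The helper name `raise_soft_nofile` is made up, and this only changes the limit of the Python process itself, not dsmserv.

```python
import resource

def raise_soft_nofile(target):
    """Raise this process's soft open-file limit toward `target`.

    The soft limit can be raised without root, but only up to the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)         # cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard                  # already high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_soft_nofile(8192))
```

For the server itself, the equivalent would presumably be raising the `tsminst1 soft nofile` line in /etc/security/limits.conf as well, or running `ulimit -n` with a higher value in the shell that starts dsmserv.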
 
Rebooted server, logged back in, started TSM server and get the following:

[tsminst1@tsm ~]$ ulimit -Sn
4096
[tsminst1@tsm ~]$ ulimit -Hn
16384
[tsminst1@tsm ~]$ ulimit -n
4096
[tsminst1@tsm ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 773682
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16384
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

vs root:

[root@tsm ~]# ulimit -Sn
1024
[root@tsm ~]# ulimit -Hn
4096
[root@tsm ~]# ulimit -n
1024
[root@tsm ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 773682
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 773682
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
 
marclant: Here is the info from getcoreinfo.txt. I am looking up the functions as you suggested.


[Thread debugging using libthread_db enabled]
Core was generated by `/opt/tivoli/tsm/server/bin/dsmserv -u tsminst1 -i /home/tsminst1 -q'.
Program terminated with signal 6, Aborted.
#0 0x0000003bb4032925 in raise () from /lib64/libc.so.6

=========================================================================

#0 0x0000003bb4032925 in raise () from /lib64/libc.so.6
#1 0x0000003bb4034105 in abort () from /lib64/libc.so.6
#2 0x0000000000e65f46 in PsAbortServer (argP=<value optimized out>) at psthread.c:538
#3 0x0000000000e58385 in pkAbort (abortMsgP=0x0) at pkthread.c:680
#4 0x0000000000e5ac05 in pkAcquireMutexTracked (mutexP=0x1729b50, file=0x11d2a32 "pkthread.c", line=2404) at pkmon.c:631
#5 0x0000000000e572e1 in PkShowThreadsInt (stream=0x0, detail=2) at pkthread.c:2404
#6 0x0000000000e5ad38 in pkAcquireMutexTracked (mutexP=0x171c4c8, file=0x118a15b "output.c", line=6652) at pkmon.c:627
#7 0x0000000000dbe7cc in StdPutText (fmtStr=0x7fff88018604 "Server PID: 30059~~", argP=0x0, myStream=0x0, msgType=OUTTYPE_TEXT) at output.c:6652
#8 0x0000000000dc4171 in outPrintf (stream=0x0, fmtStr=<value optimized out>) at outvarg.c:267
#9 0x0000000000e572f9 in PkShowThreadsInt (stream=0x0, detail=2) at pkthread.c:2407
#10 0x0000000000e5ad38 in pkAcquireMutexTracked (mutexP=0x171c4c8, file=0x118a15b "output.c", line=6652) at pkmon.c:627
#11 0x0000000000dbe7cc in StdPutText (fmtStr=0x7fffc29af980 "ANR9999D_1285327590 pkLogicAbort(pkthread.c:713) Thread<1593>: Run-time assertion failed: \"listP != NULL\", Thread 1593 (tid 140736458336000), File outinit.c, Line 1641, reason: OUT023.~", argP=0x0, myStream=0x0, msgType=OUTTYPE_MESSAGE) at output.c:6652
#12 0x0000000000dc15a3 in OutDiagToCons (crc=1285327590, func=<value optimized out>, srcFile=0x11d2a32 "pkthread.c", srcLine=713, diagMsgP=0x7fff88014ff1 "Run-time assertion failed: \"listP != NULL\", Thread 1593 (tid 140736458336000), File outinit.c, Line 1641, reason: OUT023.~") at output.c:1315
#13 0x0000000000dc43a5 in outDiagfExt (func=0x11d384d "pkLogicAbort", srcFile=0x11d2a32 "pkthread.c", srcLine=713, fmtStr=0x11d34e0 "Run-time assertion failed: \"%s\", Thread %u (tid %s), File %s, Line %u, reason: %s.~") at outvarg.c:223
#14 0x0000000000e59c20 in pkLogicAbort (exprP=0x11be7a9 "listP != NULL", fileP=0x11be74e "outinit.c", lineNum=1641, abortMsgP=0x11be7a2 "OUT023") at pkthread.c:711
#15 0x0000000000dd2211 in outCloseStream (stream=0x7fffa8013048) at outinit.c:1641
#16 0x00000000004b0d99 in FinishActLogThread (insCtlP=<value optimized out>, closeStream=True, inRc=-1, file=<value optimized out>, lineNum=<value optimized out>) at admactlg.c:4907
#17 0x00000000004b53b9 in AdmActivityLogThread (notused=<value optimized out>) at admactlg.c:3320
#18 0x0000000000e5905a in StartThread (startInfoP=0x7fff94001c78) at pkthread.c:3372
#19 0x0000003bb44079d1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003bb40e8b6d in clone () from /lib64/libc.so.6
 
Try running your operations to see if this fixes the issue.
 
Server locked up, so I am rebooting it. A quick way for me to know when it has hung is to issue a "q sched" and see whether the schedule is returned or the system hangs.
 
So it stopped again. dsmserv.err has nothing added except for one line this morning, and nothing for this evening's stop.

lin_tape.errorlog has several of these:
IBMtape0-----00093 Thu Jun 8 18:30:23 2017
Scsi Path : 00 00 00 01
CDB Command : 5A 08 24 00 00 00 00 28 08 00
Status Code : 08 00 00 01
Sense Data : 70 00 05 00 00 00 00 1C 00 00 00 00 24 00 00 CD
00 02 00 00 00 00 20 20 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Description : Illegal Request
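
The sense bytes in that entry can be decoded by hand. Here is a minimal Python sketch, assuming the standard SCSI fixed-format sense layout (response code 0x70, sense key in byte 2, ASC/ASCQ in bytes 12 and 13, and for Illegal Request a field pointer in bytes 15 to 17). The helper name `decode_fixed_sense` is made up, and the sense-key table is deliberately partial.

```python
# Partial table of SCSI sense keys (low nibble of byte 2).
SENSE_KEYS = {
    0x0: "No Sense", 0x1: "Recovered Error", 0x2: "Not Ready",
    0x3: "Medium Error", 0x4: "Hardware Error", 0x5: "Illegal Request",
    0x6: "Unit Attention", 0x7: "Data Protect",
}

def decode_fixed_sense(hex_bytes):
    """Decode the main fields of fixed-format SCSI sense data."""
    data = bytes.fromhex(hex_bytes)        # spaces between bytes are ignored
    if len(data) < 18 or (data[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("expected at least 18 bytes of fixed-format sense data")
    key = data[2] & 0x0F
    info = {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "other"),
        "asc": data[12],
        "ascq": data[13],
    }
    # For Illegal Request, byte 15 bit 7 (SKSV) marks the field pointer as valid.
    if key == 0x5 and data[15] & 0x80:
        info["error_in_cdb"] = bool(data[15] & 0x40)          # C/D bit
        info["field_pointer"] = int.from_bytes(data[16:18], "big")
    return info

# First 18 sense bytes copied from the lin_tape.errorlog entry above.
SENSE = "70 00 05 00 00 00 00 1C 00 00 00 00 24 00 00 CD 00 02"

if __name__ == "__main__":
    print(decode_fixed_sense(SENSE))
```

Under that layout the entry reads as Illegal Request with ASC/ASCQ 24/00 (invalid field in CDB) and a field pointer at CDB byte 2, which is the 0x24 page code of the MODE SENSE(10) command (opcode 5A). That would suggest the drive rejected a mode page request from the driver, and it may well be unrelated to the database connection errors.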


/var/log/messages had:
Jun 8 17:44:54 dctsm kernel: lin_tape: IBMtape0-----00093 tape_modesense10_page failed: -22
Jun 8 17:44:54 dctsm kernel: lin_tape: IBMtape0-----00093 driver_byte 08, host_byte 07, msg_byte 00, status_byte 01
Jun 8 17:44:59 dctsm kernel: lin_tape: IBMtape0-----00093 driver_byte 08, host_byte 07, msg_byte 00, status_byte 01
Jun 8 17:44:59 dctsm kernel: lin_tape: IBMtape0-----00093 tape_modesense10_page failed: -22
Jun 8 17:54:43 dctsm kernel: lin_tape: IBMtape0-----00093 driver_byte 08, host_byte 07, msg_byte 00, status_byte 01

Actlog had:
06/08/2017 17:42:24 ANR0406I Session 16 started for node $$_TSMDBMGR_$$
(DB2/LINUXX8664) (Tcp/Ip localhost(48274)). (SESSION: 16)
06/08/2017 17:42:26 ANR1361I Output volume /dbbackup/db/96961001.DBV closed.
(SESSION: 16)
06/08/2017 17:42:26 ANR0514I Session 16 closed volume
/dbbackup/db/96961001.DBV. (SESSION: 16)
06/08/2017 17:42:26 ANR0403I Session 16 ended for node $$_TSMDBMGR_$$
(DB2/LINUXX8664). (SESSION: 16)
06/08/2017 17:43:05 ANR0171I tbrsql.c(2798): Error detected on 6:2, database
in evaluation mode.
06/08/2017 17:43:05 ANR0101E bfdedup.c(10499): Error 4522 opening table
"BF.Queued.Chunks".
06/08/2017 17:43:05 ANR0171I dbitxn.c(731): Error detected on 0:6, database in
evaluation mode.
06/08/2017 17:43:05 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
06/08/2017 17:43:05 ANR0171I tbrsql.c(2798): Error detected on 17:2, database
in evaluation mode.
06/08/2017 17:43:05 ANR0101E bfdedup.c(10499): Error 4522 opening table
"BF.Queued.Chunks".
06/08/2017 17:43:05 ANR0171I dbitxn.c(731): Error detected on 0:17, database
in evaluation mode.
06/08/2017 17:43:05 ANR0162W Supplemental database diagnostic information:
-1:08003:-99999 ([IBM][CLI Driver] CLI0106E Connection
is closed. SQLSTATE=08003).
06/08/2017 17:43:53 ANR0171I icvolhst.c(3789): Error detected on 25:2,
database in evaluation mode.
06/08/2017 17:43:53 ANR0106E icvolhst.c(3854): Unexpected error 4522 fetching
row in table "Seq.Volume.History".
06/08/2017 17:43:53 ANR4538E The server could not write sequential volume
history information to the volhist.dat.20170608174353
temporary file.
06/08/2017 17:43:53 ANR4550I Full database backup (process 2) completed.
(SESSION: 14, PROCESS: 2)
06/08/2017 17:43:53 ANR2183W icvolhst.c(7709): Transaction 0:7270 was aborted.
06/08/2017 17:44:53 ANR0985I Process 2 for Database Backup running in the
FOREGROUND completed with completion state SUCCESS at
05:44:53 PM. (SESSION: 14, PROCESS: 2)
06/08/2017 17:44:53 ANR0405I Session 14 ended for administrator TSMINST1
(Linux x86-64). (SESSION: 14)
 
Need to look at db2diag.log for entries during that period.
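
To narrow db2diag.log down to that period, something like the following could be used. It is a rough sketch under assumptions: it expects each db2diag.log entry to begin with a timestamp of the form `YYYY-MM-DD-hh.mm.ss.nnnnnn+zzz` at the start of a line, the helper name `entries_between` is made up, and the window (17:40 to 17:50 on 06/08/2017) is simply chosen around the 17:43:05 errors in the actlog above. The log path is passed on the command line because it depends on the instance.

```python
import re
import sys
from datetime import datetime

# Assumed entry header: "2017-06-08-17.43.05.123456-300 ..." at start of line.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2})\.\d{6}[+-]\d{3}\s")

def entries_between(lines, start, end):
    """Yield whole log entries whose header timestamp is within [start, end]."""
    current, keep = [], False
    for line in lines:
        m = HEADER.match(line)
        if m:
            if current:
                yield "".join(current)     # flush the previous kept entry
            current = []
            ts = datetime.strptime(m.group(1), "%Y-%m-%d-%H.%M.%S")
            keep = start <= ts <= end
        if keep:
            current.append(line)
    if current:
        yield "".join(current)

if __name__ == "__main__" and len(sys.argv) == 2:
    window = (datetime(2017, 6, 8, 17, 40), datetime(2017, 6, 8, 17, 50))
    with open(sys.argv[1], errors="replace") as fh:
        for entry in entries_between(fh, *window):
            print(entry)
```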
 