TSM 6.2.1.0 on SLES11, DB2 on Ext3 performance?

okaldhol

Hi.

We recently migrated from TSM 5.5 on a Windows server to 6.2.1.0 on a new x3650M2 running SLES11 x64.

We are having some performance issues with the DB: the db2sysc process seems to completely saturate the I/O on the disks. This doesn't seem right, the disks being 15K 146GB SAS drives, so I suspect we need to tweak the attributes on the file system. I have tried the noatime mount option, but it doesn't seem to help much.

We have no issues with the file pools, they are using XFS.

The server is running at quite a high load; we typically see load values of around 10-14, even when TSM is more or less idle.

Anyone have any experience with this, or any ideas?
 
How long has the server been running 6.2? And how long have you had these issues?
 
We're running on RHEL5 x86_64 with ext3 - admittedly DB2 9.5 (TSM 6.1) rather than 9.7 for 6.2 - no unusual I/O load. Load values typically <1. A lot of threads, of course.

Is your I/O load happening during db2 table reorgs?
 
Hogmaster: it's been running 6.2.1.0 for about two months, and it's had these issues right from the start.

TonyB: Not sure, I'm not experienced with DB2. I have opened a case with IBM support to see if they can figure out what's going on; I will report back if they find a solution.
 
You may want to install the "sysstat" package (Linux) and monitor your system with "sar" to try to find the bottleneck in the meantime. It's a very nifty tool, especially combined with "ksar" (a graphical data viewer). You may have a disk that is dragging you down; I have had a high-load problem before that was caused by a high I/O-wait percentage from a failing disk. What's your disk configuration? (RAID 5/10, etc.)
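In the meantime, a quick way to spot-check whether the load is I/O-bound is to sample the iowait counter from /proc/stat directly. This is only a rough stand-in until sysstat is installed (field 6 of the "cpu" line is iowait time, per proc(5)); it approximates the %iowait column that `sar -u` reports:

```shell
# Sample /proc/stat twice, one second apart, and report what share of
# total CPU time went to iowait between the two samples.
iowait_pct() {
    # $1 = sum of user+nice+system+idle+iowait+irq+softirq, $2 = iowait
    set -- $(awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $6}' /proc/stat)
    t1=$1 w1=$2
    sleep 1
    set -- $(awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $6}' /proc/stat)
    dt=$(($1 - t1))
    dw=$(($2 - w1))
    if [ "$dt" -gt 0 ]; then echo $((100 * dw / dt)); else echo 0; fi
}
echo "iowait: $(iowait_pct)%"
```

A sustained value in the tens points at the disks rather than the CPU.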
 
JeanSeb: Thanks for the tip, I'll see what I can get out of sar.

The server has eight 146GB 15K RPM SAS drives in four RAID1 volumes: one RAID1 volume each for the system, DB, log, and archive log partitions.

We have found that the problem is high I/O generated by the db2sysc process; it typically sits at 98-100% disk I/O on the DB disk, even when TSM is idle (as in: no processes running and no backup sessions).

Yesterday we also had a database crash, which IBM is working on. The three final messages were:

25.11.2010 23:26:26 ANR0171I dbiconn.c(1485): Error detected on 0:134, database in evaluation mode.
25.11.2010 23:26:26 ANR0169E An unexpected error has occurred and the TSM server is stopping.
25.11.2010 23:26:26 ANR0162W Supplemental database diagnostic information: -1:57049:-1225 ([IBM][CLI Driver] SQL1225N The request failed because an operating system process, thread, or swap space limit was reached. SQLSTATE=57049

The database definitely has some issues. I'm waiting for support to come back to me with further steps.
 
IBM support didn't come up with anything so far; they asked me to reboot the server, which I did Friday evening. It has been behaving erratically since then as well, and this morning it just crashed again.

Here are some interesting log entries from Sunday:

28.11.2010 08:59:00 ANR0159E tbrsql.c(1418): Database deadlock detected on 46:3.
28.11.2010 08:59:00 ANR0162W Supplemental database diagnostic information: -1:40001:-911 ([IBM][CLI Driver][DB2/LINUXX8664] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "2". SQLSTATE=40001

And now it just crashed again:

29.11.2010 10:30:35 ANR0171I dbiconn.c(1485): Error detected on 0:72, database in evaluation mode.
29.11.2010 10:30:35 ANR0169E An unexpected error has occurred and the TSM server is stopping.
29.11.2010 10:30:35 ANR0162W Supplemental database diagnostic information: -1:57049:-1225 ([IBM][CLI Driver] SQL1225N The request failed because an operating system process, thread, or swap space limit was reached. SQLSTATE=57049
 
Nothing particularly strange in /var/log/messages so far.

Output of "ulimit -a":

tsm:~ # ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 94997
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) 10341780
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) 11416960
file locks (-x) unlimited
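Since SQL1225N points at an operating-system process, thread, or swap limit, the values above are worth comparing against IBM's DB2 ulimit recommendations; `open files (-n) 1024` in particular is low for a DB2 instance. A sketch of what raised limits for the instance owner might look like in /etc/security/limits.conf (the user name `tsminst1` stands in for your instance owner, and the numbers are illustrative, not IBM's recommendations; check the DB2 documentation for your level):

```
# /etc/security/limits.conf fragment -- illustrative values only
tsminst1   soft   nofile   65536
tsminst1   hard   nofile   65536
tsminst1   soft   nproc    16384
tsminst1   hard   nproc    16384
```

Changes here only take effect on the instance owner's next login, so the TSM server needs a restart afterwards.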

I have also been informed that there is a new recommendation coming for those who want to run deduplication on 6.2.1.1 and upwards: the recommended minimum RAM is going to be 24GB. I currently have 12GB and have ordered 26GB more. I'm not running deduplication right now. I don't see how this would saturate the I/O on the DB disk anyway, as swap is on a different disk and the OS hasn't been using swap to any extent.
 
Geez... 24 gigs? Where did you get that info from?
It's already insane, but 24GB...
 
The TSM Release Notes for Linux x86_64 already recommend 12GB, or 16GB if deduplication is in use. According to my source, this is going to be increased to 24GB for deduplication when 6.2.1.2 is released.

As for my case: still no update from IBM; I am waiting for the RAM. The boss has also given the go-ahead to get SSDs for the database. That should speed things up a bit :)
 
Okay, it's running far better now. After we tuned it according to the documentation above and significantly increased the RAM, it's not struggling as much anymore. Basically, guys: get LOADS of RAM. We are currently running with 38GB of RAM.

This is for about 46 nodes, with a total data amount of about 10TB and a database of roughly 100GB (in other words, a rather small TSM implementation).
 
That's a shame. I upgraded to 6.x for the dedup feature, and after I convinced management to buy me a new server, they are not about to upgrade the RAM... Thanks for sharing.
 
I know this post is a little old, but I have the very same problem with TSM 7.1.1.100.

As you can see here, I have 60GiB(!) of cached memory corresponding to in-use shared memory segments.
My system is starting to swap.
Code:
[TSM02][root@tsm02 admin]# free -m
             total       used       free     shared    buffers     cached
Mem:         64297      63864        433      40929        845      59879
-/+ buffers/cache:       3139      61157
Swap:        32767        156      32611

The ipcs:
Code:
[TSM02][root@tsm02 admin]# ipcs -m

---- Shared memory segments ----
key        shmid      owner     perms  bytes      nattch  status
0x0d906b74 32768  tsminst1  667  34156016  12   
0x0d906b61 65537  tsminst1  601  103546880  6   
0x00000000 98306  tsminst1  601  268435456  7   
0x5ec76274 2099838979 tsminst2  667  34156016  11   
0x5ec76261 2099871748 tsminst2  601  103546880  6   
0x00000000 2099904517 tsminst2  601  268435456  6   
0x00000000 2101018630 tsminst2  601  131072  2   
0x00000000 2100723719 tsminst2  601  163905536  1   
0x00000000 23035912  tsminst1  601  131072  2   
0x00000000 2100756489 tsminst2  601  1334575104 1   
0x00000000 2100133898 tsminst1  601  131072  2 
...

And finally the load: a load average of 30 while db2sysc is consuming 39GiB of resident memory...

Code:
[TSM02][root@tsm02 admin]# top

top - 16:51:11 up 56 days, 23:11,  1 user,  load average: 29.01, 25.98, 27.16
Tasks: 272 total,  1 running, 271 sleeping,  0 stopped,  0 zombie
Cpu(s):  0.6%us,  0.5%sy,  0.0%ni, 92.8%id,  6.1%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  64297M total,  63863M used,  434M free,  845M buffers
Swap:  32767M total,  156M used,  32611M free,  59878M cached

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM  TIME+  COMMAND   
 4260 tsminst1  20  0 42.3g  39g  38g S  28 63.0  6452:37 db2sysc   
 3267 tsminst1  20  0  826m 388m  11m S  12  0.6  9769:55 dsmserv   
13935 root  20  0  9060 1260  816 R  2  0.0  0:00.01 top   
  1 root  20  0 10548  20  0 S  0  0.0  0:32.75 init   
  2 root  20  0  0  0  0 S  0  0.0  0:00.42 kthreadd   
  3 root  20  0  0  0  0 S  0  0.0  5:22.73 ksoftirqd/0   
  6 root  RT  0  0  0  0 S  0  0.0  0:00.09 migration/0 
...

The sar command shows that most of the time is iowait:

Code:
[TSM02][root@tsm02 admin]# sar -u 1 3
Linux 3.0.101-0.46-default (tsm02)    30/09/15    _x86_64_

16:54:36  CPU  %user  %nice  %system  %iowait  %steal  %idle
16:54:37  all  0,63  0,00  0,79  30,03  0,00  68,55
16:54:38  all  0,75  0,00  0,96  32,17  0,00  66,12
16:54:39  all  0,84  0,00  1,04  32,23  0,00  65,89
Average:  all  0,74  0,00  0,93  31,48  0,00  66,85
[TSM02][root@tsm02 admin]# sar -b 1 3
Linux 3.0.101-0.46-default (tsm02)    30/09/15    _x86_64_

16:54:44  tps  rtps  wtps  bread/s  bwrtn/s
16:54:45  1051,02  381,63  669,39  23595,92 339628,57
16:54:46  1927,55  1166,33  761,22 384159,18 384604,08
16:54:47  2850,98  2139,22  711,76 818819,61 363435,29
Average:  1955,37  1241,28  714,09 414361,07 362567,79

My system:

2 x CPU Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
64GiB DDR4-2133 Reg. ECC (4x16GiB)
For the Database: 2 x HD SATA 6Gbps 2TB 7.2K RAID1(1+1p)


My database and my usage are large; let's say millions of files backed up weekly, and over 1PiB of data.

Code:
ANS8000I Server command: 'q db f=d'

  Database Name: TSMDB1
  Total Size of File System (MB): 1,032,123
  Space Used on File System(MB): 478,020
  Space Used by Database(MB): 369,696
  Free Space Available (MB): 554,103
  Total Pages: 15,899,535
  Usable Pages: 15,896,111
  Used Pages: 15,886,531
  Free Pages: 9,580
  Buffer Pool Hit Ratio: 98,0
  Total Buffer Requests: 50,291,492,831
  Sort Overflows: 0
  Package Cache Hit Ratio: 97,4
  Last Database Reorganization: 30/09/15  09:04:01
  Full Device Class Name: LTO3C_CLASS
  Number of Database Backup Streams: 1
  Incrementals Since Last Full: 0
  Last Complete Backup Date/Time: 30/09/15  01:00:12
  Compress Database Backups: No
ANS8002I Highest return code was 0.

I know that my configuration could be better, but that does not justify 50GiB of RAM usage by db2sysc. I need to cap the db2sysc RAM usage because it's making my system unusable.

Recommendations?
 
Quick answer: use DBMEMPERCENT in TSM, as per http://www-01.ibm.com/support/docview.wss?uid=swg21444747
Long answer:
In TSM 7.1.1.100 and later, the number of table spaces has increased from 4 to 21 (?), so that the large tables occupy their own tablespaces, and the same goes for indexes.
The problem is that your DB tables are still in the old tablespace layout: you need to get scripts from IBM to export the tables to disk and then import them into the correct tablespaces. After you have done this (follow IBM support's instructions exactly), you can upgrade to TSM 7.1.3.100, which uses your memory more efficiently.
The blueprints from IBM regarding TSM 7.1.3 are way more conservative about the memory needed for the DB than the older blueprints.

Upgrade to 7.1.1.300, run an offline reorg, then call IBM and get the scripts to use the new tablespaces.
Upgrade to ISP 7.1.3.100.
(As a side bonus, you can use the container-style storage pools. NB: read and understand the limitations of this setup.)
After this: relax and smile.
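For reference, DBMEMPERCENT is a server option set in dsmserv.opt, and the server needs a restart for it to take effect. A sketch of the option line, with the percentage purely illustrative; pick a value based on your installed RAM and the guidance in the technote above:

```
* dsmserv.opt -- cap DB2 at a fixed share of system memory
* (40 is an illustrative value, not a recommendation)
DBMEMPERCENT 40
```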
 