ADSM-L

Re: [ADSM-L] Our TSM system is a mess. Suggestions? Ideas?

2010-02-14 10:51:44
From: "Lamb, Charles P." <cplamb AT NPPD DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Sun, 14 Feb 2010 09:50:34 -0600
BTW, we have TSM V5.5.2.0 with an IBM 3584-L32 and three 3584-D32s.  A TSM 
system needs I/Os, I/Os, I/Os, and fast I/Os, plus a lot of disk space.

-----Original Message-----
From: Lamb, Charles P. 
Sent: Sunday, February 14, 2010 9:35 AM
To: 'ADSM-L AT VM.MARIST DOT EDU'
Subject: RE: Our TSM system is a mess. Suggestions? Ideas?

Hi...

We have a similar TSM system.  Our TSM DB (over 400GB) is only about 1/3 full, 
and we proactively run incrementals.  We have fourteen LTO3 tape drives 
directly connected using 4Gbps FC adapters.  The server is an IBM 9155-55A 
with 8-way/64GB of memory, and an IBM SVC (four nodes)/FAStT system using 
DS4800s provides about 6TB of TSM disk cache over four 4Gbps FC adapters.  
Fast disk space helps with TSM DB backups and other TSM activities.  Our 
server environment is SAP R/3 landscapes on RISCs, Intel/MS and VMware farms, 
etc. 

I would think increasing the TSM DB size and using a faster disk system would 
help.  Placing an SVC in front of the disk space helps by caching the data. 

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of 
Dury, John C.
Sent: Sunday, February 14, 2010 7:12 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: [ADSM-L] Our TSM system is a mess. Suggestions? Ideas?

We have about 500 nodes and a backup window from 5pm until 7am. I have our 
backup schedule set up so that about 30 nodes do incrementals per hour, with a 
few exceptions. We have a 3TB disk storage pool and 4 LTO4 drives in our tape 
library. Our dbbackuptrigger is set at a logfullpct of 30% and a 
numincremental of 4. Our recovery log is filling up almost once per hour while 
backups are running and not emptying fast enough before it hits 80%, at which 
point all backups slow to a crawl until it is emptied below 80%. Sometimes the 
recovery log is pinned at 70% or so and another backup kicks off immediately, 
which again does not empty the log fast enough, and the whole system goes into 
slowdown once the recovery log is past 80%. Expiration, which used to run in 
about 6 hours, is not completing even after running for 24 hours. Our DB is 
about 97GB and about 74% full. The recovery log is maxed at 13GB. I don't see 
anything out of the ordinary in the activity log. The TSM server is AIX 
5.3.10.1 TL10 running on an IBM 9131-52A in a logical partition with 20 CPUs 
configured and about 32GB of RAM. The TSM DB and disk storage pools are 
attached to a Clariion CX3-80 via 4Gb HBAs. I have the recovery log and TSM DB 
set to use different HBAs than the disk or tape storage pools so the HBAs 
aren't fighting each other. I've read the tuning and performance manual and 
matched our settings to its suggestions, with some small exceptions.
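
As a rough sanity check on the schedule described above, the window length and 
per-hour rate can be worked through in a few lines of Python. This is only a 
sketch: the function names are made up for illustration, and "about 30 nodes 
per hour" is taken at face value even though the post says there are a few 
exceptions.

```python
import math


def window_hours(start_hour, end_hour):
    """Length of a backup window in hours, allowing it to cross midnight."""
    return (end_hour - start_hour) % 24


def nodes_per_hour_needed(total_nodes, hours):
    """Smallest whole number of nodes per hour that covers every node."""
    return math.ceil(total_nodes / hours)


# Figures from the post: 5pm to 7am window, ~30 nodes per hour, ~500 nodes.
hours = window_hours(17, 7)                    # 14 hours
slots = 30 * hours                             # 420 node starts at 30/hour
needed = nodes_per_hour_needed(500, hours)     # 36 nodes/hour to cover 500

print(hours, slots, needed)
```

Under these assumptions a flat 30 nodes per hour covers about 420 nodes in the 
window, so the exceptions would have to account for the remaining 80 or so.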

We have purchased new hardware to move the whole system to Linux on a monster 
of a box, since we want to get to TSM v6.x eventually, hopefully sooner rather 
than later. AIX hardware and support are tremendously expensive compared to an 
Intel-based box and, like a lot of people, we have a very small budget for 
anything IT related.

One of the biggest problems we are having is the recovery log filling up too 
quickly and not emptying fast enough. Even with a log full trigger of 30%, the 
incremental DB backup won't finish before the recovery log hits 80%, and with 
the log full setting so low, we are doing TSM DB backups almost every hour 
while clients are backing up. This really seems excessive to me. Why would an 
incremental backup of the TSM DB take an hour or so to run, and is it normal 
for the recovery log to fill up so fast while backups are running?
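
To put a number on "filling up too quickly", here is a small Python sketch of 
the headroom between the 30% trigger and the 80% slowdown point on a 13GB 
recovery log. The function names are hypothetical, and the calculation assumes 
the log is not cleared at all until the DB backup finishes and ignores any 
pinned log space, so treat the result as a rough lower bound rather than a 
measurement.

```python
def log_headroom_gb(log_size_gb, trigger_pct, slowdown_pct):
    """Log space (GB) between the DB backup trigger and the slowdown point."""
    return log_size_gb * (slowdown_pct - trigger_pct) / 100.0


def min_fill_rate_gb_per_hour(headroom_gb, db_backup_hours):
    """Average fill rate needed to use up the headroom during one DB backup.

    Assumption: no log space is freed until the DB backup completes.
    """
    return headroom_gb / db_backup_hours


# Figures from the post: 13GB log, trigger at 30%, slowdown at 80%,
# and an incremental DB backup that takes roughly an hour.
headroom = log_headroom_gb(13, 30, 80)             # 6.5 GB
rate = min_fill_rate_gb_per_hour(headroom, 1.0)    # 6.5 GB/hour

print(headroom, rate)
```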

We even attempted to do a reorg of the TSM DB, but unfortunately it was going 
to run for much longer than our window allowed, so it had to be cancelled. I'm 
going to try again next weekend and hopefully talk the powers that be into a 
24 hour window for the reorg. We did do a reorg years ago and the performance 
improvements were amazing, i.e. expiration ran in less than an hour. I know 
that is a band-aid, but I have to do something until I can get to version 6, 
when I can have a bigger recovery log and a new, more powerful server in place.

I guess I'm just not sure what to look at at this point and frankly I'm 
exhausted. Our help desk is calling me every day, at 6am or earlier, because 
"TSM is running slow again".

Any suggestions on what else to look at? (Sorry for such a fragmented email. 
I've had about 3 hours sleep at this point.)