Re: [ADSM-L] Our TSM system is a mess. Suggestions? Ideas?

On 14 feb 2010, at 14:12, Dury, John C. wrote:

> We have about 500 nodes and have a backup window from 5pm until 7am. I have 
> our backup schedule setup so that about 30 nodes do incremental per hour with 
> a few exceptions. We have a 3T disk storage pool and 4 LTO4 drives in our 
> tape library. Our dbbackuptrigger is set at logfull 30% and numincrementals 
> of 4.  Our recovery log is filling up almost once per hour while backups are 
> running and not emptying fast enough before it hits 80% when all backups come 
> to a crawl until it is emptied below 80%. Sometimes the recovery log is 
> pinned  at 70% or so and another backup kicks off immediately which again 
> does not empty fast enough and the whole system goes into slowdown after the 
> recovery log is past 80%. Expiration, which used to run in a matter of about 
> 6 hours, is not completing even after running for 24 hours. Our DB is about 
> 97gig and about 74% full. The recovery log is maxed at 13gig.  I don't see 
> anything in the activity log out of the ordinary. The TSM server is AIX 
> 5.3.10.1 TL10 running on an IBM 9131-52A in a logical partition with 20 CPUs 
> configured and about 32G of RAM. The TSM DB and disk storage pools are 
> attached to a Clariion CX3-80 via 4G HBAs. I have the recovery log and TSM DB 
> set to use different HBAs than the disk or tape storage pools so the HBAs 
> aren't fighting each other. I've read the tuning and performance manual and 
> matched our settings to its suggestions with some small exceptions.
> 
> We have purchased new hardware to move the whole system to Linux and a 
> monster of a box since we want to get to TSM v6.x eventually, hopefully 
> sooner rather than later. AIX hardware and support is tremendously expensive 
> when compared to an intel based box and like a lot of people, we have a very 
> small budget for anything IT related.
>
> One of the biggest problems we are having is the recovery log filling up too 
> quickly and not emptying fast enough.  Even with a log full trigger of 30%, 
> the incremental backup won't finish before the recovery log hits 80% and with 
> the log full setting so low, we are doing TSM DB backups almost every hour 
> while clients are backing up. This really seems excessive to me.  Why would 
> an incremental backup of the TSM DB take an hour or so to run and is it 
> normal for the  recovery log to fill up so fast while backups are running?
> We even attempted to do a reorg  of the TSM DB but unfortunately it was going 
> to run for much longer than our window allowed so it had to be cancelled. I'm 
> going to try again for next weekend and hopefully talk the powers that be, 
> into a 24 hour window for the reorg. We did do a reorg years ago and the 
> performance improvements were amazing, i.e. expiration ran in less than an 
> hour. I know that is a bandaid but I have to do something until I can get to 
> version 6 when I can have a bigger recovery log and a new, more powerful 
> server in place.
> I guess I'm just not sure what to look at at this point and frankly I'm 
> exhausted. Our help desk is calling me every day, at 6am or earlier, 
> as "TSM is running slow again".
> Any suggestions on what else to look at? (Sorry for such a fragmented email. 
> I've had about 3 hours sleep at this point)


Hi John,

It looks like you may have a few nodes that are backing up much more slowly 
than the majority. A slow node keeps its transaction open for a long time, and 
that can pin the recovery log. Reducing the transaction size for those nodes 
could help, unless they are backing up just a single huge file. If you really 
need to, move these nodes off to a separate TSM instance on the same server.

Check the bufferpool: 'q db' shows the cache hit percentage, and if that 
drops, your database is hitting the disk more often. Below 98% is 
unacceptable; above 99% is recommended. You mention the type of controller, 
but not the type of disks. There is a lot to be gained either by using LUNs 
that stripe across a huge number of very fast disks or by setting up (in your 
case) about 4 to 6 dedicated RAID-1 LUNs of 15k RPM disks for the database.
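
As a rough illustration of those thresholds, here is a minimal Python sketch. 
The function name and the labels it returns are made up for this example; only 
the 98% and 99% cut-offs come from the advice above, and the percentage itself 
would be read by hand from the 'q db' output.

```python
def classify_cache_hit(pct):
    """Classify a TSM database cache hit percentage.

    Thresholds follow the rule of thumb above: below 98% is
    unacceptable, above 99% is where you want to be.
    The labels are hypothetical, chosen for this sketch.
    """
    if not 0.0 <= pct <= 100.0:
        raise ValueError("cache hit percentage must be between 0 and 100")
    if pct < 98.0:
        return "unacceptable"
    if pct > 99.0:
        return "good"
    # Between 98% and 99% inclusive: not alarming, but worth watching.
    return "marginal"


if __name__ == "__main__":
    for value in (96.5, 98.4, 99.7):
        print(f"{value:5.1f}% -> {classify_cache_hit(value)}")
```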

It sounds like you're using the log in roll-forward mode. This is of course 
the recommended setting, but it might be worsening the problem. You might want 
to think about using normal mode until you upgrade to 6.1.
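
To put numbers on why roll-forward mode hurts here: in that mode the log is 
only emptied by a database backup, so everything written while the backup runs 
has to fit between the trigger and the 80% mark. Below is a small Python 
sketch of that arithmetic, using the figures from your mail (13 GB log, 30% 
trigger, slowdown at 80%, roughly one hour per incremental DB backup). The 
function names are invented for this example, and the one-hour figure is only 
your own estimate.

```python
def log_headroom_gb(log_size_gb, trigger_pct, slowdown_pct):
    """Log space available between the DB backup trigger and the slowdown mark."""
    if slowdown_pct <= trigger_pct:
        raise ValueError("slowdown mark must be above the trigger")
    return log_size_gb * (slowdown_pct - trigger_pct) / 100.0


def max_log_growth_gb_per_hour(headroom_gb, db_backup_hours):
    """Highest sustained log growth that still fits while the DB backup runs."""
    if db_backup_hours <= 0:
        raise ValueError("DB backup duration must be positive")
    return headroom_gb / db_backup_hours


if __name__ == "__main__":
    # Figures taken from the original mail; the backup duration is an estimate.
    headroom = log_headroom_gb(log_size_gb=13, trigger_pct=30, slowdown_pct=80)
    limit = max_log_growth_gb_per_hour(headroom, db_backup_hours=1.0)
    print(f"headroom: {headroom} GB, sustainable growth: {limit} GB/hour")
```

With these figures the headroom is 6.5 GB, so client backups that generate 
more than roughly 6.5 GB of log per hour will push the log to 80% before the 
DB backup has finished.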

Btw, it sounds like you have quite a large LPAR for your TSM server, much 
larger than needed. With a database of this size, I'd guess that 2 to 4 CPUs 
and 4 GB of RAM should be plenty. Do you run other applications on your TSM 
LPAR?

-- 
Met vriendelijke groeten/Kind Regards,

Remco Post
r.post AT plcs DOT nl
+31 6 248 21 622