ADSM-L

Re: OS390 TSM Performance questions.

2003-02-12 12:12:14
Subject: Re: OS390 TSM Performance questions.
From: "Darby, Mark" <Mark.Darby AT HQ.DOE DOT GOV>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 12 Feb 2003 12:02:38 -0500
Hello, Al.

We have much to share.  We run OS/390 2.10 on a 7060-H50 (~120 MIPS) with
approx. 100Mbit network connectivity and have had many long-standing TSM
performance problems.  We are currently running TSM server 4.2.3.2.  Working
with TSM support, we discovered (without a technical explanation of "why")
that reducing the TSM server's region to 512M and reducing BUFPOOLSIZE to
131072 (i.e., 128MB) works for us.  We had previously tried several region
settings from 1.75G down to 960M, with the same problematic results, until
we "happened upon" the severely reduced, storage-constrained "settings" with
which we are now running (or should I say, limping).  These settings were
determined with the help of the Tivoli "performance team" in response to a
long string of performance-related PMRs.
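The BUFPOOLSIZE server option is specified in kilobytes, which is where the "131072 (i.e., 128MB)" figure comes from. As a rough illustration only, the sketch below does that conversion and shows what share of the 512M region the buffer pool occupies; the variable and function names are illustrative, not TSM settings.

```python
# BUFPOOLSIZE in the TSM server options file is given in kilobytes.
BUFPOOLSIZE_KB = 131072   # value reported in the post
REGION_MB = 512           # reduced region size reported in the post

def kb_to_mb(kb: int) -> float:
    """Convert kilobytes to megabytes (1 MB = 1024 KB)."""
    return kb / 1024

bufpool_mb = kb_to_mb(BUFPOOLSIZE_KB)
share = bufpool_mb / REGION_MB   # fraction of the region used by the pool
print(f"BUFPOOLSIZE {BUFPOOLSIZE_KB} KB = {bufpool_mb:.0f} MB, "
      f"{share:.0%} of a {REGION_MB}M region")
```

The buffer pool is only one consumer of the region, so this says nothing about why the smaller region helped; it just makes the units explicit.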

Here are some things we have discovered, and the settings that work best for
us:
1. A region over 512M causes serious and pervasive performance problems.
2. A BUFPOOLSIZE much over 131072 MAY also cause or contribute to similar
problems (and definitely doesn't help).
3. CPU utilization is VERY high for any database-intensive process.
4. Database corruption may be the root cause of our severe symptoms.  This
is purely conjecture on my part at this point, but it is supported, to some
degree, by TSM support's recommendation that we fix known DB corruption,
which, with dump/reload/audit performance being what it is, is an
impossible "hit" to take.  FYI: We plan to "move out" of the TSM server
with database corruption "into" one or more new, virgin servers as soon as
time and other factors permit.

Prior to adjusting our "settings" as described above, we were experiencing
severe, pervasive, and nearly continual performance problems (and CPU
over-utilization), server unresponsiveness, "stress-related" failures of all
sorts, and a plethora of other, unmentioned "problems".  After making "the
adjustments", we have found that, although the TSM server still frequently
gets "tangled up in its shorts", the problems are less severe, less
frequent, and less pervasive, and performance is better than when we ran
with the "larger memory footprint".  Although it is closer to acceptable, it
is still well below the performance I expect from an application running on
S/390.

We cannot even imagine why these adjustments have helped, but they have.
It is totally counterintuitive to me that reducing the memory footprint
would improve performance, but it has.

If I were you, I would call IBM/Tivoli support and start a diagnostic
regimen with them on your particular issues.  They told us that many OS/390
shops are getting far better performance, throughput, and (I presume) CPU
utilization than we are.  Further, their stated position is that some
environmental factor, unique to "us", is the root cause of our performance
issues.  Aside from our limited bandwidth and database corruption "issues",
I cannot think of any other factor that sets us apart from all the other
users of the TSM server on OS/390.

You are the first shop I have heard reporting an experience similar to ours.

Please feel free to explore this further with me off-line if you wish.

Regards,
Mark Darby
(301) 903-5229

-----Original Message-----
From: Alan Davenport [mailto:Alan.Davenport AT SELECTIVE DOT COM]
Sent: Wednesday, February 12, 2003 10:46 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: OS390 TSM Performance questions.

Hello,

        We're running TSM v5.1.5.4 on an IBM 2066-0A2 processor running
OS/390 v10.  There is a 100Mbit, single-port OSA card on the processor.  We
are backing up 197 clients per night.  MAXSCHEDSESSIONS is set to allow 116
simultaneous backup sessions.  Our backup window begins at 20:00 and ends at
07:30 the next morning.  We are seeing poor backup performance during the
window.  For example, one server that backs up in 6-7 minutes outside the
window takes hours to complete during the window.  The TSM server has a
region size of 1280M, and MPTHREADING is set to YES.  Self-tuning of buffer
pool size and transaction (TXN) size is enabled.  We back up to a 100GB disk
buffer on an EMC model 8830 drive array.  On average we back up 30-40GB per
night, with a peak of 75-80GB.
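A quick back-of-the-envelope check on the OSA question is to spread the nightly volume evenly over the 20:00-07:30 window and compare it with the nominal 100Mbit port speed. This is only a sketch: it assumes decimal gigabytes, ignores TCP/IP and protocol overhead, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

LINK_MBIT = 100  # nominal OSA port speed; protocol overhead ignored

def window_seconds(start: str, end: str) -> float:
    """Length of a backup window that may wrap past midnight."""
    fmt = "%H:%M"
    s = datetime.strptime(start, fmt)
    e = datetime.strptime(end, fmt)
    if e <= s:
        e += timedelta(days=1)  # window ends the next morning
    return (e - s).total_seconds()

def avg_mbit_per_s(gigabytes: float, seconds: float) -> float:
    """Average rate in Mbit/s, using decimal GB (1 GB = 1000**3 bytes)."""
    return gigabytes * 1000**3 * 8 / seconds / 1e6

secs = window_seconds("20:00", "07:30")
for gb in (30, 40, 80):  # typical low, typical high, and peak nightly volume
    rate = avg_mbit_per_s(gb, secs)
    print(f"{gb:>3} GB over {secs / 3600:.1f} h -> {rate:5.1f} Mbit/s "
          f"({rate / LINK_MBIT:.0%} of {LINK_MBIT} Mbit)")
```

Spread evenly, even the 80GB peak averages only about 15.5 Mbit/s. Averages hide bursts, though: if most of the 197 clients start near 20:00, the single port could still be saturated early in the window.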

        I know there are much larger shops out there, also running OS/390,
backing up many more servers.  What I would like to know is: in large shops,
what is your OSA configuration?  Are you running multi-port OSAs and/or
gigabit cards?  For comparison, I would also like to know how many clients
you back up per night.  Where do you think the bottleneck is?  Have you seen
similar problems, and what did you do to alleviate them?  I am fairly
confident that TSM is not CPU-constrained during the window.  We recently
moved TSM to a higher service class, with little effect on the problem.  Do
you think we are saturating the OSA card?
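One way to see how a 6-7 minute backup could stretch to hours is to model the single 100Mbit port as split evenly among active sessions. This is an idealized sketch, not a measurement: it assumes the link is the only bottleneck and that every session gets an equal share. The client's standalone rate (SOLO_MBIT) is a hypothetical value, since the post gives only the duration.

```python
# Idealized model: every active session gets an equal slice of one link,
# and the link (not CPU, disk, or the TSM database) is the bottleneck.
LINK_MBIT = 100        # single 100Mbit OSA port from the post (nominal)
MAX_SESSIONS = 116     # simultaneous sessions allowed per the post

# Hypothetical: the client's rate when backing up alone outside the window.
SOLO_MINUTES = 6.5     # midpoint of the reported 6-7 minutes
SOLO_MBIT = 50.0       # assumed, not measured

def fair_share_mbit(link_mbit: float, sessions: int) -> float:
    """Per-session bandwidth if the link is split evenly."""
    return link_mbit / sessions

def stretched_minutes(solo_minutes: float, solo_mbit: float,
                      share_mbit: float) -> float:
    """Time to move the same data at the reduced rate (never faster than solo)."""
    rate = min(solo_mbit, share_mbit)
    return solo_minutes * solo_mbit / rate

for active in (10, 58, MAX_SESSIONS):
    s = fair_share_mbit(LINK_MBIT, active)
    t = stretched_minutes(SOLO_MINUTES, SOLO_MBIT, s)
    print(f"{active:>3} active sessions: {s:6.2f} Mbit/s each, "
          f"~{t / 60:.1f} h instead of {SOLO_MINUTES} min")
```

Real sessions neither start together nor share evenly, and disk, database, and CPU contention also play a part, so this only indicates how much link sharing alone could stretch a backup.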

        Any thoughts and suggestions would be greatly appreciated.

          Take care,
               Al

Alan Davenport
Senior Storage Administrator
Selective Insurance Co. of America
alan.davenport AT selective DOT com
(973) 948-1306