ADSM-L

Re: Slow/serial database reads?

From: Sergio Fuentes <sfuentes AT UMD DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 17 Oct 2006 12:23:36 -0400
I'll also expand my explanation and give some context.

Server specs are:

4 CPUs, 8GB RAM, 64-bit AIX 5.2 ML 6, TSM 5.2.8 for AIX, and the database on SSA
36GB 15K disk drives.  The DB is striped in the LVM, mirrored in TSM, and currently
at 80GB.

Though our performance was never stellar with the bufpoolsize at 2GB, there were
no indications of any slowdown for quite some time.  Then one day backup sessions
started hanging and slowing down, with 15-hour backup windows to back up 4GB for
a few of our clients.

We ran all sorts of nmon traces on the server during the slowdown and noticed no
resource problems.  RAM was fine, disk fine, CPU was 'ok'.  We did notice that
CPU utilization was consistently at 25% throughout our backup windows.  Stgpool
performance was fine too.  Cache hit was at 99%.  There was nothing that hinted
at a problem with the DB performance, except for the performance problems.

My observations with this problem are (and this is from my experience, not
anything IBM has said or verified):

- Check your CPU utilization.  You should see variable utilization throughout
your backup window.  If for some reason CPU is pinned (or floored) at a constant
percentage, a ton of processing is going into what I assume are DB page searches
(probably in the bufpool).  On our 4-CPU box, a flat 25% meant one CPU was fully
busy while the rest sat idle.  Our nmon outputs look much better now that the
bufpoolsize is lower.  When I say pinned, I refer to utilization by the
application (not IO wait, kernel instructions, or idling).

- nmon traces and the nmon_analyzer are really good at illustrating your overall
performance.  topas or top is also useful for active monitoring, as is vmstat.
During our slowdown our DB disks were averaging 30-40 TPS each; post-slowdown we
now see 90-120 TPS per disk.
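As a concrete (if contrived) illustration of reading those per-disk figures, here is a small Python pass over sample iostat-style lines.  The disk names and TPS values below are invented to mirror the 30-40 TPS (slow) vs 90-120 TPS (healthy) ranges described above, not taken from any real system:

```python
# Flag disks whose transfers-per-second (TPS) fall below a threshold.
# SAMPLE DATA ONLY: shaped like `iostat -d` output, values invented to
# mirror the slow vs healthy TPS ranges discussed in this thread.
SAMPLE = """\
hdisk4 35.2
hdisk5 38.9
hdisk6 110.4
hdisk7 95.1
"""

def slow_disks(report, threshold=60.0):
    """Return (disk, tps) pairs running below the threshold."""
    flagged = []
    for line in report.splitlines():
        disk, tps = line.split()
        if float(tps) < threshold:
            flagged.append((disk, float(tps)))
    return flagged

for disk, tps in slow_disks(SAMPLE):
    print("%s: %.1f tps, below threshold" % (disk, tps))
```

On a live system you would feed it real iostat or nmon figures instead of the sample string; the point is just that a sustained per-disk TPS well below what the hardware can deliver is worth flagging.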

- There were many sessions with long IDLEW (> 30 minutes), but depending on your
system you may not see this (it was also highly variable on our system).

- show lock and show resqueue showed a tremendous number of locks and waiters,
respectively.

- In retrospect, I would have started with a low bufpoolsize and turned selftune
on from the start.
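On a 5.x server that advice maps to two options in dsmserv.opt.  The 512MB figure below is just a conservative illustrative starting point of mine, not something recommended anywhere in this thread:

```
* dsmserv.opt -- start modest and let the server grow the pool as needed
BUFPOOLSIZE 524288          * value is in KB; 524288 KB = 512 MB
SELFTUNEBUFPOOLSIZE YES     * let TSM adjust the pool from the cache hit ratio
```

With selftune on, the server resizes the pool based on observed cache hit statistics, rather than you guessing a large static number up front.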

Apparently, with TSM shifting to a true DB2 back-end, this sort of performance
problem with the buffer pool should no longer be an issue.  Until then, I think
many more people will experience this limitation, as real memory sizes are
increasing on servers and the performance tuning guides are outdated, misleading
some into allocating too much to the bufpool.

Hope my comments help and clarify my initial position.

Sergio



Matthew Glanville wrote:
Here's my explanation on having too high a "bufpoolsize".

I had this problem on TSM version 5.2+ on Solaris 9 (64 bit) on a server
with 32 GB of memory.
"bufpoolsize" was set to 20 GB, and backup performance was horrible
compared to a previous server with only 4 GB of memory and slower network,
disk, and CPUs.
After calling support, searching around, and looking at performance stats,
I realized the problem was the large "bufpoolsize".

As we all know, TSM documentation recommends the 'database cache hit ratio' be
> 98% to ensure fast performance and reduced disk I/O for the database.
The documentation also indicates that too high a size can cause performance
problems if it is set higher than physical memory and the server starts to
'swap' or page out memory to disk.

But there's another, less obvious performance issue with too high a
"bufpoolsize" -- even if you have plenty of memory on the server, and even if
the hit ratio is 100% and the whole database fits in physical memory.

This performance problem is probably due to calculations or searches through
that bufpool memory for the page being requested.
If that takes longer than it would to read the same data from disk, performance
is impacted, and the more database activity there is (expiration, million+ file
backups), the worse it gets.
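None of us can see inside dsmserv, so the mechanism above is a hypothesis, but the shape of it is easy to illustrate.  The sketch below is plain Python with a purely hypothetical "buffer pool" (nothing to do with TSM's actual data structures): a linear scan for a page gets slower as the pool grows, while a hash lookup stays essentially flat:

```python
import time

def build_pool(n_pages):
    """A toy buffer pool: a list of (page id, payload) pairs, plus a dict index."""
    pages = [(i, "page-%d" % i) for i in range(n_pages)]
    return pages, dict(pages)

def linear_find(pages, page_id):
    # O(n): walk the pool until the requested page turns up.
    for pid, payload in pages:
        if pid == page_id:
            return payload
    return None

def timed_lookups(n_pages, lookups=200):
    """Time `lookups` worst-case fetches via scan vs via hash index."""
    pages, index = build_pool(n_pages)
    target = n_pages - 1            # last page: worst case for the scan
    t0 = time.perf_counter()
    for _ in range(lookups):
        linear_find(pages, target)
    scan = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(lookups):
        index[target]
    hashed = time.perf_counter() - t0
    return scan, hashed

for n in (10_000, 100_000):
    scan, hashed = timed_lookups(n)
    print("pool=%6d pages  scan=%.4fs  hash=%.6fs" % (n, scan, hashed))
```

If the real lookup path has any per-page cost that scales with pool size (long hash chains, LRU maintenance, lock contention), the same curve follows: the sketch only shows why a bigger cache is not automatically a faster cache.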

It may not have anything to do with 64-bit vs 32-bit, but if you have a
64-bit server, TSM can be configured to use more memory, so performance
can be degraded that much more by this issue.

I also believe that at least some of the TSM database access is 'serial'.
When I had this problem, 1 CPU (out of 8) was entirely tied up, yet the new
server looked fairly idle even though backups were taking longer.
Now that we have a smaller "bufpool" size, more CPUs are being used and
overall performance is much better.

Maybe they will come up with a better database in future TSM versions; I am
fairly sure that will be needed for TSM to keep up with large 100+ TB servers
and billions of files to back up.

Matthew Glanville




Jason Lee <english AT ANIM.DREAMWORKS DOT COM>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
10/17/2006 10:30 AM
Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>

To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Slow/serial database reads?

I actually upped the bufferpool to 2GB from 512MB to try to get
around this. My cache hit rate was ~96% before putting it up. This is
a 4GB 64-bit machine doing nothing else. There is free RAM, though not
as much as there was :-) Actually the dsmserv process is running at
about 2.3GB right now, with 1.3GB cached (filesystem).

It would be beneficial to me if someone could show me an iostat or
some such that showed that I was in fact broken. Right now I'm
assuming that the TSM database is broken, but maybe it just sucks?

BTW - I notice when looking at show threads, all the
DiskServerThreads are using the same mutex... which suggests
serialization to me.


Any thoughts?


Thanks


Jason



On Oct 17, 2006, at 7:16 AM, Richard Sims wrote:

On Oct 17, 2006, at 9:49 AM, Sergio Fuentes wrote:

With bufferpool set too high, it actually chokes server performance.

Sergio -

A statement like that needs contextual clarification: for example, whether
the environment is 32-bit or 64-bit, and whether the system has copious
memory to devote to static allocation. Certainly, in some contexts an
oversized buffer pool is known to degrade performance, but a large size can
also help where the overall architecture supports it well.

   Richard Sims



--
Jason Lee
DreamWorks Animation
(818) 695-3782
