Re: [ADSM-L] Strange behavior on DB(?) or Log(?)

On Jan 21, 2009, at 4:35 PM, Fred Johanson wrote:

We've got two machines which are identical in every way, same
hardware, same AIX, same TSM level, same options set, same storage
pools, domains, everything except one has the DB and Log on local
disk.  On this box things run very slowly: expiration may take a
week, filespace deletion creeps, and we see this as the normal
behavior


Fred -

In my experience, it's seldom the case that two machines of the same
model are identical, even if ordered at the same time.  Vendors are
fond of outfitting computers with "equivalent" components, from
multiple suppliers, during manufacturing.  It's common to see same
type, but different OEMs for memory, cards, and disks.  The cartons
containing the computers may well have come from different
manufacturing periods or different assembly plants.  There can easily
be variations in configuration methods for differing components; and
sometimes there are factory errors in placing jumpers on drives and
cards.  Where the OS perceives hardware elements as different,
differing device driver software may be called upon, and that software
may have differing characteristics as it evolves.

Your staff will have to perform all the usual configuration and
performance reviews involved in chasing a problem like this.  Since
this is AIX, the first thing I would look at is the Error Log for any
irregularities, and 'lscfg -v' for detailed comparison of the two
boxes.  You are fortunate to have the advantage of a comparable system
with good performance as a basis for pursuit.  Given that this is
local disk, also comparatively check attributes, as via 'lsattr -El
hdisk4', and particularly check the queue_depth value.  For disks
which AIX recognizes (those it was programmed to support), a Queue
Depth of 3 or more will be used and you will get good performance; for
disks that the OS does not recognize, it will minimize Queue Depth to
1, and performance will be poor.  (See discussions of Queue Depth on
the Web to appreciate the impact.)
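
To make the side-by-side attribute check less error-prone, something
like the following rough Python sketch could be used.  It assumes you
have saved the output of 'lsattr -El hdisk4' from each box to text, and
that the first two whitespace-separated fields on each line are the
attribute name and its value (an attribute with an empty value would be
misread, so eyeball those).  The function names and the sample values
are made up for illustration; they are not from Fred's systems.

```python
def parse_lsattr(text):
    """Map attribute name -> value from saved 'lsattr -El <disk>' output.

    Assumes each line starts with: <attribute> <value> <description...>
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def diff_attrs(good, slow):
    """Return {attribute: (good_value, slow_value)} for every mismatch."""
    names = sorted(set(good) | set(slow))
    return {
        name: (good.get(name), slow.get(name))
        for name in names
        if good.get(name) != slow.get(name)
    }


# Hypothetical sample output, for illustration only.
GOOD_SAMPLE = """\
max_transfer 0x40000 Maximum TRANSFER Size True
queue_depth  3       Queue DEPTH           False
"""
SLOW_SAMPLE = """\
max_transfer 0x40000 Maximum TRANSFER Size True
queue_depth  1       Queue DEPTH           False
"""

for name, (good, slow) in diff_attrs(
    parse_lsattr(GOOD_SAMPLE), parse_lsattr(SLOW_SAMPLE)
).items():
    print(f"{name}: good box = {good}, slow box = {slow}")
```

In practice you would read the two saved files instead of the sample
strings.  The same approach should work for other devices; the point is
simply to have every differing attribute listed rather than relying on
reading two screens of output by eye.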

Your computer implementation people may not have benchmarked or
otherwise performance-checked the systems when they came in, where one
would perform disk and memory stress testing as part of acceptance
verification before committing the boxes to production.  If not,
consider making small logical volumes on the disks now to perform such
disk performance tests.  And definitely ensure that all installed
memory and processors are online: I've seen such elements rather
quietly fail and go offline, resulting in impaired performance whose
cause is not apparent.
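
If no proper benchmarking tool is at hand, a crude sequential-write
timing can at least show whether the two boxes differ grossly.  The
following is only a minimal sketch, not a substitute for a real disk
benchmark: the function name, the default sizes, and the mount point in
the usage note are my assumptions.  It writes a temporary file in the
directory you give it, forces the data to disk with fsync, and reports
MB/s.  Run it with the same parameters on a test filesystem on each
box and compare the numbers relative to each other only.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, block_kb=256):
    """Time a sequential write of about total_mb MB into a temp file.

    The temp file is created in 'directory' and removed when done.
    Returns the observed throughput in MB per second.
    """
    block = b"\0" * (block_kb * 1024)
    count = max(1, (total_mb * 1024) // block_kb)
    written = 0
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(count):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk
        elapsed = time.perf_counter() - start
    return (written / (1024 * 1024)) / max(elapsed, 1e-9)
```

Usage would be along the lines of
sequential_write_mb_per_s("/tsmtest", total_mb=512), where /tsmtest is
a hypothetical filesystem on one of the small test logical volumes.
Use a size large enough that caching does not dominate the result, and
repeat the run a few times, as a single timing can mislead.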

One thing that I do when performance is hurting on AIX systems is an
iptrace/ipreport run, to see what network traffic is hitting the box.
A DoS attack, or unintentional equivalent, is identifiable only by
such an examination.  Inordinate network activity can greatly elevate
the interrupt rate on the system (particularly with GigE), which will
congest the system bus.  Look for and put a stop to any network access
which should not be happening.

Beyond all that, you need to perform stepwise examination of the
system elements upon which TSM, as an application, runs, including
things like I/O path contention from other system activity.
Thereafter, examination of TSM elements would be warranted.

   Richard Sims  http://people.bu.edu/rbs/