APAR PQ01122 and more......

Date:     April 29, 1997      Time:    10:14 AM
From:     Jerry Lawson
     The Hartford Insurance Group
     (860) 547-2960           jlawson AT thehartford DOT com
Well, I have had an interesting 24 hours with ADSM, to say the least.  And to
all of you out there with MVS servers at level 12, beware...

My boss had been after me to provide some idea of tape utilization, so
yesterday I dumped down the volhist log to my PC, and started going through
it.  Much to my chagrin, I noticed that substantially more tapes were going
to the offsite pool than were being returned, even though there was a large
number of tapes being returned daily from the regular tape pool.  I poked
around some more, and found that there were approximately 1500 tapes (we use
3480 still) in the copypools in Empty status.  This surprised me, because
once a day, we issue (through the administrative schedule) the following
command:

     update vol * wherest=offsite whereacc=empty access=readwrite

This command had been changing the status of all tapes that had been
reclaimed from the offsite pool to show that they were onsite, and thus
available for scratch.  But this had not been working, since approximately
the same date that the level 12 maintenance had been applied to my server.  I
checked IBMLink, and sure enough, there was a hit - PQ01122, which states
that with level 12, the code to check wherest=offsite had been dropped!   The
net effect was that each time I entered the above command, a response was
received saying that nothing matched, and ADSM went on its merry way.
Meanwhile my cache of scratch tapes kept growing.

So, I changed my scheduled command by removing the wherest=offsite, and put
the command in for last night (runs at 8:10 PM each night).  Much to my
surprise, when I came in this morning, and checked ADSM, I found that the
command had worked, and all the tapes had been changed successfully, but ADSM
was still returning these tapes to scratch!  This process had gone on all
night - at an average of about 100 tapes an hour!  Well, I thought, at least
we are getting close to being done... but response was really poor. Just then
the performance guy called me, saying that ADSM was using 90% of an engine -
this is an Amdahl 8670, which has an engine speed rating better than 60 MIPS!
 He was getting concerned as we were getting to the time of day when our
online IMS systems would start to want their unfair share of the CPU.   So, I
decided to reset the status of these tapes to offsite with an appropriate
Update vol command.

I entered the command, and after a 5 minute wait, I received back messages
indicating that the status of approximately 150 tapes had been changed to
offsite.  Great - that should solve the problem... so I called the
performance guy to get him off my back.  Soon, however, he was calling me back:
ADSM still wanted 90% of an engine.  Some Omegamon analysis showed no IO to
speak of, and little to point to as the culprit.  Then we happened to look in
the log, and noticed that ADSM was still trying to return the tapes to
scratch - the same ones I had just updated to offsite status.  So.. We
bounced the server, and when it came back, we received a bunch of messages
indicating that the tapes ADSM wanted to scratch were now in offsite status.
There were, by last count, about 140 of them.  They should be scratched
tonight when the admin command runs again.

NOW - IBM - looks like a couple of things are not what they should be
here....

1.  Why did the server continue to process the tapes even after I changed
their status?  Obviously, there was an in-core table that was being worked
on, but if I modify the status of the tapes, ADSM should have stopped trying
to process them, or at least checked (when I entered the command) whether some
other task was processing the same volumes.

2.  Why did the scratch process take so long, and use sooooo much CPU?  We
did not see much I/O to the DB; it appears that the server was trying to
determine if the tapes were empty or something.  We also noticed that with
the exception of an hour between 4 and 5AM, there was a steady decline in the
number of tapes being processed - starting at 270 per hour at 8:00 last
night, and dropping steadily until we were doing about 45/hour.  I expect the
workload to be low at 8:00 on the machine, and the batch work to increase on
the machine over the night, but it should drop off again around 4-5, and stay
low until 8 or so..

3.  It also appears that the tapes were being returned to scratch under the
original scheduled process.  I say that because I am seeing ANR2753 messages
in my log, followed by the name of the schedule that was executed.  I then
find that all of the scheduled tasks that were supposed to happen after that
have been "missed", which implies that the admin schedules are single threaded.
Is this true?

So - if you've made it this far - and you're running level 12 on MVS, and you
do offsite pools - go check to see how many tapes you have in offsite status:

     Q vol * stg=copypoolname > offsite.txt

I then load this output into Excel, and sort it on the "access" column (F).
The cover letter for PQ01122 indicated
that the problem also occurred if you ran DRM, but there was a circumvention
mentioned there.  Since we do not have DRM, the command given did not work.
The cover letter did have a target date of May, but there was no assignment
to a PTF yet, if my memory serves me correctly.
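
For anyone who would rather not go through Excel, here is a minimal Python
sketch of the same tally.  It is only an illustration: the function name
tally_volumes and the sample lines are hypothetical, and the layout of the
query output is an assumption (one volume per line, with the status and access
values appearing as separate words).  It also assumes that no pool or volume
name is itself one of the keywords.

```python
from collections import Counter

# Keywords to look for.  The exact spelling in the query output is an
# assumption; "/" and "-" are dropped so "Read/Write" becomes "readwrite".
ACCESS_WORDS = {"readwrite", "readonly", "unavailable", "offsite", "destroyed"}
STATUS_WORDS = {"empty", "filling", "full", "pending"}


def tally_volumes(lines):
    """Count (access, status) pairs, assuming one volume per line.

    Lines that do not contain both an access word and a status word
    (headers, separators, blank lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        tokens = [t.replace("/", "").replace("-", "").lower()
                  for t in line.split()]
        access = next((t for t in tokens if t in ACCESS_WORDS), None)
        status = next((t for t in tokens if t in STATUS_WORDS), None)
        if access and status:
            counts[(access, status)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real server output.
    sample = [
        "Volume Name   Storage Pool   Status   Access",
        "VOL001        COPYPOOL       Empty    Offsite",
        "VOL002        COPYPOOL       Full     Offsite",
        "VOL003        COPYPOOL       Empty    Read/Write",
        "VOL004        COPYPOOL       Empty    Offsite",
    ]
    for (access, status), n in sorted(tally_volumes(sample).items()):
        print(access, status, n)
```

The pair to watch is offsite/empty: those are the tapes that the nightly
update command should have flipped back to readwrite.  To run it against the
real file, pass open("offsite.txt") in place of the sample list.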

Lastly, there was no indication on the cover letter if the problem was MVS
only or not.

Of course if I had some tape drives with some real capacity, this might not
have been such a big issue....

-----------------------------------------------------------------------------
-------------------------
                                         Jerry

Insanity is doing the same thing over and over..and expecting the results to
be different - Anon.