ADSM-L

Re: Recovery log utilization does not drop after DB backup (+ pinned logtail)

2001-06-04 11:02:00
Subject: Re: Recovery log utilization does not drop after DB backup (+ pinned logtail)
From: Paul Zarnowski <vkm AT CORNELLC.CIT.CORNELL DOT EDU>
Date: Mon, 4 Jun 2001 11:02:57 -0400
Sheelagh, et al,

Here is my understanding of how the log works (in logmode=rollforward,
anyway).  My understanding is based on some conversations I've had with a
couple of ADSM developers, and also a very good session at the last
SHARE.  I think the session will be repeated at the upcoming SHARE in
Minneapolis.

The log is implemented as a circular queue, with a "head" pointer and a
"tail" pointer.  New entries are written at the "head", advancing it around
the circle.  So, the head and the tail chase each other.  An empty log
starts out with the head and tail at the same place.  As the head advances
around the circle, the tail stays put until a DBB is run.  If the head
advances completely around the circle, catching up to the tail, then the
log is 100% utilized (and the server will crash).  With logmode=normal, the
tail also advances as the oldest running transactions are completed.  With
logmode=rollforward, the tail only advances with a DBB.  In no case will
the tail ever advance past a log record associated with an "in-flight"
transaction.  Thus, the oldest in-flight transaction is considered to be
the one that is pinning the tail.

Log records are used to protect database integrity during transaction
updates.  A transaction which has started, but has not yet completed, is
considered "in-flight".  If the server crashes, the log entries are used to
ensure database integrity for any transactions that were in-flight when the
crash occurred (either by completing the transaction, or by backing it out
I'm not sure which).

A "transaction" is not a session.  Rather, it is a logical update to the
database, which will likely consist of multiple individual database record
updates.  The "transaction" must happen atomically.  That is, it either all
happens or none of it happens.  I.e., if the server crashes mid-way through
the transaction, the partial updates must be backed out.

For backup sessions, I believe that the server "movebatchsize" and
"movesizethresh" options are used to define how many backup files are
grouped together into a logical transaction.  The larger these values are,
I believe, the longer that these transactions will stay in-flight.

So, when does the log appear to fill up?  When the oldest in-flight
transaction stays around long enough that head starts to catch up to
it.  This can happen for a variety of reasons.  We have seen it happen when
there is a combination of (1) a slow or long-running session, and (2)
elevated DB update activity, usually inventory expiration.  The slow
session could be a backup happening over a dial-up line.  Or, it could be a
client database backup of a very large database that takes several hours
(or days) to complete.

What do we do about this?

1. We have defined a 5GB log, the (almost) largest possible.  It's
important to not define it at the maximum possible, as you want to leave
room to extend it in case your server crashes due to a full log.

2. We try to schedule our inventory expiration to not overlap with the
majority of our backup sessions.

3. We have admin schedules that temporarily lower DBB LOGFULL to like 10%
about 2 hours prior to the evening's backups.  This clears out the log in
preparation for the nightly backups, to increase the odds of our making it
through the night without a problem.

4. We have a lower DBB LOGFULL during the night than we do during the day.

5. We have a monitor that checks the log utilization every 5 minutes (among
other things).  If the log utilization gets above 82%, the monitor sends
pager alerts to the TSM oncall staff.  The oncall staff then logs on to the
server to try to identify what has the log pinned and cancels it.  We also
immediately cancel an inventory expiration if it is in progress, as that is
the most likely thing to make the log utilization skyrocket quickly.  This
used to work pretty well on our old server, but now that we have upgraded
to an RS6000/M80, the log can fill up MUCH more quickly during inventory
expiration.  To address this, we will probably automate the killing of the
expiration process in this situation.

6. There are two server options, throughputtimethreshold and
throughputdatathreshold, that were added to the server recently to detect
and automatically kill slow or hung sessions.  We have been experimenting
with this, but have found that it sometimes kills sessions that we don't
think should be killed.  I think the sessions in question might be ones
that have been waiting a long time for a tape to be mounted.

7. I have also heard that an upcoming release of TSM will have a larger
maximum log size limit.

Hope this helps.  As I said before, the above explanation is my
understanding of how TSM works, which may differ from how it actually
works!   ;-)

..Paul

At 11:41 AM 6/4/2001 +0100, Sheelagh Treweek wrote:
Paul,

I'm not sure that I fully understand this issue regarding a thread having
the recovery log pinned at all.  When we had the last crisis as a result
of the log being pinned I was not on hand at the time of the server crash
and so it was difficult afterwards to piece together what might possibly
have happened to cause it.  There was evidence of a long running session
and I concluded it may have been implicated - but could not actually prove
that.  I now get quite nervous when I spot long running backups (passing
over the 'next DBB INCR') and have tried to observe what happens to the
recovery log utilisation then.

Last week I noticed such a session and curiously when the DBB INCR finished
(as part of our daily houskeeping BKSTG, MIG, DBB) the recovery log reset
to 0.1%.  Now as the session was still active it would still be writing to
the recovery log so 0.1% seemed reasonable.  But the point is, it did reset.
So I conclude that there may be some additional factor involved when the
log gets pinned?  Maybe some error condition that is not handled properly?

[We run in roll-forward mode.]

We first had this problem in 1998 and we don't seem to have made very much
progress in a resolution!  Last time I had a PMR on this it was suggested
I put in a design change request (so it must be working as designed now).

Regards, Sheelagh
--

>X-Sender: psz1 AT postoffice3.mail.cornell DOT edu (Unverified)
>Mime-Version: 1.0
>Date: Thu, 31 May 2001 12:19:33 -0400
>From: Paul Zarnowski <vkm AT CORNELLC.CIT.CORNELL DOT EDU>
>Subject: Re: Recovery log utilization does not drop after DB backup
>To: ADSM-L AT VM.MARIST DOT EDU
>
>We run into this problem a lot.  I believe there are a couple of issues
>here.  One issue, that Gerhard mentioned, is that the log utilization does
>not drop quickly when the db backup apparently finishes.  The other issue
>is that a thread can have the log "pinned", preventing the log utilization
>from dropping.  The thread can be a session or a process.  A session
>backing up a single large object, or a smaller one over a slower speed
>network, can cause this problem.  Also, a process (or session) doing tape
>I/O which has gone into error recovery (which can take hours) can also
>cause this problem.  I'm sure there are other situations which can cause a
>pinned log as well.  These two problems can lead to several operational
>problems.  In addition to the log filling up, if you have triggered db
>backups, they can keep triggering in a loop, quickly using up all the tapes
>allocated for db backups (if you are using tapes).
>

< ... snipped ...>

Sheelagh Treweek
Oxford University Computing Services
Email: sheelagh.treweek AT oucs.ox.ac DOT uk
Phone: +44 (0)1865 273205 Fax:-273275
+---------------------------------------------------------------------+
|  http://tsm-symposium.oucs.ox.ac.uk/  OXFORD 20/21 September 2001  |
|  REQUIREMENTS http://tsm-symposium.oucs.ox.ac.uk/requirements.html |
+---------------------------------------------------------------------+