ADSM-L

Re: Recovery log utilization does not drop after DB backup (+ pin ned logtail)

2001-06-04 12:16:43
Subject: Re: Recovery log utilization does not drop after DB backup (+ pin ned logtail)
From: Jeff Bach <jdbach AT WAL-MART DOT COM>
Date: Mon, 4 Jun 2001 11:17:37 -0500
Why write one?

STOP:
LOGPCT='dsmadmc -id=${ID1} -password=${PASS1} q log | tail -4| grep ","| awk
'{print $8}''
if [ $LOGPCT -ge 40 ]
then
echo $ID2
echo $PASS2

EXPPID='dsmadmc -id=$ID2 -password=$PASS2 q pro| grep Expiration| awk
'{print $1}''
echo $EXPPID
CHECK='dsmadmc -id=$ID2 -password=$PASS2 q pro|grep $EXPPID | wc -l'
echo $CHECK
if [ $CHECK -eq 1 ]
then
dsmadmc -id=${ID2} -password=${PASS2} cancel pro $EXPPID
fi

fi
START:
LOGPCT='dsmadmc -id=${ID1} -password=${PASS1} q log | tail -4| grep ","| awk
'{print $8}''
if [ $LOGPCT -le 30 ]
then

EXPPID='dsmadmc -id=$ID2 -password=$PASS2 q pro| grep Expiration| awk
'{print $1}''
CHECK='dsmadmc -id=$ID2 -password=$PASS2 q pro|grep $EXPPID | wc -l'
if [ $CHECK -eq 0 ]
then
#dsmadmc -id=${ID2} -password=${PASS2} expire inventory

fi
fi



Jeff Bach
Home Office Open Systems Engineering
Wal-Mart Stores, Inc.

WAL-MART CONFIDENTIAL


        -----Original Message-----
        From:   Paul Zarnowski [SMTP:vkm AT CORNELLC.CIT.CORNELL DOT EDU]
        Sent:   Monday, June 04, 2001 10:03 AM
        To:     ADSM-L AT VM.MARIST DOT EDU
        Subject:        Re: Recovery log utilization does not drop after DB
backup (+ pinned logtail)

        Sheelagh, et al,

        Here is my understanding of how the log works (in
logmode=rollforward,
        anyway).  My understanding is based on some conversations I've had
with a
        couple of ADSM developers, and also a very good session at the last
        SHARE.  I think the session will be repeated at the upcoming SHARE
in
        Minneapolis.

        - The log is implemented as a circular queue, with a "head" pointer
and a
        "tail" pointer.  New entries are written at the "head", advancing it
around
        the circle.  So, the head and the tail chase each other.  An empty
log
        starts out with the head and tail at the same place.  As the head
advances
        around the circle, the tail stays put until a DBB is run.  If the
head
        advances completely around the circle, catching up to the tail, then
the
        log is 100% utilized (and the server will crash).  With
logmode=normal, the
        tail also advances as the oldest running transactions are completed.
With
        logmode=rollforward, the tail only advances with a DBB.  In no case
will
        the tail ever advance past a log record associated with an
"in-flight"
        transaction.  Thus, the oldest in-flight transaction is considered
to be
        the one that is pinning the tail.

        - Log records are used to protect database integrity during
transaction
        updates.  A transaction which has started, but has not yet
completed, is
        considered "in-flight".  If the server crashes, the log entries are
used to
        ensure database integrity for any transactions that were in-flight
when the
        crash occurred (either by completing the transaction, or by backing
it out
        - I'm not sure which).

        - A "transaction" is not a session.  Rather, it is a logical update
to the
        database, which will likely consist of multiple individual database
record
        updates.  The "transaction" must happen atomically.  That is, it
either all
        happens or none of it happens.  I.e., if the server crashes mid-way
through
        the transaction, the partial updates must be backed out.

        - For backup sessions, I believe that the server "movebatchsize" and
        "movesizethresh" options are used to define how many backup files
are
        grouped together into a logical transaction.  The larger these
values are,
        I believe, the longer that these transactions will stay in-flight.

        So, when does the log appear to fill up?  When the oldest in-flight
        transaction stays around long enough that head starts to catch up to
        it.  This can happen for a variety of reasons.  We have seen it
happen when
        there is a combination of (1) a slow or long-running session, and
(2)
        elevated DB update activity, usually inventory expiration.  The slow
        session could be a backup happening over a dial-up line.  Or, it
could be a
        client database backup of a very large database that takes several
hours
        (or days) to complete.

        What do we do about this?

        1. We have defined a 5GB log, the (almost) largest possible.  It's
        important to not define it at the maximum possible, as you want to
leave
        room to extend it in case your server crashes due to a full log.

        2. We try to schedule our inventory expiration to not overlap with
the
        majority of our backup sessions.

        3. We have admin schedules that temporarily lower DBB LOGFULL to
like 10%
        about 2 hours prior to the evening's backups.  This clears out the
log in
        preparation for the nightly backups, to increase the odds of our
making it
        through the night without a problem.

        4. We have a lower DBB LOGFULL during the night than we do during
the day.

        5. We have a monitor that checks the log utilization every 5 minutes
(among
        other things).  If the log utilization gets above 82%, the monitor
sends
        pager alerts to the TSM oncall staff.  The oncall staff then logs on
to the
        server to try to identify what has the log pinned and cancels it.
We also
        immediately cancel an inventory expiration if it is in progress, as
that is
        the most likely thing to make the log utilization skyrocket quickly.
This
        used to work pretty well on our old server, but now that we have
upgraded
        to an RS6000/M80, the log can fill up MUCH more quickly during
inventory
        expiration.  To address this, we will probably automate the killing
of the
        expiration process in this situation.

        6. There are two server options, throughputtimethreshold and
        throughputdatathreshold, that were added to the server recently to
detect
        and automatically kill slow or hung sessions.  We have been
experimenting
        with this, but have found that it sometimes kills sessions that we
don't
        think should be killed.  I think the sessions in question might be
ones
        that have been waiting a long time for a tape to be mounted.

        7. I have also heard that an upcoming release of TSM will have a
larger
        maximum log size limit.

        Hope this helps.  As I said before, the above explanation is my
        understanding of how TSM works, which may differ from how it
actually
        works!   ;-)

        ..Paul

        At 11:41 AM 6/4/2001 +0100, Sheelagh Treweek wrote:
        >Paul,
        >
        >I'm not sure that I fully understand this issue regarding a thread
having
        >the recovery log pinned at all.  When we had the last crisis as a
result
        >of the log being pinned I was not on hand at the time of the server
crash
        >and so it was difficult afterwards to piece together what might
possibly
        >have happened to cause it.  There was evidence of a long running
session
        >and I concluded it may have been implicated - but could not
actually prove
        >that.  I now get quite nervous when I spot long running backups
(passing
        >over the 'next DBB INCR') and have tried to observe what happens to
the
        >recovery log utilisation then.
        >
        >Last week I noticed such a session and curiously when the DBB INCR
finished
        >(as part of our daily houskeeping BKSTG, MIG, DBB) the recovery log
reset
        >to 0.1%.  Now as the session was still active it would still be
writing to
        >the recovery log so 0.1% seemed reasonable.  But the point is, it
did reset.
        >So I conclude that there may be some additional factor involved
when the
        >log gets pinned?  Maybe some error condition that is not handled
properly?
        >
        >[We run in roll-forward mode.]
        >
        >We first had this problem in 1998 and we don't seem to have made
very much
        >progress in a resolution!  Last time I had a PMR on this it was
suggested
        >I put in a design change request (so it must be working as designed
now).
        >
        >Regards, Sheelagh
        >--
        >
        > >X-Sender: psz1 AT postoffice3.mail.cornell DOT edu (Unverified)
        > >Mime-Version: 1.0
        > >Date: Thu, 31 May 2001 12:19:33 -0400
        > >From: Paul Zarnowski <vkm AT CORNELLC.CIT.CORNELL DOT EDU>
        > >Subject: Re: Recovery log utilization does not drop after DB
backup
        > >To: ADSM-L AT VM.MARIST DOT EDU
        > >
        > >We run into this problem a lot.  I believe there are a couple of
issues
        > >here.  One issue, that Gerhard mentioned, is that the log
utilization does
        > >not drop quickly when the db backup apparently finishes.  The
other issue
        > >is that a thread can have the log "pinned", preventing the log
utilization
        > >from dropping.  The thread can be a session or a process.  A
session
        > >backing up a single large object, or a smaller one over a slower
speed
        > >network, can cause this problem.  Also, a process (or session)
doing tape
        > >I/O which has gone into error recovery (which can take hours) can
also
        > >cause this problem.  I'm sure there are other situations which
can cause a
        > >pinned log as well.  These two problems can lead to several
operational
        > >problems.  In addition to the log filling up, if you have
triggered db
        > >backups, they can keep triggering in a loop, quickly using up all
the tapes
        > >allocated for db backups (if you are using tapes).
        > >
        >
        >< ... snipped ...>
        >
        >Sheelagh Treweek
        >Oxford University Computing Services
        >Email: sheelagh.treweek AT oucs.ox.ac DOT uk
        >Phone: +44 (0)1865 273205 Fax:-273275

>+---------------------------------------------------------------------+
        >|  http://tsm-symposium.oucs.ox.ac.uk/  OXFORD 20/21 September 2001
|
        >|  REQUIREMENTS
http://tsm-symposium.oucs.ox.ac.uk/requirements.html |

>+---------------------------------------------------------------------+


**********************************************************************
This email and any files transmitted with it are confidential
and intended solely for the individual or entity to
whom they are addressed.  If you have received this email
in error destroy it immediately.
**********************************************************************
<Prev in Thread] Current Thread [Next in Thread>