ADSM-L

Re: Recovery log utilization does not drop after DB backup (+ pinned logtail)

2001-06-05 12:21:29
From: Paul Zarnowski <vkm AT CORNELLC.CIT.CORNELL DOT EDU>
Date: Tue, 5 Jun 2001 12:22:34 -0400
At 11:46 AM 6/5/2001 +0100, Sheelagh Treweek wrote:
What I think you are saying is that for the log (in roll-forward)
to fill implies that the head has circled round to catch up the
tail because a transaction is still uncompleted?  Some of the
examples you cite are very clear and plausible : a large DB backup,
a backup over a slow dial-up line, perhaps with retries.  I'm less
clear about the inventory expiration impact - are you saying that
because it does a much higher level of activity/updates that it
is excluding (or slowing) another activity; or, are you saying that
when running a very long expire inventory that the process itself
can get into the situation of a pinned logtail?

The impact that inventory expiration has is not that it pins the tail, but
that it advances the head more quickly.  For logmode=roll-forward, if you
have run a DBB, and the tail is pinned, the log utilization won't drop and
expiration will cause the utilization to continue increasing rapidly.
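
The head/tail behaviour described above can be sketched with a toy model.
This is only an illustration under simplifying assumptions, not how the TSM
server is actually implemented: the class RecoveryLog, its methods, and the
"records" unit are all hypothetical, and log positions are kept as ever-
increasing counters instead of wrapping around a real circular buffer.

```python
class RecoveryLog:
    """Toy model of a roll-forward recovery log (hypothetical, not TSM code)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total log space, in arbitrary "records"
        self.head = 0              # position where the next record is written
        self.last_backup = 0       # head position at the latest DB backup
        self.open_txns = {}        # transaction id -> position where it began

    @property
    def tail(self):
        # The tail cannot pass the oldest uncommitted transaction (the pin),
        # and in roll-forward mode it cannot pass the last DB backup either.
        oldest_open = min(self.open_txns.values(), default=self.head)
        return min(self.last_backup, oldest_open)

    def utilization(self):
        return (self.head - self.tail) / self.capacity

    def begin(self, txn_id):
        self.open_txns[txn_id] = self.head

    def commit(self, txn_id):
        del self.open_txns[txn_id]

    def write(self, records):
        # Any activity (backups, expiration, ...) advances the head.
        if self.head - self.tail + records > self.capacity:
            raise RuntimeError("recovery log full")
        self.head += records

    def db_backup(self):
        self.last_backup = self.head


log = RecoveryLog(capacity=100)
log.begin("slow-dialup-session")   # this transaction pins the tail at 0
log.write(10)
log.db_backup()                    # utilization stays at 10%: tail is pinned
log.write(60)                      # expiration advances the head quickly
pinned = log.utilization()         # 70% and still climbing
log.commit("slow-dialup-session")  # pin released; tail moves to last backup
log.db_backup()
after = log.utilization()          # drops to 0% once nothing pins the tail
```

In this model expiration never pins anything; it only makes the head reach
a pinned tail sooner.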

>What do we do about this?
>

I know (from conversations we have had in the past) that this problem
has impacted your site much more than here and that the services you
provide are broadly on similar scale and scope to here at Oxford.

The principal difference I observe is that we have split our services
onto separate TSM server instances as we have scaled up, whereas I
believe Cornell have retained a single server?

[I know you talked last year of splitting the service up.]

We also have data going to disk pools first so rarely have sessions
waiting for tape mount - and presumably a restore/retrieve/recall
does not take a transaction slot in the log (although there may be
a date update after the data has been given back) ?

I think you are right in that file recoveries should not pin the log.  Most
of our data does go to disk pools first, but we do have some large backups
that go straight to tape, because they are too large for us to accommodate
on disk first.  Sometimes these sessions appear to pin the log, but there
are times when disk-based sessions pin it also (i.e., backups over dialup
lines).


Maybe what this has given us is less potential conflict of sessions/
activities/processes and maybe a little less vulnerability in this
particular aspect.

That makes sense to me.


The lengths you detail that you have had to go to in order to avoid
this problem are quite extraordinary and unacceptably complex.

[As was having to split our service because of other scaling issues.]

I can see that a larger recovery log would alleviate the situation but
this may be a problem that could potentially impact more sites in the
future as they move to server-free backups ... when the TSM server
becomes much more of a database engine?

While I'm not sure exactly what situations will exacerbate this problem, we
noticed that it got worse when we upgraded our server.  The reason for this
is that as the server gets faster, certain activities can happen more
rapidly, such as expiration and most (but not all) backup/archive
sessions.  However, the "problem" activities which can pin the log are not
markedly improved by a faster server.  E.g., slow dialup sessions don't get
any faster; some large client database backup sessions do not get any
faster if they are limited by the client system or the network; etc.  Since
we were already maxed out with a 5GB recovery log, when we upgraded our
server the log just filled up faster.  As servers get faster and faster,
and as users back up more and more files, this problem will get worse
unless some relief is provided.  Increasing the max log size will provide
some relief, and I hope it's enough, but I think your observation about
things getting worse with server-free backups is right on target.  I think
the solution lies either in enhanced logging (multiple logs?), or in
addressing the situations that can cause the log to be pinned by devising
some way to automatically detect such a situation and remedy it.
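
The arithmetic behind "the log just filled up faster" can be made concrete
with a rough sketch. The 5GB log size is from the text above; the write
rates, the 8-hour pinning session, and the 10GB log are made-up numbers
chosen only for illustration, and the function names are hypothetical.

```python
def hours_to_fill(log_size_gb, log_write_rate_gb_per_hour):
    """Time for the head to reach a pinned tail, starting from an empty log."""
    return log_size_gb / log_write_rate_gb_per_hour


def log_survives(log_size_gb, log_write_rate_gb_per_hour, pin_hours):
    """True if the pinning session ends before the log fills."""
    return hours_to_fill(log_size_gb, log_write_rate_gb_per_hour) > pin_hours


PIN_HOURS = 8.0   # assumed: a slow session that a faster server cannot speed up

# Assumed rates: the upgraded server writes log records twice as fast.
old_server = log_survives(5.0, 0.5, PIN_HOURS)    # 10 h to fill, outlasts the pin
new_server = log_survives(5.0, 1.0, PIN_HOURS)    # 5 h to fill, log fills first
bigger_log = log_survives(10.0, 1.0, PIN_HOURS)   # 10 h to fill again
```

A larger log buys back the lost time, but only until the write rate grows
again, while the duration of the pinning session stays the same.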

It also seems to me that a client session that 'dies' but doesn't
disconnect from the server is a potential threat?  A server option
(something like CONSIDERDEADIFINACTIVEFOR {60mins}), rather than the
throughput{data|time}threshold options (which weren't invented for
this reason, I believe, and which I have had little experience
with), might alleviate one cause.  The same might be said
of a rogue process.  Although these situations are relatively rare,
they do happen.

I'm not sure if we've seen this.  If the session "dies", won't the idle
timeout terminate it?



--
Paul Zarnowski                         Ph: 607-255-4757
747 Rhodes Hall, Cornell University    Fx: 607-255-8521
Ithaca, NY 14853-3801                  Em: psz1 AT cornell DOT edu