ADSM-L

Re: Longlasting tape-reclamation run

2001-03-05 06:33:47
Subject: Re: Longlasting tape-reclamation run
From: Geoff Fitzhardinge <gfitzhar AT AGL.COM DOT AU>
Date: Mon, 5 Mar 2001 16:44:13 +1100
I was interested in the recent items on slow reclamations - see mail
from Richard Sims below.

I started having problems with long-running reclamations about 18 months
ago when we converted from 800MB to 20GB cartridges. We had a three-hour
window in which the reclaim threshold was set to 70. With the small tapes,
this worked o.k., but with the large ones, reclamations began running for
many hours (not days, but certainly as much as 16 hours).  Since we only
have a small number of drives for the high-capacity cartridges, a
reclamation holding two drives for this long caused chaos for later bits
of the schedule.

I posted an item here in February last year, and also raised the issue
with Tivoli support.  Neither provided any solution, but various things
I saw along the way may still be of interest.

1. "Collocation clusters"

   I noticed that the tapes which ran for a long time had a large number
of collocation clusters, as shown by message ANR1142I issued during
reclamation.  Tapes with a small number of clusters reclaimed within an
hour or two and caused me no grief.
   I also found I could predict in advance how many clusters were on
a tape by running the classic
  "select volume_name,node_name from volumeusage where ...."
and counting the number of repetitions of each volume/node combination.
   Like Richard, I had no luck finding any useful documentation on this.
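
   The counting step can be sketched in Python. This is only an
illustration: it assumes the select output has been saved as
comma-separated "volume,node" lines, the sample rows and the function
name are made up, and it takes the repetition count as a stand-in for
the cluster count, as observed above.

```python
from collections import Counter


def count_volume_node_rows(lines):
    """Count how often each (volume, node) pair appears in the select output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blanks and anything that is not a data row
        volume, node = (field.strip() for field in line.split(",", 1))
        counts[(volume, node)] += 1
    return counts


# Made-up sample rows, not real server output
sample = [
    "000059,NOTES1",
    "000059,NOTES1",
    "000059,NOTES1",
    "000060,NETWARE1",
]

# Highest counts first: these are the volumes likely to reclaim slowly
for (volume, node), n in count_volume_node_rows(sample).most_common():
    print(volume, node, n)
```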

2. Influence of client type.

   I have clients of the following types: Novell Netware, Unix, NT, and
also NT with the Lotus Notes agent.  Since I have collocation on my
onsite tape pool, I was able to determine that the tapes causing trouble
all belonged to Notes clients.  Looking at a list of my tape pool today
(about 200 volumes), I can say that for the non-Notes clients, the
number of clusters is always less than 10.  The Notes client volumes have
HUNDREDS (highest today is 967).
   I don't know if this is something to do with the Notes agent itself,
or just a result of the fact that Notes seems to generate vast numbers of
very small documents.
   The only hope on the horizon, for me, is that this issue will go away
when the Notes servers get converted to Domino V5 and the individual email
backups get replaced by a transaction log file.  But maybe other kinds of
client with many small files will still have a problem, especially with
the way tape capacities keep increasing.

3.  Why do the reclamations run so slowly?

   I was able to determine that the process spends most of its time
waiting for data from the input tape.  In S/390 terms, every small burst
of data transfer (a few milliseconds) is followed by a Locate Block (tape
search) function which takes more like 10 SECONDS.
   With offsite reclamations, there is an additional issue with large
tapes because of the requirement to sort a potentially huge number of
database entries (see below).
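
   A rough back-of-envelope sketch of how the positioning time adds up.
It assumes one Locate Block per collocation cluster at about 10 seconds
each; the one-per-cluster ratio is an assumption, not something the
traces confirmed, and the real number of locates may well be higher.
The cluster counts are the ones mentioned elsewhere in this mail (a
typical non-Notes volume, today's worst Notes volume, and the 4248 from
the quoted message).

```python
def locate_overhead_hours(locates, seconds_per_locate=10.0):
    """Hours spent purely on tape positioning, ignoring data transfer."""
    return locates * seconds_per_locate / 3600.0


# Assumption: one locate per cluster (see lead-in)
for clusters in (10, 967, 4248):
    print(clusters, "clusters:", round(locate_overhead_hours(clusters), 1), "hours")
```

   Even under this optimistic assumption, a volume with hundreds of
clusters spends hours just searching the tape.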

4. Did Tivoli help?

   I got excellent help from the local (Australian) IBM support, but
the eventual response from the change team was that all was working as
designed and that if I wanted sensible housekeeping performance I should
ask for it through my marketing rep.  I forwarded my correspondence to
such a person but got no reply.  Regret to say I didn't follow up at the
time because by this stage I knew enough to be able to live with the
situation.  By now I assume that the response would be to say it's time to
get the Notes servers upgraded to Domino 5.  I know that, but the Notes
people seem to have plenty else to do (and besides, they like the present
ability of ADSM to recover individual emails!)

5. How do I live with it?

   (a) My main salvation was to use an external scheduling package to
cancel reclamation processes at a specific time (can't do this from
within ADSM). This allows the rest of the schedule to carry on o.k.
   (b) I learned the difference between onsite reclamations (one tape at
a time) and offsite reclamations (pick a bunch of volumes, sort database
entries to get order for mounting input tapes, then start data transfer).
   (c) With onsite reclamations, I use various manual tricks, such as
using Move Data into a disk pool, or selectively making volumes
unavailable (so a Netware tape gets reclaimed ahead of a Notes tape). I
just keep my head above water, but it is a bit of a struggle.
   (d) With offsite reclamations, the sorting process can be quite
indigestible because of the large number of files on large tapes.  I find
I have to keep tinkering with the reclaim threshold so it doesn't try to
reclaim more than three or four volumes at a time.
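
   For (a), the cancelling step might look something like the sketch
below. It assumes the process numbers and descriptions have already
been extracted from "query process" output by some other means, and
that reclamations show up with the description "Space Reclamation";
both are assumptions to check against your own server. The function
name and the sample list are made up. The sketch only builds the
"cancel process" commands; handing them to the admin client at the
cut-off time is left to the external scheduler.

```python
def cancel_commands(processes, wanted="Space Reclamation"):
    """Build 'cancel process' commands for processes matching the description."""
    return [
        f"cancel process {number}"
        for number, description in processes
        if wanted.lower() in description.lower()
    ]


# Made-up process list, not real 'query process' output
running = [(12, "Space Reclamation"), (13, "Migration"), (15, "Space Reclamation")]

for cmd in cancel_commands(running):
    print(cmd)
```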

Geoff Fitzhardinge
Australian Gas Light Company

> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU]On Behalf Of
> Richard Sims
> Sent: Wednesday, February 28, 2001 1:59 PM
> To: ADSM-L AT VM.MARIST DOT EDU
> Subject: Re: Longlasting tape-reclamation run
>
>
> >   my TSM-server 3.7.2 on AIX 4.3.3 runs a tape-reclamation  now for
> >nearly 4 days.  Why does it last so long ?
> ...
> >02/28/01   09:26:04      ANR1142I Moving data for collocation cluster 3608
> >                         of 4248 on volume 000059.
>
> Peter - What stands out in the above message is the very large number
>         for total clusters (4248).  I typically see a number like 23
> when I perform reclamations.  This is probably at the root of it.
>
> I have yet to see a good definition, though, as to what a "cluster" is in
> the TSM server context: the only manual that mentions it is the Messages
> manual, and that explains it only as "data objects".  I'm wondering if
> you have ended up with a large number of tiny files on the tape due to
> a small TSM transaction size in client-server sessions (TXNBytelimit,
> TXNGroupmax).  If your database size is modest, you can probably do a
> Select on the Contents table and check the reported FILE_SIZE to check
> what Aggregate size you are getting.  Regardless, no tape reclamation
> should take 4 days.  You may want to check with Tivoli Support as to
> what is going on.  If you do, and get some insights from them, please
> share with the List.
>
>   Richard Sims, BU