ADSM-L

[ADSM-L] Serializing BACKUP STGPOOL / MOVE DRMEDIA

2014-06-03 17:13:58
Subject: [ADSM-L] Serializing BACKUP STGPOOL / MOVE DRMEDIA
From: Skylar Thompson <skylar2 AT U.WASHINGTON DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 3 Jun 2014 14:12:00 -0700
We've been suffering from the effects of this APAR for a while, which IBM
resolved with a documentation erratum rather than by fixing TSM itself:

http://www-01.ibm.com/support/docview.wss?uid=swg1IC87352

Basically the issue is that there is a race condition with running MOVE
DRMEDIA on tape volumes while BACKUP STGPOOL is also running. BACKUP
STGPOOL might choose a FILLING volume that MOVE DRMEDIA is also removing
from the library, which causes an operator request to be raised. We must
then either check the volume back in, or cancel the request, allow TSM to
mark the volume UNAVAILABLE, and then update the volume's access to OFFSITE.
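
For reference, the second recovery path (cancel the request, then mark the
volume offsite) could be scripted roughly as below. This is only a sketch:
the helper name recovery_commands is hypothetical, it only builds the
command strings and does not talk to the server, and the request number and
volume name are placeholders to be taken from the actual operator request.

```python
def recovery_commands(request_number: int, volume: str) -> list[str]:
    """Build the admin commands for the cancel-then-offsite recovery path.

    Hypothetical helper: it only returns command text. The UPDATE VOLUME
    command should be issued after TSM has marked the volume UNAVAILABLE.
    """
    if request_number < 0:
        raise ValueError("request number must be non-negative")
    vol = volume.strip().upper()
    if not vol or any(ch.isspace() for ch in vol):
        raise ValueError("volume name must be a single non-empty token")
    return [
        f"CANCEL REQUEST {request_number}",
        f"UPDATE VOLUME {vol} ACCESS=OFFSITE",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the real request number and volume.
    for command in recovery_commands(3, "a00012"):
        print(command)
```

The resulting lines would still have to be fed to an administrative client
session by hand or by a wrapper script.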

We have some challenges in our TSM environment:

1. The data ingest is highly bursty - some days we might have 100GB in
backups, while others we might have 60TB. We average around 2TB/day in
additions to primary storage.

2. We are not staffed 24x7, so we can't have operator requests going off
outside business hours.

3. We have no dedicated staff managing our TSM/tape library environment, so
we prefer not getting any operator requests since we might not be able to
act on them immediately.

4. For budget and policy reasons, we have a weekly (not daily) shipment of
tape to our offsite vault.

I've rejiggered our client schedules, admin schedules, and reclamation to
try to keep writes into the copy pools from happening while we do the
checkout during business hours, but it's quite difficult to actually
quiesce everything.

It seems like we have these options:

1. Just live with it as it is.

2. Don't run BACKUP STGPOOL on the day that the checkout will happen.

3. Automate checking for writes into copy pools and cancel the
session/process responsible for them. This might require restricting the
number of mounts in our tape device classes, and also seems like it has the
risk of being more disruptive than we really want.
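
To make option 3 a bit more concrete, here is a rough sketch of the
"find and cancel" step. It assumes the input is comma-delimited output from
a query along the lines of SELECT process_num, process FROM processes, and
that the process name reported for BACKUP STGPOOL is "Backup Storage Pool";
both assumptions should be checked against the actual server before relying
on this. The function name cancel_commands and the sample data are
hypothetical, and the code only produces CANCEL PROCESS command strings
rather than running anything.

```python
import csv
import io

# Assumption: process names (upper-cased) that write into copy pools.
COPY_POOL_WRITERS = frozenset({"BACKUP STORAGE POOL"})


def cancel_commands(select_output: str,
                    writers: frozenset = COPY_POOL_WRITERS) -> list[str]:
    """Return CANCEL PROCESS commands for processes whose name is in writers.

    select_output is expected to hold one "process_num,process" row per line.
    Blank or malformed rows are skipped rather than treated as errors.
    """
    commands = []
    for row in csv.reader(io.StringIO(select_output)):
        if len(row) < 2:
            continue
        number, name = row[0].strip(), row[1].strip()
        if number.isdigit() and name.upper() in writers:
            commands.append(f"CANCEL PROCESS {number}")
    return commands


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = "12,Backup Storage Pool\n13,Migration\n\n14,Backup Storage Pool\n"
    for command in cancel_commands(sample):
        print(command)
```

Client sessions writing via simultaneous write to a copy pool would need a
separate check; this sketch only covers server processes.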

Have I missed anything? How are other people approaching this problem?

Thanks,

--
-- Skylar Thompson (skylar2 AT u.washington DOT edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine