ADSM-L

Re: [ADSM-L] DISASTER: How to do a LOT of restores?

2008-01-27 15:00:53
Subject: Re: [ADSM-L] DISASTER: How to do a LOT of restores?
From: Roger Deschner <rogerd AT UIC DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Sun, 27 Jan 2008 13:59:47 -0600
(It's the weekend, so I'm finally coming up for air, a full week after
the fire.)

And this is also a bummer, compared to how I would have liked active-data
pools to work. This sounds like it was invented by the Department of
Complication. (Though in my DR situation, it might be useful...)

I would have wanted an active data pool to be the first storage pool in
a traditional TSM hierarchy. New backups would go ONLY into the
active-data pool. Then as files therein became inactive, they would be
migrated to the next pool in the hierarchy automatically by normal
migration. There would still be only one copy in the hierarchy, as there
is now. Backup copy storage pools would still be made about the same way
they are now - except that the need for precise timing prior to
migration would no longer be a restriction. You would only need to time
stgpool backups relative to the client backup window.

The idea in a D2D2T environment would be to have the primary Active Data
Pool on disk, and the copy and inactive pools on tape. In case of a
server disaster, all data would still have been backed up to tape, in
the copy pools. MOST restores are of active data, especially in a DR
situation like I've got. Considering that most restores of inactive data
are for small numbers of files, we could even afford to make it
non-collocated. Most full restores are of only active data (Exception:
e-discovery) - when the building burns, or the hard drive fails, I only
want my latest files back.

An active data pool could also be the second storage pool in a 3-tier
D2D2D2T hierarchy. Then the primary storage pool would be on fast disk,
as is typical, in order to receive data from today's fast NICs quickly.
Then the first migration would copy from that pool to the slower Active
Data Pool, which would be on larger, slower, SATA-type disks. Then
inactive files in the Active Data Pool would migrate to tape in a second
migration. In this 3-tier setup, copy storage pools could be written at
either tier 1 or tier 2.

But they didn't do it that way, unfortunately. (Although their way could
definitely help in a DR situation with non-collocated tape pools.)

But then again, I got to thinking, how hard would it be to do it my way?
Not very. All we'd need is a new option on normal migration to only
migrate inactive files. Everything else is there.
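
Roughly, the wished-for setup would look something like this (pool and
device class names are made up, and the INACTIVEOnly keyword is pure
wishful thinking - it doesn't exist at any TSM level I know of):

   /* hypothetical sketch only - TSM won't accept this today          */
   /* ACTIVEDISK doubles as the active-data pool and the top of the   */
   /* primary hierarchy, which current servers do not allow           */
   DEFine STGpool ACTIVEDISK FILECLASS POoltype=ACTIVEdata -
          NEXTstgpool=INACTTAPE
   DEFine STGpool INACTTAPE LTOCLASS COLlocate=No
   /* the missing piece: demote only the inactive versions downstream */
   MIGrate STGpool ACTIVEDISK LOwmig=0 INACTIVEOnly=Yes

Apart from the made-up INACTIVEOnly keyword, every piece of that is
machinery TSM already has; it's only this combination it won't take.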

The bigger immediate bummer is that, due to this APAR, I can't even use
it for my disaster recovery. The storage pool these nodes are in is
immense, about 40TB. I'd need to have about 35TB of disk to create an
Active Data Pool for it, without the ability to do it for only selected
nodes. So I'm back to my strategy of MOVE NODEDATA on all the toasted
nodes in one command, which should result in only one tape mount for
each of the 355 tapes in the storage pool, and only 1.5TB of new space
needed, which is feasible. That's the least worst of several available
bad choices, but those are the kinds of decisions you make in a
disaster.
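
(For the archives, that one big command is roughly the following - node
and pool names are placeholders, not our real ones:

   MOVe NODEData node1,node2,...,nodeN -
        FROMstgpool=BIGTAPEPOOL TOstgpool=RESTOREDISK

Naming all the toasted nodes in a single command is what should let the
server read each of those 355 tapes once, instead of once per node.)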

P.S. LOCK NODE is your friend!
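(That is, for each node whose machine burned, something like

   LOCK Node sometoastednode

so nothing can start a session as that node - say, a rebuilt machine
running an incremental against an empty disk and deactivating
everything - before the restore is done. UNLOCK Node undoes it when the
replacement machine is ready. The node name is a placeholder, of course.)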

Roger Deschner      University of Illinois at Chicago     rogerd AT uic DOT edu
               Academic Computing & Communications Center


On Tue, 22 Jan 2008, Curtis Preston wrote:

>Bummer. :( But when it's fixed, I sure think it sounds like a better
>solution to this situation than the traditional answers -- even if only
>used on demand.
>
>---
>W. Curtis Preston
>Backup Blog @ www.backupcentral.com
>VP Data Protection, GlassHouse Technologies
>
>-----Original Message-----
>From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
>James R Owen
>Sent: Tuesday, January 22, 2008 6:37 PM
>To: ADSM-L AT VM.MARIST DOT EDU
>Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores? [like
>Steve H said, but...]
>
>DR strategy using an ACTIVEdata STGpool is like Steve H said, but
>with minor additions and a major (but temporary) caveat:
>
>COPY ACTIVEdata is not quite ready for this DR strategy yet:
>
>See APAR PK59507:  COPy ACTIVEdata performance can be significantly degraded
>(until TSM 5.4.3/5.5.1) unless *all* nodes are enabled for the ACTIVEdata STGpool.
>
>http://www-1.ibm.com/support/docview.wss?rs=663&context=SSGSG7&dc=DB550&uid=swg1PK59507&loc=en_US&cs=UTF-8&lang=en&rss=ct663tivoli
>
>Here's a slightly improved description of how it should work:
>
>DEFine STGpool actvpool ... POoltype=ACTIVEdata -
>       COLlocate=[No/GRoup/NODe/FIlespace] ...
>COPy DOmain old... new...
>UPDate DOmain new... ACTIVEDESTination=actvpool
>ACTivate POlicy new... somePolicy
>Query SCHedule old... * NOde=node1,...,nodeN   [note old... sched.assoc's]
>UPDate NOde nodeX DOmain=new...                [for each node[1-N]]
>DEFine ASSOCiation new... [someSched] nodeX    [as previously associated]
>COpy ACTIVEdata oldstgpool actvpool    [for each oldstgpool w/active backups]
>
>[If no other DOmain except new... has ACTIVEDESTination=actvpool,
> the COpy ACTIVEdata command(s) will copy the Active backups from specified
> nodes node[1-N] into the ACTIVEdata STGpool actvpool to expedite DR for...]
>
>[But, not recommended until TSM 5.4.3/5.5.1 fixes APAR PK59507!]
>--
>Jim.Owen AT Yale DOT Edu   (203.432.6693)
>
>Steven Harris wrote:
>> Nick
>>
>> I may well have a flawed understanding here but....
>>
>> Set up an active-data pool
>> clone the domain containing the servers requiring recovery
>> set the ACTIVEDATAPOOL parameter on the cloned domain
>> move the servers requiring recovery to the new domain,
>> Run COPY ACTIVEDATA on the primary tape pool
>>
>> Since only the nodes we want are in the domain with the ACTIVEDATAPOOL
>> parameter specified, will not only data from those nodes be copied?
>>
>> Regards
>>
>> Steve
>>
>> Steven Harris
>> TSM Admin, Sydney Australia
>>
>> "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU> wrote on 23/01/2008
>> 11:38:17 AM:
>>
>>> For this scenario, the problem with Active Storagepools is it's a
>>> pool-to-pool relationship.  So ALL active data in a storagepool would be
>>> copied to the Active Pool.  Not knowing what percentage of the nodes on
>>> the TSM Server will be restored, but assuming they're all in one storage
>>> pool, you'd probably want to "move nodedata" them to another pool, then
>>> do the "copy activedata."  Two steps, and needs more resources.  Just
>>> doing "move nodedata" within the same pool will semi-collocate the data
>>> (See Note below).  Obviously, a DASD pool, for this circumstance, would
>>> be best, if it's available, but even cycling the data within the existing
>>> pool will have benefits.
>>>
>>> Note:  Semi-collocated, as each process will make all of the named nodes'
>>> data contiguous, even if it ends up on the same media with another node's
>>> data.  Turning on collocation before starting the jobs, and marking all
>>> filling volumes read-only, will give you separate volumes for each node,
>>> but requires a decent scratch pool to try.
>>>
>>> Nick Cassimatis
>>>
>>> ----- Forwarded by Nicholas Cassimatis/Raleigh/IBM on 01/22/2008 07:25 PM
>>> -----
>>>
>>> "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU> wrote on 01/22/2008
>>> 01:58:11 PM:
>>>
>>>> Are files that are no longer active automatically expired from the
>>>> activedata pool when you perform the latest COPY ACTIVEDATA?  This would
>>>> mean that, at some point, you would need to do reclamation on this pool,
>>>> right?
>>>>
>>>> It would seem to me that this would be a much better answer to the OP's
>>>> question.  Instead of doing a MOVE NODE (which requires moving ALL of
>>>> the node's files), or doing an EXPORT NODE (which requires a separate
>>>> server), he can just create an ACTIVEDATA pool, then perform a COPY
>>>> ACTIVEDATA into it while he's preparing for the restore.  Putting said
>>>> pool on disk would be even better, of course.
>>>>
>>>> I was just discussing this with another one of our TSM experts, and he's
>>>> not as bullish on it as I am.  (It was an off-list convo, so I'll let
>>>> him go nameless unless he wants to speak up.)  He doesn't like that you
>>>> can't use a DISK type device class (disk has to be listed as FILE type).
>>>> He also has issues with the resources needed to create this "3rd copy"
>>>> of the data.  He said, "Most customers have trouble getting backups
>>>> complete and creating their offsite copies in a 24 hour period and would
>>>> not be able to complete a third copy of the data."  Add to that the
>>>> possibility of doing reclamation on this pool and you've got even more
>>>> work to do.
>>>>
>>>> He's more of a fan of group collocation and the multisession restore
>>>> feature.  I think this has more value if you're restoring fewer clients
>>>> than you have tape drives.  Because if you collocate all your active
>>>> files, then you'll only be using one tape drive per client.  If you've
>>>> got 40 clients to restore and 20 tape drives, I don't see this slowing
>>>> you down.  But if you've got one client to restore, and 20 tape drives,
>>>> then the multisession restore would probably be faster than a collocated
>>>> restore.
>>>>
>>>> I still think it's a strong feature whose value should be investigated
>>>> and discussed -- even if you only use it for the purpose we're
>>>> discussing here.  If you know you're in a DR scenario and you're going
>>>> to be restoring multiple systems, why wouldn't you create an
>>>> ACTIVEDATA pool and do a COPY ACTIVEDATA instead of a MOVE NODE?
>>>>
>>>> OK, here's another question.  Is it assumed that the ACTIVEDATA pool
>>>> should have node-level collocation on?  Can you use group collocation
>>>> instead?  Then maybe I and my friend could both get what we want?
>>>>
>>>> Just throwing thoughts out there.
>>>>
>>>> ---
>>>> W. Curtis Preston
>>>> Backup Blog @ www.backupcentral.com
>>>> VP Data Protection, GlassHouse Technologies
>>>>
>>>> -----Original Message-----
>>>> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On
>Behalf
>> Of
>>>> Maria Ilieva
>>>> Sent: Tuesday, January 22, 2008 10:22 AM
>>>> To: ADSM-L AT VM.MARIST DOT EDU
>>>> Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores?
>>>>
>>>> The procedure of creating active data pools (assuming you have TSM
>>>> version 5.4 or more) is the following:
>>>> 1. Create FILE type DISK pool or sequential TAPE pool specifying
>>>> pooltype=ACTIVEDATA
>>>> 2. Update node's domain(s) specifying ACTIVEDESTINATION=<created active
>>>> data pool>
>>>> 3. Issue COPY ACTIVEDATA <node_name>
>>>> This process incrementally copies node's active data, so it can be
>>>> restarted if needed. HSM migrated and archived data is not copied in
>>>> the active data pool!
>>>>
>>>> Maria Ilieva
>>>>
>>>>> ---
>>>>> W. Curtis Preston
>>>>> Backup Blog @ www.backupcentral.com
>>>>> VP Data Protection, GlassHouse Technologies
>>>>>
>>>>> -----Original Message-----
>>>>> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On
>Behalf
>>>> Of
>>>>> James R Owen
>>>>> Sent: Tuesday, January 22, 2008 9:32 AM
>>>>> To: ADSM-L AT VM.MARIST DOT EDU
>>>>> Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores?
>>>>>
>>>>>
>>>>> Roger,
>>>>> You certainly want to get a "best guess" list of likely priority#1
>>>>> restores.  If your tapes really are mostly uncollocated, you will
>>>>> probably experience lots of tape volume contention when you attempt to
>>>>> use MAXPRocess > 1 or to run multiple simultaneous restore, move
>>>>> nodedata, or export node operations.
>>>>>
>>>>> Use Query NODEData to see how many tapes might have to be read for each
>>>>> node to be restored.
>>>>>
>>>>> To minimize tape mounts, if you can wait for this operation to complete,
>>>>> I believe you should try to move or export all of the nodes' data in a
>>>>> single operation.
>>>>>
>>>>> Here are possible disadvantages with using MOVe NODEData:
>>>>>   - does not enable you to select to move only the Active backups for
>>>>>     these nodes
>>>>>         [so you might have to move lots of extra inactive backups]
>>>>>   - you probably can not effectively use MAXPROC=N (>1) nor run multiple
>>>>>     simultaneous MOVe NODEData commands because of contention for your
>>>>>     uncollocated volumes.
>>>>>
>>>>> If you have or can set up another TSM server, you could do a
>>>>> Server-Server EXPort:
>>>>>         EXPort Node node1,node2,... FILEData=BACKUPActive TOServer=... [Preview=Yes]
>>>>> moving only the nodes' active backups to a diskpool on the other TSM
>>>>> server.  Using this technique, you can move only the minimal necessary
>>>>> data.  I don't see any way to multithread or run multiple simultaneous
>>>>> commands to read more than one tape at a time, but given your drive
>>>>> constraints and uncollocated volumes, you will probably discover that
>>>>> you can not effectively restore, move, or export from more than one tape
>>>>> at a time, no matter which technique you try.  Your Query NODEData
>>>>> output should show you which nodes, if any, do *not* have backups on the
>>>>> same tapes.
>>>>>
>>>>> Try running a preview EXPort Node command for single or multiple nodes
>>>>> to get some idea of what tapes will be mounted and how much data you
>>>>> will need to export.
>>>>>
>>>>> Call me if you want to talk about any of this.
>>>>> --
>>>>> Jim.Owen AT Yale DOT Edu   (w#203.432.6693, Verizon c#203.494.9201)
>>>>>
>>>>> Roger Deschner wrote:
>>>>>> MOVE NODEDATA looks like it is going to be the key. I will simply move
>>>>>> the affected nodes into a disk storage pool, or into our existing
>>>>>> collocated tape storage pool. I presume it should be possible to restart
>>>>>> MOVE NODEDATA, in case it has to be interrupted or if the server
>>>>>> crashes, because what it does is not very different from migration or
>>>>>> reclamation. This should be a big advantage over GENERATE BACKUPSET,
>>>>>> which is not even as restartable as a common client restore. A possible
>>>>>> strategy is to do the long, laborious, but restartable, MOVE NODEDATA
>>>>>> first, and then do a very quick, painless, regular client restore or
>>>>>> GENERATE BACKUPSET.
>>>>>>
>>>>>> Thanks to all! Until now, I was not fully aware of MOVE NODEDATA.
>>>>>>
>>>>>> B.T.W. It is an automatic tape library, Quantum P7000. We graduated from
>>>>>> manual tape mounting back in 1999.
>>>>>>
>>>>>> Roger Deschner      University of Illinois at Chicago
>>>>> rogerd AT uic DOT edu
>>>>>>
>>>>>> On Tue, 22 Jan 2008, Nicholas Cassimatis wrote:
>>>>>>
>>>>>>> Roger,
>>>>>>>
>>>>>>> If you know which nodes are to be restored, or at least have some
>>>>>>> that are good suspects, you might want to run some "move nodedata"
>>>>>>> commands to try to get their data more contiguous.  If you can get
>>>>>>> some of that DASD that's coming "real soon," even just to borrow it,
>>>>>>> that would help out tremendously.
>>>>>>>
>>>>>>> You say "tape" but never "library" - are you on manual drives?
>>>>>>> (Please say No, please say No...)  Try setting the mount retention
>>>>>>> high on them, and kick off a few restores at once.  You may get lucky
>>>>>>> and already have the needed tape mounted, saving you a few mounts.
>>>>>>> If that's not working (it's impossible to predict which way it will
>>>>>>> go), drop the mount retention to 0 so the tape ejects immediately, so
>>>>>>> the drive is ready for a new tape sooner.  And if you are, try to
>>>>>>> recruit the people who haven't approved spending for the upgrades to
>>>>>>> be the "picker arm" for you - I did that to an account manager on a
>>>>>>> DR Test once, and we got the library approved the next day.
>>>>>>>
>>>>>>> The thoughts of your fellow TSMers are with you.
>>>>>>>
>>>>>>> Nick Cassimatis
>>>>>>>
>>>>>>> ----- Forwarded by Nicholas Cassimatis/Raleigh/IBM on 01/22/2008
>>>>> 08:08 AM
>>>>>>> -----
>>>>>>>
>>>>>>> "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU> wrote on
>>>> 01/22/2008
>>>>>>> 03:40:07 AM:
>>>>>>>
>>>>>>>> We like to talk about disaster preparedness, and one just happened
>>>>>>>> here at UIC.
>>>>>>>>
>>>>>>>> On Saturday morning, a fire damaged portions of the UIC College of
>>>>>>>> Pharmacy Building. It affected several laboratories and offices. The
>>>>>>>> Chicago Fire Department, wearing hazmat moon suits due to the highly
>>>>>>>> dangerous contents of the laboratories, put it out efficiently in
>>>>>>>> about 15 minutes. The temperature was around 0F (-18C), which
>>>>>>>> compounded the problems - anything that took on water became a block
>>>>>>>> of ice. Fortunately nobody was hurt; only a few people were in the
>>>>>>>> building on a Saturday morning, and they all got out safely.
>>>>>>>>
>>>>>>>> Now, both the good news and the bad news is that many of the damaged
>>>>>>>> computers were backed up to our large TSM system. The good news is
>>>>>>>> that their data can be restored.
>>>>>>>>
>>>>>>>> The bad news is that their data can be restored. And so now it must
>>>>>>>> be. Our TSM system is currently an old-school tape-based setup from
>>>>>>>> the ADSM days. (Upgrades involving a lot more disk coming real soon!)
>>>>>>>> Most of the nodes affected are not collocated, so I have to plan to
>>>>>>>> do a number of full restores of nodes whose data is scattered across
>>>>>>>> numerous tape volumes each. There are only 8 tape drives, and they
>>>>>>>> are kept busy since this system is in a heavily-loaded,
>>>>>>>> about-to-be-upgraded state. (Timing couldn't be worse; Murphy's Law.)
>>>>>>>>
>>>>>>>> TSM was recently upgraded to version 5.5.0.0. It runs on AIX 5.3 with
>>>>>>>> a SCSI library. Since it is a v5.5 server, there may be new
>>>>>>>> facilities available that I'm not aware of yet.
>>>>>>>>
>>>>>>>> I have the luxury of a little bit of time in advance. The hazmat guys
>>>>>>>> aren't letting anyone in to assess damage yet, so we don't know which
>>>>>>>> client node computers are damaged or not. We should know in a day or
>>>>>>>> two, so in the meantime I'm running as much reclamation as possible.
>>>>>>>> Given that this is our situation, how can I best optimize these
>>>>>>>> restores? I'm looking for ideas to get the most restoration done for
>>>>>>>> this disaster, while still continuing normal client-backup,
>>>>>>>> migration, expiration, reclamation cycles, because somebody else
>>>>>>>> unrelated to this situation could also need to restore...
>>>>>>>>
>>>>>>>> Roger Deschner      University of Illinois at Chicago
>>>>> rogerd AT uic DOT edu
>>>>>
>
