Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores?
From: Roger Deschner <rogerd AT UIC DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 22 Jan 2008 18:51:45 -0600
Thanks Curtis, and everyone else too. It's great to have this much
assembled expertise out there.

Preliminary indications from a quick look at the building are that we're
going to be dealing with a node count in the dozens, not hundreds.

The next step is for the IT guy of the affected department (thank
goodness not me!) to put on a moon suit and go into the burned area with
a clipboard to take inventory, escorted by the Chicago Fire Department's
hazmat unit. That's supposed to happen sometime today. He'll be
gathering a preliminary list of obviously damaged machines, and then
we'll prioritize them.

My next step then will be to calculate the ratio of active/inactive data
for them, which will help to determine if there will be much of a
benefit to EXPORT NODE ACTIVEDATA or COPY ACTIVEDATA, as opposed to MOVE
NODEDATA. If I can get a quick read as to the active/inactive data
ratio, by comparing the output of Q FILESPACE to Q OCC, then these
decisions can be made intelligently. (In the future, I may keep a spare
server image around just for DR.)
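
Concretely, the check I have in mind looks something like this from the
admin command line (the node name is made up, and the SELECT is just one
way to total up the occupancy):

      query filespace BURNEDNODE1 format=detailed
      query occupancy BURNEDNODE1
      select node_name, sum(logical_mb) from occupancy where node_name='BURNEDNODE1' group by node_name

Q FILESPACE shows roughly what the client currently holds (close to the
active data), while Q OCC shows everything the server is storing for the
node, active plus inactive versions. A node whose occupancy is several
times its filespace utilization is carrying mostly inactive data, which
MOVE NODEDATA would drag along.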

The less inactive data there is, the better MOVE NODEDATA sounds. My
strategy would be to move the data into an existing collocated
hierarchy, into its primary disk stgpool. Then I'd let normal migration
move it onto truly collocated tapes once per day as the MOVE NODEDATAs
progressed. Once there, normal client restore or GENERATE
BACKUPSET operations will go quickly.
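
Per node, that move would look something like the following (pool names
invented here; ours differ):

      move nodedata BURNEDNODE1 fromstgpool=OLD_TAPEPOOL tostgpool=COLLOC_DISK
      query process

and if I get impatient, dropping the disk pool's migration thresholds
(update stgpool COLLOC_DISK highmig=0 lowmig=0) would push it down to the
collocated tapes right away.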

A distinct disadvantage to the EXPORT strategy is that we're already in
a DR situation, so setting up a new server would take some time.
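
For completeness, the server-to-server piece would be roughly a DEFINE
SERVER on each side plus the export itself (the server name, password,
and addresses below are made up):

      define server DRTSM serverpassword=xxxxx hladdress=drtsm.example.edu lladdress=1500
      export node BURNEDNODE1,BURNEDNODE2 filedata=backupactive toserver=DRTSM

It's exactly that extra server build and definition work that I don't
have time for right now.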

I have already reduced the mount retention time to 0.
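On our setup that was just an UPDATE DEVCLASS on the tape device class
(the class name here is a stand-in for our real one):

      update devclass LTOCLASS mountretention=0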

I also may have more NFS space available than I had first thought. A
hundred gigs here and there adds up to a lot of space. Since a FILE-type
stgpool can span multiple Unix filesystems as of TSM 5.3, I can probably
cobble together significant space if I need it.
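
If it comes to that, the rough shape of it (device class name, pool name,
directories, and sizes all invented for illustration) would be:

      define devclass NFSFILE devtype=file maxcapacity=4096M mountlimit=8 directory=/nfs1/tsmfile,/nfs2/tsmfile,/nfs3/tsmfile
      define stgpool RESTORE_DISK NFSFILE maxscratch=100

and then the MOVE NODEDATAs could land in RESTORE_DISK. (Defined with
POOLTYPE=ACTIVEDATA instead, the same sort of FILE pool could serve as
the active-data pool discussed below.)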

Roger Deschner           ACCC Basic Systems Group         rogerd AT uic DOT edu



On Tue, 22 Jan 2008, Curtis Preston wrote:

>Are files that are no longer active automatically expired from the
>activedata pool when you perform the latest COPY ACTIVEDATA?  This would
>mean that, at some point, you would need to do reclamation on this pool,
>right?
>
>It would seem to me that this would be a much better answer to the OP's
>question.  Instead of doing a MOVE NODEDATA (which requires moving ALL of
>the node's files), or doing an EXPORT NODE (which requires a separate
>server), he can just create an ACTIVEDATA pool, then perform a COPY
>ACTIVEDATA into it while he's preparing for the restore.  Putting said
>pool on disk would be even better, of course.
>
>I was just discussing this with another one of our TSM experts, and he's
>not as bullish on it as I am.  (It was an off-list convo, so I'll let
>him go nameless unless he wants to speak up.)  He doesn't like that you
>can't use a DISK type device class (disk has to be listed as FILE type).
>
>He also has issues with the resources needed to create this "3rd copy"
>of the data.  He said, "Most customers have trouble getting backups
>complete and creating their offsite copies in a 24 hour period and would
>not be able to complete a third copy of the data."  Add to that the
>possibility of doing reclamation on this pool and you've got even more
>work to do.
>
>He's more of a fan of group collocation and the multisession restore
>feature.  I think this has more value if you're restoring fewer clients
>than you have tape drives.  Because if you collocate all your active
>files, then you'll only be using one tape drive per client.  If you've
>got 40 clients to restore and 20 tape drives, I don't see this slowing
>you down.  But if you've got one client to restore, and 20 tape drives,
>then the multisession restore would probably be faster than a collocated
>restore.
>
>I still think it's a strong feature whose value should be investigated
>and discussed -- even if you only use it for the purpose we're
>discussing here.  If you know you're in a DR scenario and you're going
>to be restoring multiple systems, why wouldn't you create an ACTIVEDATA
>pool and do a COPY ACTIVEDATA instead of a MOVE NODEDATA?
>
>OK, here's another question.  Is it assumed that the ACTIVEDATA pool
>has node-level collocation on?  Can you use group collocation instead?
>Then maybe my friend and I could both get what we want?
>
>Just throwing thoughts out there.
>
>---
>W. Curtis Preston
>Backup Blog @ www.backupcentral.com
>VP Data Protection, GlassHouse Technologies
>
>-----Original Message-----
>From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
>Maria Ilieva
>Sent: Tuesday, January 22, 2008 10:22 AM
>To: ADSM-L AT VM.MARIST DOT EDU
>Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores?
>
>The procedure for creating active-data pools (assuming you have TSM
>version 5.4 or later) is the following:
>1. Create a FILE-type (disk) or sequential tape storage pool, specifying
>POOLTYPE=ACTIVEDATA
>2. Update the node's domain(s), specifying ACTIVEDESTINATION=<created
>active data pool>
>3. Issue COPY ACTIVEDATA <primary pool> <active data pool>
>This process incrementally copies the node's active data, so it can be
>restarted if needed. HSM-migrated and archived data is not copied to
>the active-data pool!
>
>Maria Ilieva
>
>> ---
>> W. Curtis Preston
>> Backup Blog @ www.backupcentral.com
>> VP Data Protection, GlassHouse Technologies
>>
>> -----Original Message-----
>> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
>> James R Owen
>> Sent: Tuesday, January 22, 2008 9:32 AM
>> To: ADSM-L AT VM.MARIST DOT EDU
>> Subject: Re: [ADSM-L] Fw: DISASTER: How to do a LOT of restores?
>>
>>
>> Roger,
>> You certainly want to get a "best guess" list of likely priority#1
>> restores.
>> If your tapes really are mostly uncollocated, you will probably experience
>> lots of tape volume contention when you attempt to use MAXPRocess > 1 or to
>> run multiple simultaneous restore, move nodedata, or export node operations.
>>
>> Use Query NODEData to see how many tapes might have to be read for each
>> node to be restored.
>>
>> To minimize tape mounts, if you can wait for this operation to complete,
>> I believe you should try to move or export all of the nodes' data in a
>> single operation.
>>
>> Here are possible disadvantages with using MOVe NODEData:
>>   - does not enable you to select to move only the Active backups for these
>>     nodes [so you might have to move lots of extra inactive backups]
>>   - you probably can not effectively use MAXPROC=N (N>1) nor run multiple
>>     simultaneous MOVe NODEData commands because of contention for your
>>     uncollocated volumes.
>>
>> If you have or can set up another TSM server, you could do a
>> Server-Server EXPort:
>>         EXPort Node node1,node2,... FILEData=BACKUPActive TOServer=... [Preview=Yes]
>> moving only the nodes' active backups to a diskpool on the other TSM
>> server.  Using this technique, you can move only the minimal necessary
>> data.  I don't see any way to multithread or run multiple simultaneous
>> commands to read more than one tape at a time, but given your drive
>> constraints and uncollocated volumes, you will probably discover that you
>> can not effectively restore, move, or export from more than one tape at a
>> time, no matter which technique you try.  Your Query NODEData output
>> should show you which nodes, if any, do *not* have backups on the same
>> tapes.
>>
>> Try running a preview EXPort Node command for single or multiple nodes to
>> get some idea of what tapes will be mounted and how much data you will
>> need to export.
>>
>> Call me if you want to talk about any of this.
>> --
>> Jim.Owen AT Yale DOT Edu   (w#203.432.6693, Verizon c#203.494.9201)
>>
>> Roger Deschner wrote:
>> > MOVE NODEDATA looks like it is going to be the key. I will simply move
>> > the affected nodes into a disk storage pool, or into our existing
>> > collocated tape storage pool. I presume it should be possible to restart
>> > MOVE NODEDATA, in case it has to be interrupted or if the server
>> > crashes, because what it does is not very different from migration or
>> > reclamation. This should be a big advantage over GENERATE BACKUPSET,
>> > which is not even as restartable as a common client restore. A possible
>> > strategy is to do the long, laborious, but restartable, MOVE NODEDATA
>> > first, and then do a very quick, painless, regular client restore or
>> > GENERATE BACKUPSET.
>> >
>> > Thanks to all! Until now, I was not fully aware of MOVE NODEDATA.
>> >
>> > B.T.W. It is an automatic tape library, Quantum P7000. We graduated
>> > from manual tape mounting back in 1999.
>> >
>> > Roger Deschner      University of Illinois at Chicago      rogerd AT uic DOT edu
>> >
>> >
>> > On Tue, 22 Jan 2008, Nicholas Cassimatis wrote:
>> >
>> >> Roger,
>> >>
>> >> If you know which nodes are to be restored, or at least have some that
>> >> are good suspects, you might want to run some "move nodedata" commands
>> >> to try to get their data more contiguous.  If you can get some of that
>> >> DASD that's coming "real soon," even just to borrow it, that would help
>> >> out tremendously.
>> >>
>> >> You say "tape" but never "library" - are you on manual drives?  (Please
>> >> say No, please say No...)  Try setting the mount retention high on them,
>> >> and kick off a few restores at once.  You may get lucky and already have
>> >> the needed tape mounted, saving you a few mounts.  If that's not working
>> >> (it's impossible to predict which way it will go), drop the mount
>> >> retention to 0 so the tape ejects immediately, so the drive is ready for
>> >> a new tape sooner.  And if you are, try to recruit the people who
>> >> haven't approved spending for the upgrades to be the "picker arm" for
>> >> you - I did that to an account manager on a DR Test once, and we got the
>> >> library approved the next day.
>> >>
>> >> The thoughts of your fellow TSMers are with you.
>> >>
>> >> Nick Cassimatis
>> >>
>> >> ----- Forwarded by Nicholas Cassimatis/Raleigh/IBM on 01/22/2008 08:08 AM -----
>> >>
>> >> "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU> wrote on
>> >> 01/22/2008 03:40:07 AM:
>> >>
>> >>> We like to talk about disaster preparedness, and one just happened
>> >>> here at UIC.
>> >>>
>> >>> On Saturday morning, a fire damaged portions of the UIC College of
>> >>> Pharmacy Building. It affected several laboratories and offices. The
>> >>> Chicago Fire Department, wearing hazmat moon suits due to the highly
>> >>> dangerous contents of the laboratories, put it out efficiently in
>> >>> about 15 minutes. The temperature was around 0F (-18C), which
>> >>> compounded the problems - anything that took on water became a block
>> >>> of ice. Fortunately nobody was hurt; only a few people were in the
>> >>> building on a Saturday morning, and they all got out safely.
>> >>>
>> >>> Now, both the good news and the bad news is that many of the damaged
>> >>> computers were backed up to our large TSM system. The good news is
>> >>> that their data can be restored.
>> >>>
>> >>> The bad news is that their data can be restored. And so now it must be.
>> >>>
>> >>> Our TSM system is currently an old-school tape-based setup from the
>> >>> ADSM days. (Upgrades involving a lot more disk coming real soon!) Most
>> >>> of the nodes affected are not collocated, so I have to plan to do a
>> >>> number of full restores of nodes whose data is scattered across
>> >>> numerous tape volumes each. There are only 8 tape drives, and they are
>> >>> kept busy since this system is in a heavily-loaded, about-to-be-upgraded
>> >>> state. (Timing couldn't be worse; Murphy's Law.)
>> >>>
>> >>> TSM was recently upgraded to version 5.5.0.0. It runs on AIX 5.3 with
>> >>> a SCSI library. Since it is a v5.5 server, there may be new facilities
>> >>> available that I'm not aware of yet.
>> >>>
>> >>> I have the luxury of a little bit of time in advance. The hazmat guys
>> >>> aren't letting anyone in to assess damage yet, so we don't know which
>> >>> client node computers are damaged or not. We should know in a day or
>> >>> two, so in the meantime I'm running as much reclamation as possible.
>> >>>
>> >>> Given that this is our situation, how can I best optimize these
>> >>> restores? I'm looking for ideas to get the most restoration done for
>> >>> this disaster, while still continuing normal client-backup, migration,
>> >>> expiration, reclamation cycles, because somebody else unrelated to this
>> >>> situation could also need to restore...
>> >>>
>> >>> Roger Deschner      University of Illinois at Chicago      rogerd AT uic DOT edu
>>
>