ADSM-L

Re: [ADSM-L] DISASTER: How to do a LOT of restores?

2008-01-22 08:10:26
Subject: Re: [ADSM-L] DISASTER: How to do a LOT of restores?
From: Dominique Laflamme <dominique.laflamme.mbh7 AT STATEFARM DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 22 Jan 2008 06:09:44 -0700
I would use MOVE NODEDATA commands to move the data for the affected
nodes to a (new?) collocated pool before they start trying to do their
restores. That lets you get a lot of the tape mounts and so forth out of
the way while the clients aren't yet ready to be restored. You can pace
the work by choosing how many MOVE NODEDATA processes you run at once
and when you start them manually. It won't solve the fact
that you're running at capacity, but it will let you minimize restore
times for the victims of the fire.
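
To make the pacing idea concrete, here is a rough sketch in Python that
only builds the MOVE NODEDATA command text for a list of nodes and groups
it into batches, so you can decide how many to start at a time. The node
names and the pool names TAPEPOOL and COLLOPOOL are made up for
illustration, and the helper functions are hypothetical, not part of TSM.
It assumes the collocated target pool already exists, and that each
process may need both a source and a target tape mount, so check
MAXPROCESS against your free drives.

```python
from typing import Iterable, List


def move_nodedata_commands(
    nodes: Iterable[str],
    from_pool: str,
    to_pool: str,
    max_process: int = 1,
) -> List[str]:
    """Build one MOVE NODEDATA admin command string per node.

    Pool names are whatever your server uses; nothing here is sent to
    the server, the strings are only generated for review.
    """
    commands = []
    for node in nodes:
        name = node.strip().upper()
        if not name:
            continue  # skip blank entries in the node list
        commands.append(
            f"MOVE NODEDATA {name} FROMSTGPOOL={from_pool} "
            f"TOSTGPOOL={to_pool} MAXPROCESS={max_process}"
        )
    return commands


def in_batches(commands: List[str], batch_size: int) -> List[List[str]]:
    """Split the command list into groups to be started together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        commands[i:i + batch_size]
        for i in range(0, len(commands), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical node names; replace with the nodes known to be damaged.
    damaged = ["pharm-lab01", "pharm-lab02", "pharm-office07"]
    cmds = move_nodedata_commands(damaged, "TAPEPOOL", "COLLOPOOL")
    # Two moves at a time leaves drives free for normal backups.
    for number, batch in enumerate(in_batches(cmds, 2), start=1):
        print(f"--- batch {number} ---")
        for cmd in batch:
            print(cmd)
```

Each batch can be pasted into an administrative session, or put into a
macro, once the previous batch has finished. Keeping the batch size small
is the knob that leaves drives available for the regular workload.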

Just a thought,
Nick

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Roger Deschner
Sent: Tuesday, January 22, 2008 2:40 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: [ADSM-L] DISASTER: How to do a LOT of restores?

We like to talk about disaster preparedness, and one just happened here
at UIC.

On Saturday morning, a fire damaged portions of the UIC College of
Pharmacy Building. It affected several laboratories and offices. The
Chicago Fire Department, wearing hazmat moon suits due to the highly
dangerous contents of the laboratories, put it out efficiently in about
15 minutes. The temperature was around 0F (-18C), which compounded the
problems - anything that took on water became a block of ice.
Fortunately nobody was hurt; only a few people were in the building on a
Saturday morning, and they all got out safely.

Now, both the good news and the bad news is that many of the damaged
computers were backed up to our large TSM system. The good news is that
their data can be restored.

The bad news is that their data can be restored. And so now it must be.

Our TSM system is currently an old-school tape-based setup from the ADSM
days. (Upgrades involving a lot more disk coming real soon!) Most of the
nodes affected are not collocated, so I have to plan to do a number of
full restores of nodes whose data is scattered across numerous tape
volumes each. There are only 8 tape drives, and they are kept busy since
this system is in a heavily-loaded, about-to-be-upgraded state. (Timing
couldn't be worse; Murphy's Law.)

TSM was recently upgraded to version 5.5.0.0. It runs on AIX 5.3 with a
SCSI library. Since it is a v5.5 server, there may be new facilities
available that I'm not aware of yet.

I have the luxury of a little bit of time in advance. The hazmat guys
aren't letting anyone in to assess damage yet, so we don't know which
client node computers are damaged or not. We should know in a day or
two, so in the meantime I'm running as much reclamation as possible.

Given that this is our situation, how can I best optimize these
restores? I'm looking for ideas to get the most restoration done for
this disaster, while still continuing normal client-backup, migration,
expiration, reclamation cycles, because somebody else unrelated to this
situation could also need to restore...

Roger Deschner      University of Illinois at Chicago     rogerd AT uic DOT edu