[ADSM-L] DISASTER: How to do a LOT of restores?

2008-01-22 03:40:43
Subject: [ADSM-L] DISASTER: How to do a LOT of restores?
From: Roger Deschner <rogerd AT UIC DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 22 Jan 2008 02:40:07 -0600
We like to talk about disaster preparedness, and one just happened here
at UIC.

On Saturday morning, a fire damaged portions of the UIC College of
Pharmacy Building. It affected several laboratories and offices. The
Chicago Fire Department, wearing hazmat moon suits due to the highly
dangerous contents of the laboratories, put it out efficiently in about
15 minutes. The temperature was around 0F (-18C), which compounded the
problems - anything that took on water became a block of ice.
Fortunately nobody was hurt; only a few people were in the building on a
Saturday morning, and they all got out safely.

Now, both the good news and the bad news is that many of the damaged
computers were backed up to our large TSM system. The good news is that
their data can be restored.

The bad news is that their data can be restored. And so now it must be.

Our TSM system is currently an old-school tape-based setup from the ADSM
days. (Upgrades involving a lot more disk coming real soon!) Most of the
nodes affected are not collocated, so I have to plan to do a number of
full restores of nodes whose data is scattered across numerous tape
volumes each. There are only 8 tape drives, and they are kept busy since
this system is in a heavily-loaded, about-to-be-upgraded state. (Timing
couldn't be worse; Murphy's Law.)
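
To make the scattering problem concrete, here is a rough planning sketch,
not a TSM facility. It assumes a node-to-tape-volume mapping has already
been gathered from the server by some means. The node names, volume
labels, and the function plan_restores are all hypothetical. It orders
restores so the nodes needing the fewest tape mounts go first, flags
volumes that more than one damaged node needs (two restores cannot read
one tape at the same time), and groups the nodes into waves sized to the
number of drives that can be spared for restores.

```python
from collections import Counter


def plan_restores(node_volumes, drives_for_restore):
    """Return (order, contended, waves) for a set of pending node restores.

    node_volumes: dict of node name -> set of tape volume labels (hypothetical data).
    drives_for_restore: how many drives can be set aside for restores at once.
    """
    if drives_for_restore < 1:
        raise ValueError("need at least one drive for restores")

    # Fewest volumes first: those restores finish with the fewest tape mounts.
    order = sorted(node_volumes, key=lambda n: (len(node_volumes[n]), n))

    # Volumes needed by more than one node will serialize concurrent restores.
    sharing = Counter(v for vols in node_volumes.values() for v in set(vols))
    contended = sorted(v for v, count in sharing.items() if count > 1)

    # One restore per drive in each wave.
    waves = [order[i:i + drives_for_restore]
             for i in range(0, len(order), drives_for_restore)]
    return order, contended, waves


# Hypothetical example data, not taken from the real server.
example = {
    "PHARM_LAB1": {"A00001", "A00002", "A00003"},
    "PHARM_LAB2": {"A00002"},
    "PHARM_OFFICE1": {"A00004", "A00005"},
}
order, contended, waves = plan_restores(example, drives_for_restore=2)
print(order)
print(contended)
print(waves)
```

With only 8 drives that are already busy, the drives_for_restore value is
the knob: the smaller it is, the more waves there are, but the less the
restores compete with normal backup and migration work.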

TSM was recently upgraded to version 5.5.0.0. It runs on AIX 5.3 with a
SCSI library. Since it is a v5.5 server, there may be new facilities
available that I'm not aware of yet.

I have the luxury of a little bit of time in advance. The hazmat guys
aren't letting anyone in to assess damage yet, so we don't know which
client node computers are damaged. We should know in a day or
two, so in the meantime I'm running as much reclamation as possible.

Given that this is our situation, how can I best optimize these
restores? I'm looking for ideas to get the most restoration done for
this disaster, while still continuing the normal client backup,
migration, expiration, and reclamation cycles, because somebody else
unrelated to this situation could also need to restore...

Roger Deschner      University of Illinois at Chicago     rogerd AT uic DOT edu