ADSM-L

our First DR test (Long)

1997-10-17 14:48:21
Subject: our First DR test (Long)
From: MATTHEW SPARKS 05-025 <MSPARKS AT PINKERTONS DOT COM>
Date: Fri, 17 Oct 1997 11:48:21 -0700
Hi Everyone!

Well, we just survived our first DR test.
"We" being Pinkerton. My name is Matthew Sparks, and I am the Sr. System
Administrator (UNIX) here.
We run RS/6000 AIX 4.1.4, 4.1.5, and 4.2.1 on J30-R50's. Our database is
Informix (yes, we can back it up and restore it with onbar, but we are not
doing it live currently) and our apps are all PeopleSoft.
Our workstations are NT 4.0.

The first test was to bring up our ADSM server (Jesse-James) and our HR
server (Frank-James): restore them using mksysb, recreate the VGs, LVs, and
FSs, then restore the database and ADSM tapes from a copy of the storage
pools. (We do have DRM.) Next we would restore the HR server (Frank-James),
then our NT Primary Domain Controller, and then the main NT server on the
network.

When I checked with DRM to see how many tapes we would be restoring from,
the current disaster storage pool contained 147 tapes. We have a 3494 tape
library with two 3590 tape drives. At the DR site IBM was supplying us with
two 3590 tape drives and no library. No way I was going to be the
automounter for 147 tapes. So I created a new DR copy pool and copied it....
30 hours later it finished: 16 full 3590 tapes (we averaged 21 GB per
tape). We also tar'd the /ADSM directory, which contains my DB and device
config files. I had the DBAs back up the Informix data on 8mm. We also
exported a copy of the DB onto a 3590 tape.
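
For anyone sizing a similar copy, the figures above work out roughly as
follows. This is only a back-of-the-envelope sketch in Python: it assumes
1 GB = 1000 MB, that the 30 hours covers the whole copy, and that each tape
would have needed one manual mount. The function and variable names are made
up for illustration.

```python
# Rough arithmetic on the copy pool numbers quoted in the post.
TAPES = 16            # full 3590 tapes in the new DR copy pool
GB_PER_TAPE = 21      # average per tape, as reported
COPY_HOURS = 30       # elapsed time for the copy, as reported
OLD_POOL_TAPES = 147  # tapes in the original disaster pool


def copy_pool_stats(tapes, gb_per_tape, hours):
    """Return (total GB, GB per hour, MB per second) for the copy."""
    total_gb = tapes * gb_per_tape
    gb_per_hour = total_gb / hours
    # Assumption: decimal units, 1 GB = 1000 MB.
    mb_per_sec = total_gb * 1000 / (hours * 3600)
    return total_gb, gb_per_hour, mb_per_sec


total_gb, gb_per_hour, mb_per_sec = copy_pool_stats(TAPES, GB_PER_TAPE, COPY_HOURS)
print(f"total data: about {total_gb} GB")
print(f"copy rate: about {gb_per_hour:.1f} GB/hour ({mb_per_sec:.1f} MB/s)")
# Assumption: one manual mount per tape at the DR site.
print(f"manual mounts avoided: {OLD_POOL_TAPES - TAPES}")
```

So the consolidation moved about 336 GB at a little over 3 MB/s, and cut
the hand-mounting job from 147 tapes to 16.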

After making 2 copies of the mksysb tapes for each machine, 2 copies of the
/ADSM directory, 2 copies of the Informix data, and 2 copies of all the NT
OS disks and repair disks, we packed it all up and took off for Sterling
Forest, New York.
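
A packing list like this is easy to double-check before leaving. The
following is a minimal Python sketch, not something we actually ran: the
dictionary layout and the short_items helper are hypothetical, and the item
names are just the ones listed above, assuming "each machine" means the two
AIX servers.

```python
MIN_COPIES = 2  # rule of thumb: at least two copies of everything

# Items and copy counts as listed in the post; the layout is illustrative.
manifest = {
    "mksysb tape, Jesse-James": 2,
    "mksysb tape, Frank-James": 2,
    "/ADSM directory (tar)": 2,
    "Informix data (8mm)": 2,
    "NT OS disks and repair disks": 2,
}


def short_items(items, min_copies=MIN_COPIES):
    """Return the names of items that have fewer than min_copies copies."""
    return sorted(name for name, copies in items.items() if copies < min_copies)


missing = short_items(manifest)
print("OK to pack" if not missing else f"Need more copies of: {missing}")
```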

We took two planes just in case... and made sure to have the media we were
carrying hand searched as opposed to X-rayed. (P.S. It's not the X-rays,
it's the electromagnets used to make the X-rays that get you.)

After an initial tour we slept, and then started at 8:00am.
I loaded the mksysb.
Only one worked; luckily I had a spare.
That took an hour and a half.
I recreated the VG, LV, and FS on Frank-James.
I reloaded the /ADSM directory on Jesse-James, but for some reason ADSM
thought it was corrupted.
Then for the next 6 hours we tried to get ADSM to recognize the 3590 tape
with the DB on it. Finally we figured out I needed a new copy of the ATAPE
driver. So we tried to get it off the Internet and IBM ftp servers.
They were all down.
Another hour went by and someone at Sterling Forest found a copy in his
desk drawer. It wasn't the most current, but it was OK. We loaded that and
still ADSM could not "see" the 3590 tape drive. Our IBM Recovery Service
rep was examining other similar cases and she noticed that one group had
done it by making the 3590 drive a manual device.
So we tried that and, hooray, it sucked in the DB and started restoring
itself.
Another hour went by before we were able to get it to talk with the second
3590, which was in random mode and had 10 of the 16 data tapes.
Finally, just about 20 hours into the test, the copy pool checked in.
I updated all the other pools as destroyed and began a restore on
Frank-James. It worked!!!!

The one thing I couldn't mark destroyed was the disaster-recovery pool (the
147 tapes) because it was a copy pool. (This doesn't make a whole lot of
sense to me, but oh well.) I just left it, figuring who cares....
But the disaster-recovery pool was tied to the 3494 library, so it kept
trying to find the library and process the tapes.
I was asleep by this time.
The NT guys started restoring the domain controller; they woke me up to
check in a new volume and keep restoring.
I went back to sleep. They started restoring the big network server, and it
worked up to about the last 10 MB or so (users' data), when it asked for a
new tape. They didn't notice, and by the time they did the process had quit.
Rather than restore again (about a three-hour job), they elected to go on.

Meanwhile the system had tried to schedule normal backups, trim files, and
expire data, and I think that caused some problems when it couldn't find
certain tapes in the library it didn't have anymore. So when we tried to
load the last drive on the main server, it came back with the message that
data was not available...

The NT guys were able to recreate that boot partition from the disks we had
brought, so we were OK. After that I couldn't get ADSM to restore anything.
Next time I will instantly turn off all schedules, and hopefully I can find
out how to set the system not to expire anything.

At 36 hours into the test the users were let loose on the workstations in
Costa Mesa, CA, and although the 512K link was slow, it worked and they were
able to process jobs.

Total test time: 41.5 hours.

Things I learned

Make sure you are recovering onto like hardware.
If we had had a 3494, I would have been up in 6 hours instead of 24.

mksysb tapes are just as important as the ADSM tapes.

Have at least two copies of everything.

Turn off the backup schedules ASAP.

Make sure your drivers are up to date.


We go back next month to do it all over again..... faster.


Any Questions ???

Matthew Makaala Sparks                          Desk (818) 380-8712
Senior Technical Support Specialist             Fax  (818) 380-8677
Pinkerton Security & Investigation Services
15910 Ventura Blvd.; Suite 900
Encino, CA  91436                               Ham Radio KE6GVI
  email = MSparks AT Pinkertons DOT com
 ---------------------------------------------------------------------
 Say "Plugh"...                                 "XYZZY"