Hi Everyone!
Well, we just survived our first DR test.
We being Pinkerton. My name is Matthew Sparks, and I am the Sr. System
Administrator (UNIX) here.
We run AIX 4.1.4, 4.1.5, and 4.2.1 on RS/6000 J30-R50s. Our database is
Informix (yes, we can back it up and restore it with onbar, but we are not
currently doing it live), and our apps are all PeopleSoft.
Our workstations are NT 4.0.
The first test was to bring up our ADSM server (Jesse-James) and our HR
server (Frank-James): restore them using mksysb, recreate the VGs, LVs, and
FSs, then restore the database and ADSM tapes from a copy of the storage
pools. (We do have DRM.) Next we would restore the HR server (Frank-James),
then our NT Primary Domain Controller, and then the main NT server on the
network.
When I checked with DRM to see how many tapes we would be restoring from,
the current disaster storage pool contained 147 tapes. We have a 3494 tape
library with two 3590 tape drives. At the DR site, IBM was supplying us with
two 3590 tape drives and no library. No way was I going to be the
automounter for 147 tapes. So I created a new DR copy pool and copied it....
30 hours later it finished: 16 full 3590 tapes (we averaged 21 gigs per
tape).
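
For anyone sizing a similar copy-pool run, here is a rough back-of-the-envelope sketch in Python of what those figures imply. It uses only the numbers quoted above (16 tapes, about 21 GB per tape, about 30 hours); the function name and the choice of 1 GB = 1000 MB are assumptions made for illustration, not anything from the ADSM tooling.

```python
# Rough throughput implied by the DR copy pool run described above.
# Figures from the post: 16 tapes, ~21 GB per tape (average), ~30 hours.
# Assumption (not from the post): 1 GB is treated as 1000 MB.

TAPES = 16
AVG_GB_PER_TAPE = 21
COPY_HOURS = 30
MB_PER_GB = 1000  # assumed decimal units


def copy_pool_stats(tapes, avg_gb_per_tape, hours, mb_per_gb=MB_PER_GB):
    """Return (total GB, GB per hour, MB per second) for a copy run."""
    total_gb = tapes * avg_gb_per_tape
    gb_per_hour = total_gb / hours
    mb_per_second = total_gb * mb_per_gb / (hours * 3600)
    return total_gb, gb_per_hour, mb_per_second


if __name__ == "__main__":
    total, per_hour, per_sec = copy_pool_stats(TAPES, AVG_GB_PER_TAPE, COPY_HOURS)
    print(f"total copied : {total} GB")
    print(f"average rate : {per_hour:.1f} GB/hour ({per_sec:.1f} MB/s)")
```

With the quoted figures this comes to roughly 336 GB moved at a little over 11 GB per hour. The per-tape average is approximate, so treat the result as an estimate only.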
We also tar'd the /ADSM directory, which contains my DB and device config
files. I had the DBAs back up the Informix data on 8mm. We also exported a
copy of the DB onto a 3590 tape.
After making 2 copies of the mksysb tapes for each machine, 2 copies of the
/ADSM directory, 2 copies of the Informix data, and 2 copies of all the NT
OS disks and repair disks, we packed it all up and took off for Sterling
Forest, New York.
We took two planes just in case... and made sure to have the media we were
carrying hand searched as opposed to X-rayed. (P.S. It's not the X-rays, it's
the electromagnets used to make the X-rays that get you.)
After an initial tour we slept, and then started at 8:00 am.
I loaded the mksysb. Only one tape worked; luckily I had a spare.
That took an hour and a half.
I recreated the VG, LV, and FS on Frank-James.
I reloaded the /ADSM directory on Jesse-James, but for some reason ADSM
thought it was corrupted.
Then for the next 6 hours we tried to get ADSM to recognize the 3590 tape
with the DB on it. Finally we figured out I needed a new copy of the ATAPE
driver. So we tried to get it off the Internet and the IBM FTP servers.
They were all down.
Another hour went by, and someone at Sterling Forest found a copy in his
desk drawer. It wasn't the most current, but it was OK. We loaded that, and
still ADSM could not "see" the 3590 tape drive. Our IBM Recovery Services
rep was examining other similar cases, and she noticed that one group had
done it by making the 3590 drive a manual device.
So we tried that and, hooray, it sucked in the DB and started restoring
itself.
Another hour went by before we were able to get it to talk with the second
3590, which was in random mode and held 10 of the 16 data tapes.
Finally, just about 20 hours into the test, the copy pool checked in.
I updated all the other pools as destroyed and began a restore on
Frank-James.
It worked!!!!
The one thing I couldn't mark destroyed was the disaster-recovery pool (the
147 tapes), because it was a copy. (This doesn't make a whole lot of sense
to me, but oh well.) I just left it, figuring who cares....
But the disaster-recovery pool was tied to the 3494 library, so it kept
trying to find the library and process the tapes.
I was asleep by this time.
The NT guys started restoring the domain controller. They woke me up to
check in a new volume and keep restoring.
I went back to sleep. They started restoring the big network server, and it
worked up to about the last 10 meg or so (users' data), when it asked for a
new tape. They didn't notice, and by the time they did, the process had
quit.
Rather than restore again (about a three-hour job), they elected to go on.
Meanwhile the system had tried to schedule normal backups, trim files, and
expire data, and I think that caused some problems when it couldn't find
certain tapes in the library it didn't have anymore. So when we tried to
load the last drive on the main server, it came back with the message that
the data was not available...
The NT guys were able to recreate that boot partition from the disks we had
brought, so we were OK. After that I couldn't get ADSM to restore anything.
Next time I will immediately turn off all schedules, and hopefully I can
find out how to set the system not to expire anything.
At 36 hours into the test, the users were let loose on the workstations in
Costa Mesa, CA, and although the 512K link was slow, it worked and they were
able to process jobs.
Total test time: 41.5 hours.
Things I learned:
Make sure you are recovering onto like hardware.
If we had had a 3494, I would have been up in 6 hours instead of 24.
mksysb tapes are just as important as the ADSM tapes.
Have at least two copies of everything.
Turn off the backup schedules ASAP.
Make sure your drivers are up to date.
We go back next month to do it all over again..... faster.
Any Questions ???
Matthew Makaala Sparks Desk (818) 380-8712
Senior Technical Support Specialist Fax (818) 380-8677
Pinkerton Security & Investigation Services
15910 Ventura Blvd.; Suite 900
Encino, CA 91436 Ham Radio KE6GVI
email = MSparks AT Pinkertons DOT com
---------------------------------------------------------------------
Say "Plugh"... "XYZZY"