ADSM-L

Our first drp recovery of an SP system.

1997-08-20 21:57:19
From: Peter Zutenis <pzutenis AT IBM DOT NET>
Date: Thu, 21 Aug 1997 11:57:19 +1000
Hi All,

I have just done our first SP recovery offsite and thought I'd share my 
experiences with you. I have mainly 'lurked' in this list and figure it is 
'payback' time.

Our site is a three-frame SP with 21 nodes and around 1.2 TB of SSA disk. The CW
is a 42T, and all nodes run AIX 4.1.4 and PSSP 2.2. In Australia at this time no
DRP service provider has enough hardware to allow a recovery of all our nodes,
so we just recovered three nodes (one wide and two thin) with around 100 GB of
SSA. All our nodes have two internal disks, one being rootvg and the other being
altvg. (Yes, I know I should mirror rootvg to the second disk, but these nodes
were installed before that was supported on SP.)

First off, our current backup strategy uses SYSBACK to back up the Control
Workstation (CW) weekly to two 8mm tapes. I also use SYSBACK to back up the ADSM
server (on a wide node), and I take a weekly mksysb of every node to the CW.
My nodes are all different and it's too hard to use the same image for all
nodes. I do daily primary storage pool backups to an offsite copy pool.

Now to the good stuff. 

My first worry was that the CW at the hotsite was not the same as ours. We have
a 42T, the hotsite a C10. My fears were unfounded. The SYSBACK restore of the
CW (rootvg and altvg, where /spdata lives) went beautifully. I think it helped
a lot that both are Micro Channel machines and the hdisk layout was the same.

Once the control workstation was restored, the SDR reconfigured for the new
SP layout, the /etc/hosts file modified, and other SP stuff done that I won't
go into, it was time to NIM install the wide node running ADSM. I also
installed the two thin nodes at this stage.

The image restored to the wide node OK, and I used SYSBACK to recover the altvg
that contained the ADSM databases and disk storage pools. This went OK.

The SP parallel switch was reconfigured and started OK.

I restored the Device Config file and the volhistory from a backup diskette 
that is created daily after the ADSM DB backup. (I only do full backups - the 
db is only 1.2GB in size at the moment). I manually edited the device config 
file to match the devices at the DR site. 
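
For anyone who would rather script that edit than do it by hand, here is a
minimal sketch in Python. It assumes the device config file is plain text made
of DEFINE statements in which drives carry a DEVICE=<special file> parameter.
The sample lines, the device names, and the mapping below are made up for
illustration and are not taken from the real file.

```python
import re

# Hypothetical mapping from home-site device special files to DR-site ones.
DEVICE_MAP = {"/dev/rmt0": "/dev/rmt2", "/dev/rmt1": "/dev/rmt3"}


def remap_devices(devconfig_text, device_map):
    """Return the text with each DEVICE=<path> value replaced via device_map.

    Devices that are not in the mapping are left unchanged.
    """
    def swap(match):
        old = match.group(2)
        return match.group(1) + device_map.get(old, old)

    return re.sub(r"(\bDEVICE=)(\S+)", swap, devconfig_text, flags=re.IGNORECASE)


# Illustrative content only; a real device config file will differ.
sample = (
    "DEFINE DEVCLASS TAPE8MM DEVTYPE=8MM LIBRARY=LIB8MM\n"
    "DEFINE LIBRARY LIB8MM LIBTYPE=MANUAL\n"
    "DEFINE DRIVE LIB8MM DRIVE0 DEVICE=/dev/rmt0\n"
)
print(remap_devices(sample, DEVICE_MAP))
```

Keeping the mapping in one place means the same script can be reused if the DR
site hands out different device names at the next test.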

The next step was to restore the ADSM DB (I didn't trust the physical image of
the one from the SYSBACK restore). This also went smoothly.

ADSM Server was started and the primary storage pools were marked destroyed.  I 
also deleted the old drives (from home site) and added the drives for the DR 
Site. I also had to re-apply the correct license codes for the ADSM Server.

Then some trouble happened. I selected at random a file to restore to the CW 
(just a plain old text file) to test the adsm recovery and also to see if it 
would use the copy pool tapes. The ADSM server refused to restore the file and 
issued the following message: ANR0540W Retrieve or Restore failed for 
<filename> . Data integrity error detected.

I started to sweat a little at this stage. 

So I selected another file to try - same result.

I started to sweat a little more.

I started to run an auditdb - four hours later I cancelled it as I was running 
out of time.

So I thought - nothing to lose - let's restore the SAP/Oracle executables on one
of the thin nodes. I recreated the filesystems from a script that I keep with
the current filesystem layout defined, started the ADSM restore of these
filesystems, and lo and behold it worked fine.
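
A layout script of that kind can be as simple as a list of filesystems turned
into AIX crfs commands. The sketch below is in Python and is not the actual
script used here. The volume group names, mount points, and sizes are
hypothetical, and it assumes JFS filesystems with the size given in 512-byte
blocks. It only prints the commands, so they can be reviewed before anything
is run.

```python
# Hypothetical layout records: (volume group, mount point, size in 512-byte blocks).
LAYOUT = [
    ("sapvg", "/usr/sap/PRD", 2097152),
    ("oravg", "/oracle/PRD", 4194304),
]


def crfs_commands(layout):
    """Build a crfs + mount command pair per filesystem, shallowest path first."""
    cmds = []
    # Sort by path depth so a parent mount point is created before its children.
    for vg, mount_point, blocks in sorted(layout, key=lambda r: r[1].count("/")):
        cmds.append(f"crfs -v jfs -g {vg} -m {mount_point} -a size={blocks} -A yes")
        cmds.append(f"mount {mount_point}")
    return cmds


for cmd in crfs_commands(LAYOUT):
    print(cmd)
```

The value of keeping the layout as data is that it can be regenerated from the
live system regularly and stored offsite with the other recovery material.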

I started sweating less.

The next step was to restore the Oracle DB for SAP (via backint). This step
worked fine.

I stopped sweating.

One interesting problem I ran into was that symbolic links in the restored
SAP/Oracle filesystems didn't work. The symbolic links pointed nowhere.
I had to manually delete these links and redo them by hand. If you are
familiar with SAP R/3 and Oracle, then you know this was real fun (not).
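
One way to avoid redoing links by hand is to record them before the backup and
replay them after the restore. What follows is a minimal Python sketch of that
idea, not something that was used in this test. The function names are
illustrative, and it assumes the recorded mapping is kept somewhere that
survives the disaster (for example next to the filesystem layout script).

```python
import os


def record_symlinks(root):
    """Return {relative path: link target} for every symlink under root."""
    links = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists links to directories in dirnames without following them.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links[os.path.relpath(path, root)] = os.readlink(path)
    return links


def recreate_symlinks(root, links):
    """Replace missing or wrong links under root; return the paths fixed."""
    fixed = []
    for rel, target in sorted(links.items()):
        path = os.path.join(root, rel)
        if os.path.islink(path):
            if os.readlink(path) == target:
                continue  # already correct
            os.remove(path)  # link exists but points somewhere else
        elif os.path.lexists(path):
            continue  # a real file or directory is there; leave it alone
        os.makedirs(os.path.dirname(path), exist_ok=True)
        os.symlink(target, path)
        fixed.append(rel)
    return fixed
```

The mapping is a plain dictionary, so it can be written out with the json
module after each backup and read back at the recovery site.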

I discovered APAR IX70295 that describes this situation. The solution is to put
USELARGebuffers No into the dsm.sys. Wish I'd known it when I did the test, but
I guess that is what tests are for. (I haven't tried using USELARGebuffers yet.)

ADSM also didn't restore some directories that were empty. These had to be
manually created. (I discovered this when SAP wouldn't start.) Has anyone heard
of this before?
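
Until the cause is known, a workaround is to keep a list of empty directories
and recreate any that are missing after a restore. Here is a minimal Python
sketch under that assumption. The function names are illustrative, and it
recreates the directories only, not their original ownership or permissions,
which would need to be recorded separately.

```python
import os


def find_empty_dirs(root):
    """Return sorted relative paths of directories under root with no entries."""
    empty = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath != root and not dirnames and not filenames:
            empty.append(os.path.relpath(dirpath, root))
    return sorted(empty)


def recreate_dirs(root, rel_paths):
    """Create each recorded directory that is missing; return those created."""
    created = []
    for rel in rel_paths:
        path = os.path.join(root, rel)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(rel)
    return created
```

Run find_empty_dirs on the live filesystems before the backup, store the list
offsite, and run recreate_dirs against the restored filesystems before starting
the application.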

Anyway, I was able to get SAP started and a SAP GUI going. A quick check of the
system showed all was good. (Yahoo!)

I have contacted IBM re the data integrity error. Interestingly enough, I
could restore that file on our home system when I got back. Also, whilst at
the DR site I was able to restore the previous version of that file OK. Maybe
this file was being backed up by the client at the same time as the DB
backup? I don't know for sure. If so then this is bad. Just my luck to pick the
one file out of 5 million or so that won't restore. I should buy a lottery
ticket!

Anyway, the above was just a brief description of what I did. I hope people
find it interesting.


Best Regards,

Peter Zutenis   
Principal Systems Programmer
Philip Morris Information Services Ltd
Moorabbin, Australia.