ADSM-L

Re: AIX Bare Metal Recovery - A cautionary tale

2006-09-14 10:07:41
Subject: Re: AIX Bare Metal Recovery - A cautionary tale
From: Ben Bullock <bbullock AT MICRON DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 14 Sep 2006 08:00:01 -0600
        Very helpful, Steven,

        We have some AIX BMR recovery tests coming up and your insight
is helpful, especially about the Atape drivers.

Thanks,
Ben


-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Steven Harris
Sent: Wednesday, September 13, 2006 10:23 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: AIX Bare Metal Recovery - A cautionary tale

Hi All,

I've been working with AIX for over 10 years now, starting with AIX
4.1.2 and have always been impressed by the robustness of its mksysb
process.  However, recent versions have become less robust and, as you
will see, the various changes can sometimes paint you into a corner.

This is not strictly a TSM related post, but I know that there are a lot
of small AIX/TSM installations out there that might be relying on a
mksysb restore as the cornerstone of a DR procedure.

The scenario is that this client has two P620-6M2 servers.  One is the
production SAP machine, the second is the quality assurance (QA) machine.  QA
has 2 CPUs instead of 4 on prod, 1 fibre card instead of two and 4GB of
memory rather than 8GB.  Both machines back up using AIX mksysb and
savevg to one half of a 3582 autoloader over an LTO2 fibre connection.
Both had rootvg mirrored on dual internal 146GB disks.

The OS was AIX 5.2 at ML007, originally installed from ML002 cd media.
Microcode was way out of date.

The test was to restore a mksysb of the prod system on the QA system
then restore the SAP data and bring up SAP.

I developed a process.  Because the tape drive was fibre attached it was
not bootable.  Tape was also not bootable because the boot image was too
large, a problem with AIX 5.2.  The plan was therefore to boot from the
install media, then use that to install from the mksysb image on the
fibre-attached tape drive.

This worked well, the cd booted ok, we could see the fibre drive, the
data restored ....
But when we got to the end of the process the installation looped with
"process killed" messages.
A call was placed with IBM AIX support.  It turns out that as of AIX 5.2,
when you restore using a bootable CD, the CD has to be at the same or
later level than the system you are restoring; otherwise, results are
unpredictable.
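
The level rule above can be sketched as a quick pre-flight check.  This
is only a minimal Python illustration, not an IBM tool.  It assumes the
levels are written in the release-ML form that "oslevel -r" prints (for
example 5200-07), and the function names parse_level and
media_can_restore are hypothetical.

```python
from typing import Tuple


def parse_level(level: str) -> Tuple[int, int]:
    """Split a level string such as '5200-07' into (release, maintenance level).

    Assumption: the string uses the 'RRRR-MM' form printed by 'oslevel -r'.
    """
    release, _, ml = level.strip().partition("-")
    return int(release), int(ml)


def media_can_restore(media_level: str, system_level: str) -> bool:
    """Return True if the boot media is at the same or a later level
    than the system whose mksysb is being restored."""
    return parse_level(media_level) >= parse_level(system_level)


# The case in this post: ML002 install CDs against an ML007 mksysb.
print(media_can_restore("5200-02", "5200-07"))  # False
# Media at a later level than the mksysb.
print(media_can_restore("5200-08", "5200-07"))  # True
```

Running a check like this before wiping the target machine would have
flagged the ML002 media as too old for the ML007 image.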

At this point we had blown away the QA system but were unable to restore
prod.  We were also unable to restore QA for the same reason (yes I
tried).

A bit more investigation turned up a procedure for creating a CD image on
the current prod system that should be able to boot and allow the
restore to proceed.  As there was no burner on these machines I created
the image on prod, copied it to a Windows box and burnt it there.

The QA machine booted from the resultant CD, but after boot, the only
device that it could see was the CD drive.  The tape drive was simply
not visible.  Another call to AIX Support.  It seems that the Atape
driver that is required to use the LTO2 tape drive is developed by the
storage people and not by the AIX people. It is non-standard and when
installed changes the AIX ODM in a way that corrupts the recovery CD
that is generated so that the tape drives are not visible, giving the
symptoms that I saw.

At this point the client was getting anxious to have their machine back
so, using the original 52-002 install CDs, I did a new OS install to one
of the internal disk drives, including the alt_disk_install package,
then used alt_disk_install to restore the old OS to the other internal
disk.  It would not boot from the new image.

This was now late Friday afternoon, and two full days had been expended
in the process.  Over the weekend the client attempted a restore
using my original plan but got the same result as I did.

Monday morning I was back on site with a colleague who is also well
versed in AIX.  We decided to apply a microcode update to the machine
that would allow it to boot from the larger boot image direct from tape.
We brought with us another, SCSI-attached LTO2 drive that we connected
to the built-in SCSI bus on the machine.  The microcode update was very
slow, and not helped by the fact that we needed to use floppies, and
no-one uses these any more so it took some time to find them.
Eventually the microcode was updated, but the machine would not boot
from the tape, giving an error code that indicated IO errors on the tape
drive.

After a couple of attempts we reverted to re-installing the operating
system and performing an alt_disk_install.  We had upgrade CDs to 52-008
on hand so we upgraded to that level before doing the alt_disk_install
process.  We could not read the tape on the SCSI-attached drive so
eventually reverted to the fibre-attached drive and the data restored.
This time the OS booted, but we could not connect to it.

The machine has a video card and uses a standard LCD screen and keyboard
to provide a console.  Nothing was appearing on it so we used a serial
cable to connect to the serial port using hyperterm from the laptop.  We
could get a login prompt, but could not log in.

This was how we ended day three.

Day four, we attended with a second SCSI attached tape drive, and a 3151
terminal.  Once the terminal was plugged in we could log on, then
activate the lft console and before long everything was back up and
running.


Lessons:

1. Always keep a green screen terminal on hand - actually a null modem
cable into hyperterm would have done the trick.

2. AIX 5.2 and above require you to generate a new boot CD after every
maintenance update.

3. If you use the Atape driver, the generated boot CD is not usable.  I
would suggest that Atape users order a refresh of their install
CDs with every maintenance update they do.

4. Test your recovery procedures before you need to.  At least in this
case production was not affected.



I hope all this helps someone.

Regards

Steve

Steven Harris
AIX and TSM Admin
Brisbane Australia
