Subject: restore stg strangeness
From: Steve Harris <Steve_Harris AT HEALTH.QLD.GOV DOT AU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 6 Aug 2003 10:48:04 +1000
Hi All, 

I've seen some things I don't understand with a restore stg command, and I'd like 
anyone who does understand to enlighten me.

TSM server is 4.2.3.3 on AIX 5.1 RML03, running in a p690 LPAR.  All disk is FC-attached 
to IBM ESS arrays; disk stgpools are not mirrored, but the DB and log are mirrored 
using AIX mirroring.  All TSM files are on filesystems; most are JFS, but the newer 
stgpool data is on JFS2.

We are very cautious here.  The TSM LPAR has two HBAs connected to separate SAN 
fabrics, with multiple paths to two ESSes.
Despite that, yesterday all four paths to one ESS dropped out together.  
Nothing much was happening in TSM at the time, so we resynched the disks and 
continued.  However, because of unrelated issues hanging over from the weekend, 
one of our disk pools was 99% full, so we decided to migrate all our data 
early.  During the migration the same disk dropped out again.  Some of the 
stgpool being migrated was on this disk and not mirrored, and TSM gave repeated 
messages about errors reading the disk until it was brought back online 
to AIX, at which point the errors stopped and the migration continued, 
finishing with a FAILURE notification.

Afterward there was no data left to be migrated, but the diskpool and some of its 
volumes weren't empty.  Accordingly I ran an AUDIT VOL FIX=YES against one of 
the affected volumes.  This went OK, but on the second volume the TSM server 
died with an error while attempting a rollback, and would not restart.

Since TSM is down, we run fsck on all the affected filesystems, and they are all 
clean.

So, now we're in trouble and need to restore the DB.  We do that, and the redo 
of the log fails with the same error attempting the rollback.
So, now we're in DEEP trouble, and we just do the restore to the point in time of 
the last backup.

<aside>
We run a normal sort of backup pattern.  Most data goes to diskpools overnight. 
In the morning, diskpools are backed up, then tapepools are backed up, a DB 
backup is taken, expiry is run and tapes are ejected.  In the evening the same 
diskpool/tapepool/DB backup sequence is run, although it normally doesn't do 
much, and then diskpools are force-migrated to tape.
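
(For anyone who wants the detail, the morning cycle boils down to something like 
the macro below.  The pool, devclass, library and volume names are invented and 
the parameters are from memory, so treat it as a sketch of the idea rather than 
our production scripts.)

   /* Morning cycle - hypothetical pool, devclass and library names */
   /* copy the diskpool and the primary tape pool to the copy pool  */
   backup stgpool BACKUPPOOL COPYPOOL wait=yes
   backup stgpool TAPEPOOL COPYPOOL wait=yes
   /* full DB backup, then expiration, then eject tapes for offsite */
   backup db devclass=DBDEVC type=full wait=yes
   expire inventory
   checkout libvolume LIB3494 VOL001 remove=yes
   /* The evening cycle repeats the backups, then forces migration  */
   /* by dropping the thresholds and restoring them afterwards      */
   update stgpool BACKUPPOOL highmig=0 lowmig=0
   update stgpool BACKUPPOOL highmig=90 lowmig=70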

Our problems happened after the morning backup cycle.
</aside>

According to the admin guide we must now run AUDIT VOL FIX=yes on all our 
diskpool volumes.  This takes 4 hours and reports huge numbers of missing files.
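
(In practice that means one AUDIT VOLUME per diskpool volume, something like the 
macro below.  The volume names are invented; a SELECT against the VOLUMES table 
will generate the real list.)

   /* One audit per diskpool volume - file names are invented.      */
   /* A real list can come from:                                    */
   /* select volume_name from volumes where devclass_name='DISK'    */
   audit volume /tsm/stg/backuppool01.dsm fix=yes
   audit volume /tsm/stg/backuppool02.dsm fix=yes
   audit volume /tsm/stg/spacepool01.dsm fix=yes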

Next we run a restore stg on one diskpool and it finishes neatly, without 
mounting a tape. 
The second pool calls for a tape that was not created in the last 24 hours.  
Hmm, defer that - this pool isn't important.
The third and final pool is the one that was in the middle of its migration at 
the time of the crash.  This calls for some really strange tapes.  Eventually we 
produce a list that is a third of the offsite tape pool, including some tapes 
that were last written at the end of June.
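
(For what it's worth, a list like this can be produced without mounting anything 
by previewing the restore, roughly as below; SPACEPOOL is an invented name for 
the pool in question.)

   /* list the copy volumes needed without moving any data          */
   restore stgpool SPACEPOOL preview=yes
   /* or, for a single damaged volume:                              */
   restore volume /tsm/stg/spacepool01.dsm preview=yes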

Eventually, we mark all of the volumes in the diskpools as destroyed, then 
rename the underlying files.  We add these renamed files back into the 
diskpools as new volumes, enable sessions, and we are in business again, able to 
run restore stg at our leisure the following day.
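
(Roughly, and again with invented names, that recovery was:)

   /* Mark the old diskpool volumes destroyed so that restore stg   */
   /* will recreate their contents from the copy pool               */
   update volume /tsm/stg/backuppool01.dsm access=destroyed
   update volume /tsm/stg/backuppool02.dsm access=destroyed
   /* Rename the files at the AIX level (mv ...01.dsm ...01.old),   */
   /* then define the renamed files back as fresh diskpool volumes  */
   define volume BACKUPPOOL /tsm/stg/backuppool01.old
   define volume BACKUPPOOL /tsm/stg/backuppool02.old
   /* let clients back in; restore stg can then run at leisure      */
   enable sessions
   restore stgpool BACKUPPOOL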

OK, so the first question is:

When TSM couldn't read the diskpool volumes in the first place, I would have 
expected it to immediately mark them as offline and stop using them, but it 
didn't.  Why?  Under what circumstances do volumes go offline?


Second question.

Restoring diskpools seems strange.  There are two possibilities.  If a 
diskpool is "cleared" by a migration, then the data is unavailable after the DB 
restore to the previous point in time, but the restore stg should only refer to 
tapes created in the most recent backup stg operation.  
On the other hand, if the diskpool is not "cleared" by migration, but rather the 
data is left in place and "forgotten", then only files that are overwritten by 
new data after the DB restore point should be damaged and need restoration.  
The rest should just magically reappear when their DB references are restored. 

I can't think of any other possibilities.

Sorry to have been so detailed, but I wanted you all to have the full story.  
The whole concept of having to restore data from 85 tapes after a two second 
outage is extremely worrying.  Having to get significant numbers of tapes back 
from off-site storage to do this in a hurry is even more so.

Thanks

Steve Harris
AIX and TSM Admin
Queensland Health, Brisbane Australia.



