Hi All,
I've seen some behaviour I don't understand with a restore stg command, and I'd
like anyone who does understand it to enlighten me.
TSM server is 4.2.3.3 on AIX 5.1 RML03 running in a p690 LPAR. All disk is
FC-attached to IBM ESS arrays; disk stgpools are not mirrored, but the DB and
log use AIX mirroring. All TSM files are on filesystems: most are JFS, but newer
stgpool data is on JFS2.
We are very cautious here. The TSM LPAR has two HBAs connected to separate SAN
fabrics, with multiple paths to two ESSes.
Despite that, yesterday all four paths to one ESS dropped out together.
Nothing much was happening in TSM at the time, so we resynched the disks and
continued. However, because of unrelated issues left over from the weekend,
one of our disk pools was 99% full, so we decided to migrate all our data
early. During the migration the same disk dropped out again. Part of the
stgpool being migrated was on this disk and not mirrored, and TSM gave repeated
error messages about read errors on the disk until it was brought back online
to AIX. At that point the errors stopped and the migration continued,
finishing with a FAILURE notification.
Afterward there was no data left to migrate, but the diskpool and some of its
volumes weren't empty. Accordingly I ran an AUDIT VOL FIX=yes against one of
the affected volumes. This went OK, but on the second volume the TSM server
died with an error attempting rollback and would not restart.
Since TSM is down we run fsck on all the affected filesystems and they are all
clean.
So, now we're in trouble and need to restore the DB. We do that and the redo
of the log fails with the same error attempting the rollback.
So, now we're in DEEP trouble and just do the restore to point in time of the
last backup.
<aside>
We run a normal sort of backup pattern. Most data goes to diskpools overnight.
In the morning diskpools are backed up, then tapepools are backed up, a DB
backup is taken, expiry is run and tapes are ejected. In the evening the same
diskpool/tapepool/db backup sequence is run, although it normally doesn't do
much, and then diskpools are force migrated to tape.
Our problems happened after the morning backup cycle.
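As a rough sketch, the cycle corresponds to admin commands along these lines.
The pool and device class names (DISKPOOL, TAPEPOOL, COPYPOOL, DBCLASS) are
placeholders rather than our real ones, type=full is an assumption, and
setting the migration thresholds to zero is simply the usual way of forcing a
migration:

    /* morning */
    backup stgpool DISKPOOL COPYPOOL
    backup stgpool TAPEPOOL COPYPOOL
    backup db devclass=DBCLASS type=full
    expire inventory
    /* tapes are ejected here; the evening run repeats the three */
    /* backups above and then forces migration:                  */
    update stgpool DISKPOOL highmig=0 lowmig=0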
</aside>
According to the admin guide we must now run AUDIT VOL FIX=yes on all our
diskpool volumes. This takes 4 hours and reports huge numbers of missing files.
Next we run a restore stg on one diskpool and it finishes neatly, without
mounting a tape.
The second pool calls for a tape that was not created in the last 24 hours.
Hmm, defer that; this pool isn't important.
The third and final pool is the one that was in the middle of its migrate at
the crash. This calls for some really strange tapes. Eventually we produce a
list that is 1/3 of the offsite tape pool, including some tapes that were last
written at the end of June.
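A list like this can be produced without mounting anything by running the
restore in preview mode, something like the following (DISKPOOL is a
placeholder pool name, not our real one):

    restore stgpool DISKPOOL preview=yes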
Eventually, we mark all of the volumes in the diskpools as destroyed, then
rename the underlying files. We add these renamed files back in to the
diskpools as new volumes, enable sessions and we are in business again, able to
run restore stg at our leisure the following day.
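In command terms the workaround was roughly as follows, per volume. The volume
path and pool name are made-up examples, and the rename itself is done from
the AIX shell, not from TSM:

    update volume /tsm/stg/disk01.dsm access=destroyed
    /* from AIX: mv /tsm/stg/disk01.dsm /tsm/stg/disk01a.dsm */
    define volume DISKPOOL /tsm/stg/disk01a.dsm
    enable sessions
    /* later, at our leisure */
    restore stgpool DISKPOOL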
OK, so the first question is:
When TSM couldn't read the diskpool volumes in the first place I would have
expected it to immediately mark them as off-line and stop using them, but it
didn't. Why? Under what circumstances do volumes go off-line?
Second question.
Restoring of diskpools seems strange. There are two possibilities. If a
diskpool is "cleared" by a migration, then the data is unavailable after the DB
restore to previous point in time, but the restore stg should only refer to
tapes created in the most recent backup stg operation.
On the other hand if the diskpool is not "cleared" by migration, but rather the
data is left in place and "forgotten", then only files that are overwritten by
new data after the DB restore point should be damaged and need restoration.
The rest should just magically reappear when their DB references are restored.
I can't think of any other possibilities.
Sorry to have been so detailed, but I wanted you all to have the full story.
The whole concept of having to restore data from 85 tapes after a two-second
outage is extremely worrying. Having to get significant numbers of tapes back
from off-site storage to do this in a hurry is even more so.
Thanks
Steve Harris
AIX and TSM Admin
Queensland Health, Brisbane Australia.