ADSM-L

ANR9999D with MAGIC (was: ANR9999D Error message: Do you how to figure out this message ?)

2002-06-06 12:11:31
Subject: ANR9999D with MAGIC (was: ANR9999D Error message: Do you how to figure out this message ?)
From: "Prather, Wanda" <Wanda.Prather AT JHUAPL DOT EDU>
Date: Thu, 6 Jun 2002 12:09:42 -0400
(Sorry for the length of this, but this is complicated.  Don't read it
unless you are getting ANR9999D on reclaims):

ANR9999D is a generic bucket message for anything that doesn't have its own
error message number, so you can get ANR9999D for lots of reasons.

The original post had this information:

  ANR9999D ssrecons.c(2405): ThreadId<50> Expected:
                          Magic=53454652, SrvId=0, SegGroupId=5855761.

Whenever I have gotten ANR9999D on a RECLAIM with BOTH keywords: ssrecons.c
AND Magic=, I have found there really are a few backup files that are
permanently damaged.
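
For anyone who wants to scan saved activity log output for this pattern, here
is a minimal sketch in Python.  It assumes the log has been dumped to plain
text and that the "Expected:" detail always looks like the one sample above;
the function and variable names are made up for illustration, and other server
levels may format the message differently.

```python
import re

# One log entry: from an ANR message number up to the next one (or end of text),
# so a message wrapped over several lines is treated as a single entry.
ENTRY_RE = re.compile(r"ANR\d{4}[A-Z].*?(?=ANR\d{4}[A-Z]|\Z)", re.DOTALL)

# Field layout copied from the single sample in this post (an assumption).
MAGIC_RE = re.compile(
    r"Magic=(?P<magic>[0-9A-Fa-f]+),\s*"
    r"SrvId=(?P<srv_id>\d+),\s*"
    r"SegGroupId=(?P<seg_group_id>\d+)"
)

# The message quoted from the original post.
SAMPLE = (
    "ANR9999D ssrecons.c(2405): ThreadId<50> Expected:\n"
    "                          Magic=53454652, SrvId=0, SegGroupId=5855761.\n"
)


def find_reclaim_magic_errors(log_text):
    """Return one dict per ANR9999D entry that has BOTH ssrecons.c and Magic=."""
    hits = []
    for match in ENTRY_RE.finditer(log_text):
        entry = match.group(0)
        if not entry.startswith("ANR9999D"):
            continue  # some other message number
        if "ssrecons.c" not in entry:
            continue  # ANR9999D, but not from the reconstruction code
        fields = MAGIC_RE.search(entry)
        if fields is None:
            continue  # no Magic= detail, so not the case described here
        hits.append({
            "magic": fields.group("magic"),
            "srv_id": int(fields.group("srv_id")),
            "seg_group_id": int(fields.group("seg_group_id")),
        })
    return hits
```

Calling find_reclaim_magic_errors(SAMPLE) returns a single entry with magic
"53454652", srv_id 0 and seg_group_id 5855761.  An ANR9999D that lacks either
keyword is skipped, since it may have a completely different cause.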

When I get this error, if I run an AUDIT against the tape, it finds the
damaged files. (You have to be at 4.1.3 or above for the AUDIT to find it,
and if the error is from an offsite reclamation, you should run the AUDIT
against the primary pool tape, not the tape that was being reclaimed.)

If the files are damaged, I can run a RESTORE VOLUME to recreate the primary
files from the offsite pool.  The RESTORE VOLUME will complete OK, but the
files are still damaged, and the resulting new tape can't be reclaimed,
either.   I end up deleting the remaining files on the tape to clear the
problem.  (Clearly there is a bug in RESTORE VOLUME.)

If you try the RESTORE VOLUME, check the results by RUNNING AN AUDIT ON THE
RESULTING TAPE.  If it still says the files are damaged, those files are
toast.
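
The audit / restore / audit-again sequence can be written down as a short
checklist.  The Python sketch below only builds the admin command strings; it
does not run anything.  The volume names are hypothetical, the new volume name
is something you have to look up yourself after the restore, and FIX=NO is an
assumption chosen so that the audit only reports what it finds.  For an offsite
reclamation error, the first volume should be the primary pool tape.

```python
def _check_volume_name(volume_name):
    """Normalize a volume name and reject anything that is not a single token."""
    name = volume_name.strip().upper()
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("volume name must be a single non-empty token")
    return name


def audit_volume_command(volume_name, fix=False):
    """Build an AUDIT VOLUME command; fix=False gives FIX=NO (report only)."""
    name = _check_volume_name(volume_name)
    return f"AUDIT VOLUME {name} FIX={'YES' if fix else 'NO'}"


def restore_volume_command(volume_name):
    """Build a RESTORE VOLUME command for a damaged primary pool volume."""
    return f"RESTORE VOLUME {_check_volume_name(volume_name)}"


def restore_and_verify_plan(damaged_volume, new_volume):
    """List the commands in order: audit, restore, then audit the new tape.

    new_volume is the tape the restore wrote to (hypothetical input here);
    the last step is the check recommended above.
    """
    return [
        audit_volume_command(damaged_volume),
        restore_volume_command(damaged_volume),
        audit_volume_command(new_volume),
    ]
```

For example, restore_and_verify_plan("vol001", "vol002") yields the three
commands in the order they would be entered, ending with the audit of the
newly written tape.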

I have also found that MOVE DATA (at least without RECONSTRUCT) will move
the files and complete with no complaints, but the files will still be bad,
and the resulting tapes will still not reclaim completely.  So I assume the
problem is only detected when TSM is reconstructing an aggregate.

If this error occurs on reclaim of a primary volume, you will find the
volume reclaims down to just a few damaged files.  You can Q CONTENT for the
volume to see what is damaged, then DELETE with DISCARDDATA.  (As ALWAYS -
NEVER RUN DELETE with DISCARDDATA until you are DARN SURE YOU KNOW WHAT YOU
ARE DOING.)
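
As a rough illustration of that caution, here is a Python sketch that builds
the two command strings and refuses to produce the DELETE unless the caller
explicitly says the contents have been reviewed.  It runs nothing itself.  The
function names, the reviewed flag and the optional DAMAGED=YES filter are
assumptions for illustration, not something from my own procedure.

```python
def _check_volume_name(volume_name):
    """Normalize a volume name and reject anything that is not a single token."""
    name = volume_name.strip().upper()
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("volume name must be a single non-empty token")
    return name


def query_content_command(volume_name, damaged_only=False):
    """Build a QUERY CONTENT command, optionally limited to damaged files."""
    command = f"QUERY CONTENT {_check_volume_name(volume_name)}"
    if damaged_only:
        command += " DAMAGED=YES"
    return command


def delete_volume_command(volume_name, contents_reviewed=False):
    """Build DELETE VOLUME ... DISCARDDATA=YES, but only after a review.

    contents_reviewed is a hypothetical safety flag: the caller must set it
    to True after checking the QUERY CONTENT output by hand.
    """
    name = _check_volume_name(volume_name)
    if not contents_reviewed:
        raise RuntimeError(
            f"refusing to build DELETE for {name}: review QUERY CONTENT first"
        )
    return f"DELETE VOLUME {name} DISCARDDATA=YES"
```

The point of the flag is simply that the destructive command cannot be
generated by accident; the review of what is actually left on the volume
still has to be done by a person.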

If the error occurs on reclaim of a copy pool volume, it is harder to decide
what to do.  The copy pool volume will also reclaim down to just a couple of
files, the AUDIT will show damaged data on the primary tape, but the primary
tape may still have a zillion good files on it.  (DON'T PANIC and think the
whole volume is bad; it probably isn't.)  I usually get the file names from
the AUDIT output, and DELETE the copy pool tape, knowing that the problem
may surface again when the primary tape reclaims.

In our case, every time I had this error occur, I have been able to track it
back (I think) to a DB restore we did 2 years ago. I also saw a couple of
errors like this when some intermittent hard disk errors caused some
damage in my disk pool.  DB restores and hardware errors are both legitimate
reasons for data damage.  I have NO REASON to believe there is anything out
there CREATING these errors for NEW BACKUPS.  I don't think there is any
integrity problem, except the bug in RESTORE VOLUME.

In our case, when we did the DB restore 2 years ago we lost about 24 hours
of backup data.  But also reclaim had been running, and since my primary
pool is collocated, a lot of tapes had been touched.  We did an AUDIT on
EVERY tape that was touched, but that was TSM 3.7 and AUDIT at TSM 3.7
didn't catch the problem (there were apparently significant AUDIT
improvements at 4.1).  The ANR9999D errors started surfacing months later
when the tapes went through reclaim.  We ran RESTORE VOLUME on them to
rebuild the data from the offsite pool.  RESTORE VOLUME completed OK, and I
assumed the problem was fixed.  ONLY LATER (like a year later) when THOSE
tapes started reclaiming did the ANR9999D errors surface again, and I
realized that RESTORE VOLUME was faulty.  But every time one of these errors
surfaces (I have had 5 in the last 6 months) I check the files that won't
reclaim against the BACKUPS table, and find that they are small numbers of
old backups that were probably caught in our DB debacle.  SO again, this is
a case of the reclaim catching OLD problems.  I have NO REASON to believe
there is anything out there CREATING these errors for new files.
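
The check against the BACKUPS table can be sketched roughly as follows, in
Python.  It builds a SELECT for one file and compares a backup date with the
date of the DB restore.  The column names (NODE_NAME, FILESPACE_NAME, HL_NAME,
LL_NAME, STATE, BACKUP_DATE) are assumed from the BACKUPS table as I know it
and may differ at your server level; the function names and all the example
values are hypothetical.

```python
from datetime import date


def _sql_quote(value):
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def backups_lookup_sql(node_name, filespace_name, hl_name, ll_name):
    """Build a SELECT against BACKUPS for one file (column names assumed)."""
    return (
        "SELECT NODE_NAME, FILESPACE_NAME, HL_NAME, LL_NAME, STATE, BACKUP_DATE "
        "FROM BACKUPS "
        f"WHERE NODE_NAME={_sql_quote(node_name)} "
        f"AND FILESPACE_NAME={_sql_quote(filespace_name)} "
        f"AND HL_NAME={_sql_quote(hl_name)} "
        f"AND LL_NAME={_sql_quote(ll_name)}"
    )


def predates_db_restore(backup_date, db_restore_date):
    """True if the backup was taken on or before the day of the DB restore.

    Such a file could have been on a tape touched around the restore; a file
    backed up afterwards would point to a new problem instead.
    """
    return backup_date <= db_restore_date
```

If every file that refuses to reclaim comes back with a backup date before the
DB restore, that fits the "old damage surfacing late" explanation.  A file
backed up well after that date would be a reason to look for a different
cause.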

I have found NO DOCUMENTATION ANYWHERE that explains this problem.  All this
is stuff I've done on my own, and from which I have drawn my own
conclusions.   Use the information if it helps you, but your situation may
be different.

Whew.  Time for lunch.
************************************************************************
Wanda Prather
The Johns Hopkins Applied Physics Lab
443-778-8769
wanda_prather AT jhuapl DOT edu

"Intelligence has much less practical application than you'd think" -
Scott Adams/Dilbert
************************************************************************






