TSM DB Corruption and DB Recovery - what happened? (a saga, not a short story)

Our TSM DB was corrupted last week.  Worse yet, it appears* that an
earlier DB Backup operation backed up a corrupted DB, but reported
"completed successfully".  Efforts to restore from that DB Backup failed
twice, in the end costing us 24 hours of restore time.  We ended up having
to restore (roll back to) the prior-day DB Backup.  Between the 1-day
rollback and the 24 hours lost performing 2 failed DB Restores, we lost 2
nights' backups for 83 Unix servers, including 1 night of
otherwise-successful backup processing.

(please review the "Summary of Events" appended to this email before
reading on)

We run TSM Server 4.1.2.0 on Solaris 2.6.  Library is IBM 3494.  DB >50GB
& under 90% Util.  TSM DB & Log volumes are TSM-mirrored.  TSM Log =4GB &
rarely over 10% Util.  TSM LogMode=Normal.  MirrorWrite-Log=Parallel.
MirrorWrite-DB=sequential.  We do not perform automatic expiration or tape
reclamation, and no tape reclamation was performed on this or the prior
day.

We normally don't run client backups during our TSM DB Backups - but
that's just local practice & is a consequence of the sequence of our Daily
Task processing, not a policy.  I know (and Tivoli Support confirmed) that
TSM is designed to support concurrent client backup sessions and DB Backup
processes (how else could you run 24x7 backups?), and we have previously
allowed certain long-running backups to run concurrently with DB Backups
without consequence.  Still, for us, a concurrently-running client backup
and DB Backup is an atypical event.
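
The check-then-backup sequence in step 1 of the Summary of Events leaves a
window in which a new client session can start after the check but before
the DB Backup begins.  Below is a minimal sketch of one way to narrow that
window: disable new sessions first, then check, then start the backup.  It
is not our actual script.  `run_admin` and `count_client_sessions` are
hypothetical callables (for example, wrappers around an administrative
client), and the device class name `DBBACK` is a placeholder.  The admin
commands named (`disable sessions`, `backup db`, `enable sessions`) are
standard TSM server commands, but exact parameters should be verified
against the Administrator's Reference for your server level.

```python
import time


def backup_db_quiesced(run_admin, count_client_sessions,
                       devclass="DBBACK", wait_secs=0, max_checks=1):
    """Disable new client sessions, confirm none remain, then back up the DB.

    run_admin(cmd) and count_client_sessions() are hypothetical callables
    supplied by the caller; devclass is a placeholder device class name.
    Returns True if the backup command was issued, False if client
    sessions were still active after max_checks checks.
    """
    # Blocks NEW client sessions only; sessions already running continue.
    run_admin("disable sessions")
    try:
        for attempt in range(max_checks):
            if count_client_sessions() == 0:
                run_admin(f"backup db devclass={devclass} type=full")
                return True
            if attempt + 1 < max_checks:
                time.sleep(wait_secs)
        return False
    finally:
        # Sessions are re-enabled as soon as run_admin returns.  Whether to
        # keep them disabled until the backup finishes is a site decision.
        run_admin("enable sessions")
```

Because sessions are disabled before the check, a session that would have
started between the check and the backup is refused rather than racing the
DB Backup.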

I would like experienced feedback on where things went awry in the
following series of events (appended below), and opinions as to which step
may have caused the DB corruption and the subsequent apparently-corrupted
DB Backup (despite its reporting successful completion).

* Tivoli Support asserted that:
 - in TSM 4.1 the DB Backup operation does not perform robust consistency
checking, so it could back up a corrupt DB and still report success -
apparently an APAR exists on this.  Can anyone confirm this?
 - consistency checking has been improved & is more robust in TSM
4.1 and/or 5.1.  Can anyone confirm this?
 - it is impossible to incorporate thorough consistency-checking, as is
performed by Audit DB, in presumed-daily DB Backups because of the elapsed
time it requires.  On a >50GB DB such as ours, Tivoli asserted that Audit
DB would take over 50 hours (obviously not possible twice daily).   Can
anyone with similar configuration (>50GB TSM DB on Solaris) confirm this
time estimate for Audit DB?
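
To put Tivoli's estimate in perspective, here is a rough back-of-envelope
calculation.  It takes the two figures above at face value (both are really
lower bounds: ">50GB" and "over 50 hours"), so the results are only
indicative.

```python
# Figures from this message, taken at face value.
db_size_gb = 50        # "DB >50GB"
audit_hours = 50       # Tivoli Support's estimate: "over 50 hours"
backups_per_day = 2    # two scheduled full DB Backups daily

implied_rate_gb_per_hour = db_size_gb / audit_hours
window_hours = 24 / backups_per_day
overrun_factor = audit_hours / window_hours

print(f"implied audit rate: {implied_rate_gb_per_hour:.1f} GB/hour")
print(f"window between DB Backups: {window_hours:.0f} hours")
print(f"audit would overrun the window by a factor of {overrun_factor:.1f}")
```

On these numbers an audit would need more than four times the interval
between our DB Backups.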

I would also welcome any opinions/ideas on how to recover that
apparently-corrupted DB Backup (I'm still not 100% convinced it is
corrupt).  We want to do this on a test server to salvage the prior night's
backup data and also salvage the Activity Log for further review, if
possible.  All
copypool and DB Backup tapes from that day were pulled/preserved.  I don't
think there's a way for us to re-incorporate the lost data back into our
production TSM backups easily - but brilliant ideas are welcome.  Even so,
we would at least regain the ability to access/restore the prior night's
backup data from those preserved copypool tapes, which would mitigate
potential service impact of this incident by 50% (just 1 day lost, not 2).

Are there any methods for restoring only selected parts of a TSM DB ?

rsvp, thanks (experienced respondents only, please)

Kent Monthei
GlaxoSmithKline
_________________________________________________________________

Summary of Events:

1)  A rogue client backup started just before our morning DB Backup (1st
of 2 scheduled daily full DB Backups).  This was after a scripted check to
ensure that no client backups were running (none were) and just prior to
start of the DB Backup (we think).  The check for client sessions is not so
much to ensure nothing runs during the DB Backup - it's to ensure that all
client data reaches the diskpool and then gets migrated or copied to tape
pools before the DB Backup starts.  Still, for us, the rogue client backup
was an atypical event.

2)  The DB Backup stalled immediately, sitting at 0 pages backed up for
over an hour, but TSM Services were not hung & did not fail.  The client
backup was progressing fine & pushing a lot of data.  To resolve the
stalled DB Backup, we cancelled the client backup session (no effect). We
then cancelled the DB Backup process (no effect - it entered & sat for an
hour in 'Cancel Pending' state).

3)  At that point, we decided to halt/restart the TSM Server process.
'dsmserv' came back up normally.

4)  We then repeated the 1st DB Backup, which then progressed normally in
the usual time and reported successful completion.  We continued with Daily
Task processing, which went smoothly up to the 2nd DB Backup.

5)  Almost immediately after startup (0 DB pages backed up), the 2nd DB
Backup process failed with a 'dballoc.c / SMP page mismatch /
initialization of DB page allocator failed' error (something to that
effect).

6)  We decided to halt/restart services a 2nd time.  This time, services
wouldn't restart.  There were no errors in dsmserv.err, no OS/hardware
errors in /var/adm/messages and no core file.  After working with Tivoli
Support, we started dsmserv in the foreground and saw that it was now
reporting the same 'dballoc.c' error as the attempted/failed DB Backup
earlier in the day.

7)  We elected to perform a TSM Restore DB from the 1st DB Backup that day
(the repeat attempt that reported successful completion).  The Restore DB
successfully reformatted the TSM Log and successfully restored 100% of DB
pages in about 3 hours, but then failed during DB Initialization with the
same 'dballoc.c' error.

8)  On the outside chance there was a log-pinned/log-full condition, we
performed a TSM Extend Log, which ran to completion & added 400MB, then
attempted to restart TSM, but 'dsmserv' failed with the same 'dballoc.c'
error.

9)  With the extended log now in place, we repeated step 7.  The Restore
DB performed identically, failing after 3 hours.

10)  We rolled back to the 2nd DB Backup from the prior day and performed
another Restore DB, which succeeded after 3 hours.  We immediately
disabled client sessions and then performed Audit Volume on all diskpool
volumes.  Tape reclamation had not been performed the prior day.  We
pulled/preserved all copypool tapes created the day the problem occurred
and also pulled/preserved the apparently-corrupt DB Backup tape.
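
The stall in step 2 (0 pages backed up for over an hour) is the kind of
condition a simple watchdog could flag early.  Below is a minimal sketch,
assuming page counts are sampled periodically by some means not shown here
(for example, by reading the DB Backup's status from `query process`
output).  The function name, the sample format and the one-hour default
threshold are all illustrative assumptions, not part of TSM.

```python
def is_stalled(samples, stall_secs=3600):
    """Return True if the page count has not advanced for stall_secs or more.

    samples: list of (unix_time, pages_backed_up) tuples, oldest first.
    """
    if not samples:
        return False
    last_time, last_pages = samples[-1]
    # Walk backwards to find how long the counter has held its latest value.
    since = last_time
    for t, pages in reversed(samples):
        if pages != last_pages:
            break
        since = t
    return last_time - since >= stall_secs


# A backup stuck at 0 pages for an hour, sampled every 30 minutes:
print(is_stalled([(0, 0), (1800, 0), (3600, 0)]))      # True
# A backup that is still making progress:
print(is_stalled([(0, 0), (1800, 500), (3600, 900)]))  # False
```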