ADSM-L

Re: urgent! server is down!

2003-03-12 08:41:59
Subject: Re: urgent! server is down!
From: Dan Foster <dsf AT GBLX DOT NET>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Wed, 12 Mar 2003 13:41:07 +0000
Hot Diggety! Richard Sims was rumored to have written:
> >After a reboot yesterday tsm doesnt start. ...
> ...
> >ANR0900I Processing options file dsmserv.opt.
> >ANR000W Unable to open default locale message catalog, /usr/lib/nls/msg/C/.
> >ANR0990I Server restart-recovery in progress.
> >ANR9999D lvminit.c(1872): The capacity of disk '/dev/rtsmvglv11' has
> >changed; old capacity 983040 - new capacity 999424.
> ...
> I would take a deep breath and stand back and think about that situation
> first...  There's no good reason for a server to be running fine and
[...]

I agree with what Richard had to say. Taking a deep breath is always step
#1 for handling a crisis without making it worse.

999424 - 983040 = 16384. If those capacities are in KB, that is exactly
16 MB, which sounds suspiciously like the PP size. 'rtsm...' sounds like a
raw LV rather than a filesystem.
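
The arithmetic is easy to double-check at a shell prompt. This is only a
sketch: it assumes the capacities in the ANR9999D message are in KB (the
message itself does not say), which is what makes the difference come out
to 16 MB.

```shell
# Capacities copied from the ANR9999D message; units ASSUMED to be KB.
old_kb=983040
new_kb=999424

# Growth of the LV, in KB and in MB.
diff_kb=$((new_kb - old_kb))
diff_mb=$((diff_kb / 1024))

echo "grew by ${diff_kb} KB = ${diff_mb} MB"
```

Comparing that figure against the PP SIZE field that 'lslv tsmvglv11'
reports (a read-only query) should confirm or rule out the theory.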

Perhaps someone with root access had done this at some point earlier:

# extendlv tsmvglv11 1

[or had done the equivalent in SMIT.]

(DO NOT EXECUTE THE ABOVE COMMAND! I am only theorizing what may have
happened)

As for en_US vs C, do this:

# grep LANG /etc/environment

If it says LANG=C then try:

1. Changing it to LANG=en_US in /etc/environment
2. At the root prompt: # export LANG=en_US
3. Try starting up TSM now
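
The check in the steps above could be wrapped up roughly like this. The
function names are made up for illustration, and it matches only a real
LANG= assignment rather than any line containing the string LANG.

```shell
# Print the value of LANG from an environment file.
# Defaults to /etc/environment; pass another path to try it on a copy.
show_lang() {
    env_file="${1:-/etc/environment}"
    # Only lines that start with LANG= count; the last one wins.
    grep '^LANG=' "$env_file" | tail -n 1 | cut -d= -f2
}

# Succeeds (exit 0) when the file sets LANG=C, i.e. when step 1 applies.
lang_needs_change() {
    [ "$(show_lang "$1")" = "C" ]
}
```

For example, 'lang_needs_change && echo "LANG is C"' run as root would
tell you whether steps 1 and 2 are worth trying.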

And you will probably want to ask your operations staff if anyone had
increased the LV's allocation, perhaps by one physical partition with
extendlv or similar. If someone had done it, I'd have made them put Humpty
Dumpty back together as a great learning experience ;) Tell people to *NOT*
mess around with the TSM server if they do not know what they're doing.

I did a quick test with TSM 5.1 by creating a small 16 MB DB logical
volume (1 PP), started up server OK. Then I did 'extendlv tsmdblv 1',
halted server, and tried to start it up again. I got the exact same
errors you got.

I suspect you may have to remove that LV, recreate it with the expected
size that TSM wants, then do a DB restore from your most recent full db
backup tape.

But before you do that, you'll want to save a copy of your current device
config and volume history file if you have these, as well as your
dsmserv.opt file. Then look in the TSM 5.1 Server for AIX Administrator's
guide at:

http://publibfp.boulder.ibm.com/epubs/pdf/c3207680.pdf

(This is assuming you use TSM 5.1 for AIX; if you use another version,
you'll want to consult that guide instead, but the steps will probably
be similar or still exactly the same.)
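
Saving the config files first could look something like the sketch below.
The function name is hypothetical, and so are the destination directory
and the device config file name in the example call; substitute whatever
your dsmserv.opt actually points at.

```shell
# Copy the named config files into a fresh timestamped directory.
# Usage: save_tsm_configs <server dir> <destination dir> <file>...
# Prints the directory the copies were written to.
save_tsm_configs() {
    src_dir="$1"
    dest_root="$2"
    shift 2
    dest="$dest_root/tsm-config-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$src_dir/$f" ]; then
            # -p keeps the original timestamps and permissions.
            cp -p "$src_dir/$f" "$dest/" || return 1
        else
            echo "warning: $src_dir/$f not found, skipping" >&2
        fi
    done
    echo "$dest"
}

# Example call (destination and devconfig file name are ASSUMPTIONS):
# save_tsm_configs /usr/tivoli/tsm/server/bin /some/safe/place \
#     dsmserv.opt volhist.cfg devconfig.cfg
```

Keep the copies somewhere that does not live on the volume group you are
about to work on.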

DB restore is covered in Chapter 22. 'Restoring a Database to its Most
Current State' at bottom of page 524 is probably your easiest option since
it sounds like you have everything else intact -- volume history info,
logvols, stgpool vols, etc.

Then you'll have to delete the offending LV (with 'rmlv tsmvglv11') and
recreate it (with 'mklv -y tsmvglv11 <vg> <number of PPs>'). Then...

Find out which tape has the most recent full DB backup, then do:

# cd /usr/tivoli/tsm/server/bin
# ./dsmserv restore db devclass=<whatever> vol=<tape volser>

If that command worked (it's a preview, basically), then do:

# ./dsmserv restore db devclass=<whatever> vol=<tape volser> commit=yes

...which will make the restore actually happen, for real.

The actual restore operation is no big deal if you have a good and recent
db backup tape, and know which tape it is. I did this as part of testing
recently, and it worked right off the bat with no problems at all.

If you don't know which tape volser has the latest full db backup, then
you could look into your volume history file. For example, with my setup:

backup1:/usr/tivoli/tsm/server/bin# grep BACKUPFULL volhist.cfg
 2003/02/21 14:41:24  BACKUPFULL  5  0  1  3584_DEVCLASS1  ROC010
 2003/03/01 21:06:51  BACKUPFULL  6  0  1  3584_DEVCLASS1  ROC012

(The file might be called 'volhistory.cfg'; I had explicitly defined mine
to be 'volhist.cfg' at server installation time.)

I have two full db backup tapes... one was done on 2/21, is version 5.
The more recent one was done on 3/1, version 6. So I'd restore ROC012,
for example.
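
If the file has many entries, picking out the newest one by eye gets
tedious. Here is a rough sketch that does it with awk. It assumes the
record layout shown in my grep output (date and time first, volume name
last) and copes with the volume name landing on a continuation line; the
function name is made up, and other server versions may lay the file out
differently.

```shell
# Print the volume name of the most recent BACKUPFULL entry in a
# volume history file. Usage: latest_full_backup_vol <volhist file>
latest_full_backup_vol() {
    awk '
        /BACKUPFULL/ {
            stamp = $1 " " $2          # YYYY/MM/DD HH:MM:SS sorts as text
            vol = ""
            if (NF >= 8) {
                vol = $NF              # whole record on one line
            } else if ((getline next_line) > 0) {
                n = split(next_line, parts, " ")
                vol = parts[n]         # volume is last on the next line
            }
            if (stamp > best) { best = stamp; best_vol = vol }
        }
        END { if (best_vol != "") print best_vol }
    ' "$1"
}
```

Run against my volhist.cfg this should print ROC012, but do check the
result against the file yourself before you hand the tape to a restore.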

I should warn you that any backups done after the date/time of the most
recent DB backup will effectively be lost, so be sure you really do want to
restore the DB before committing to it. Doesn't sound like you have too
much of a choice in this particular case.

Note for the discerning ADSM-L reader: my production server has daily db 
backups! The above was from a test box.

-Dan
