[Veritas-bu] Update and more information: FROZEN media problems in available_media

2002-06-04 18:24:58
Subject: [Veritas-bu] Update and more information: FROZEN media problems in available_media
From: larry.kingery AT veritas DOT com (Larry Kingery)
Date: Tue, 4 Jun 2002 18:24:58 -0400 (EDT)
Christopher Jay Manders writes:
> UPDATE:
> 
> The problem is getting a lot worse. We had a lot of 96 errors last night. The
> available_media seems rather cluttered, too. More on that below...
> 
> So, I found a doc by Sun Prof Support that indicates that the image database
> can get out of sync with the media manager database somehow.
> 
> It says that if you can do a vmquery -m mediaid but not bpmedia -unfreeze -ev
> mediaid then this is likely the case.

Nope.  This is completely normal behavior.  Media Manager
(e.g. vmquery) talks to the volume database (volDB) which keeps track
of things like where tapes are, when they were last used, etc.
NetBackup (e.g. bp*) tracks tapes (in the mediaDB's) that have
unexpired backup data on them.  So, if a tape is blank or expired,
it'll show up in vmquery (note the time assigned field should be
empty), but not in the bp* commands.

The other important item is that when using many of the bp commands,
you need to specify which media server.  Using bpimagelist -summary is
one way to get a list of all assigned tapes and which server owns them
at the moment.  A NetBackup tape will be used by only one media server
until it expires; then it can be used by another, but never by more
than one at a time.

> 
> How do I fix this? I have an L180 and an L3500, each on a separate media host.
> Each has about 100 tapes in the FROZEN state.

Okay, first off you need to figure out WHY they're frozen.  They are
probably frozen for a darn good reason (to protect them), so if you
unfreeze them without fixing the "real problem" they'll just get
frozen again.

> 
> There are no hardware issues that I can find. We have scripts that report
> offline and down drives, and monitor /var/adm/messages with swatch looking for
> h/w errors and stuff.
> 
> I'll just list all the quirks here to see if a bigger pattern than I can see
> is developing...
> 
> Another caveat that is interesting is that we had a lot of DBBACKUP tapes in
> the available_media output until I put a 'sleep 5' in front of the main
> bpimagelist command being run in there. Now we only get a couple of DBBACKUP
> tapes. This DBBACKUP tape thing happened shortly after adding another media
> host to our NetBackup server cluster.
> 
> Another 'symptom' is that we have a lot of AVAILABLE tapes in the
> available_media output that have a robotic type of NONE and no robnum or
> robslotnum, but have a media type (DLT) and the barcode/media ID are listed.
> Why are these in here? It seems to be cluttering things up, and I wonder if
> there is a problem with

These are simply tapes which MM/NBU know about, but aren't in a robot
at the moment.

> 
> We also get a number of tapes whose slots appear skewed, no matter how many
> times we inventory the robot itself and then again in the software (or via
> vmupdate). By that I mean, available_media shows a slot of 25 for a mediaid
> that is not even still in the robot. Again, we have updated the robot in the
> inventory.

Are you checking the vmupdate exit status?  My first thought is that
the update is failing.  One possible reason: if you take a tape out of
one robot, put it in another, and then inventory the second robot
before the first, it will see this as a mismatch.
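
As a rough illustration of checking the exit status rather than
assuming the update worked, here is a minimal Python sketch.  The
helper names run_and_check and inventory_ok are made up for this
example, and the vmupdate arguments are deliberately left to the
caller since they depend on your robot configuration.

```python
import subprocess
import sys


def run_and_check(argv):
    """Run a command and return (exit_status, combined stdout/stderr text)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def inventory_ok(argv):
    """Return True only when the command exits with status 0.

    argv is whatever inventory/update command line you normally run
    (for example your usual vmupdate invocation); nothing here assumes
    particular flags.
    """
    status, output = run_and_check(argv)
    if status != 0:
        # Surface the failure instead of silently carrying on.
        print("command failed with exit status %d" % status, file=sys.stderr)
        print(output, file=sys.stderr)
    return status == 0
```

A wrapper script would call inventory_ok() with its usual command line
and stop, or alert someone, when it returns False.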

> 
> We were operating fine for a very long while (7 months, at least) doing
> exactly what we have been doing, without variance, and then suddenly a lot of
> these 96 errors started showing up along with DBBACKUP and FROZEN tapes. Nothing
> appears to be able to get the FROZEN tapes to unfreeze, either.
> 
> The FROZEN tapes are ALL fresh, new, tapes. But, just so you know, we have
> tried OLD Legato tapes and OLD Veritas tapes with the same effect. ALL freeze
> up after a single try in a drive.

bperror -U -hoursago XXX 

should say why they got frozen.  You'll also probably want to look at
/usr/openv/netbackup/db/media/errors on the media servers and look for
patterns.  One of those columns is drive index, BTW.
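
One way to look for patterns is to tally how often each value shows up
in a given column.  The sketch below is only that, a sketch: the
function name is invented, the column number is a parameter because I
am not stating here which column holds the drive index, and the sample
rows are placeholders, not the real layout of the errors file.

```python
from collections import Counter


def count_by_column(lines, column):
    """Count how often each value appears in a whitespace-separated column.

    column is zero-based.  Lines with too few fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts


# Placeholder rows for illustration only (NOT the real file format):
# a made-up tape name followed by a made-up drive index.
sample_rows = ["tapeA 4", "tapeB 4", "tapeC 1", ""]

by_drive = count_by_column(sample_rows, 1)
for value, hits in by_drive.most_common():
    print(value, hits)
```

Against the real file you would pass an open file object (for
/usr/openv/netbackup/db/media/errors) and whichever column turns out to
hold the drive index.  If one drive accounts for most of the entries,
suspect that drive or its configuration before suspecting the tapes.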

> 
> Something else that is weird is that we had a situation where a restore was
> calling for a tape, but the barcode label on the tape did not match the
> contents at all. We had assumed this was what patch 110539 fixed...as we also have 3
> ether drops to each box (each on a separate subnet, but round-robin DNS to the
> same hostname) and that was mentioned as part of the fix for that patch.
> 
> I trace the problem to around the time we switched the contents of one
> robot (L1800) with another (L3500). That is when the DBBACKUP tapes started to
> show up.
> 
> It was shortly thereafter that FROZEN media started, I think.

Yes, that makes a lot of sense.  When you add a drive, you need to
tell MM which drive it is (in the robot) and which device file.  If
you get this configuration wrong, say by swapping the device files for
two drives, you can run into a situation where the internal tape label
(RVSN) doesn't match the barcode (EVSN).  Now, as soon as you move
stuff around you can have an issue (how do you know which tape is
which?).

Easy way to test this is to take the tape in question and put it in a
drive.  Use the vmoprcmd command to see what the internal label is -
does it match the barcode?
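
If you repeat that test for a few drives and jot down what you see, the
comparison can be sketched as below.  Everything here is hypothetical:
the function names, the drive names, and the media IDs are invented,
and the observations are typed in by hand rather than parsed from any
command output.

```python
def find_label_mismatches(observations):
    """Return the observations whose internal label differs from the barcode.

    observations is a list of (drive, barcode, internal_label) tuples.
    """
    return [(drive, barcode, label)
            for drive, barcode, label in observations
            if barcode != label]


def looks_like_swapped_drives(mismatches):
    """True when two entries report each other's tapes.

    That reciprocal pattern is what a swapped pair of device files
    would tend to produce.
    """
    pairs = {(barcode, label) for _, barcode, label in mismatches
             if barcode != label}
    return any((label, barcode) in pairs for barcode, label in pairs)


# Hand-entered, made-up observations: (drive, barcode, internal label).
observed = [
    ("drive0", "A00001", "A00002"),
    ("drive1", "A00002", "A00001"),
    ("drive2", "A00003", "A00003"),
]

bad = find_label_mismatches(observed)
print(bad)
print(looks_like_swapped_drives(bad))
```

A reciprocal mismatch points at the drive configuration.  A one-off
mismatch is more likely a tape that was labeled or moved incorrectly.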

> 
> So, we have 3 media hosts, one of which is the master. Servback, getback and
> flashback. Each has 3 network interfaces and at least 8 differential SCSI channels. We
> use only a few of the SCSI channels, so I have a bunch extra.
> 
> Here is an example of the discrepancy. Note that vmquery shows the mediaid,
> but nothing in the bp* commands sees the media:
> # vmquery -m F00132
> ================================================================================
> 
> media ID:              F00132
> media type:            DLT cartridge tape (11)
> barcode:               F00132
> description:           Fulls
> volume pool:           Fulls (2)
> robot type:            TLD - Tape Library DLT (8)
> robot number:          2
> robot slot:            100
> robot host:            getback
> volume group:          00_002_TLD
> created:               Mon Jun 03 14:25:40 2002
> assigned:              ---
> last mounted:          ---
> first mount:           ---
> expiration date:       ---
> number of mounts:      0
> max mounts allowed:    ---
> ================================================================================
> 
> So, it is in the image database.

It's in the volume database.  The image database is something
completely separate (which we won't address here at all).

> 
> But, not the NB media database:

Correct.  Notice above that the assigned time is blank.  That means
the tape is effectively blank, and the bp commands won't know anything
about it.  If there was a time assigned, I'd suggest using the -host
option on the bp commands.
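
To make the "is it assigned?" check concrete, here is a small Python
sketch that reads vmquery -m style output and looks at the assigned
field.  The function names are invented for the example, and it
assumes the "key: value" layout and the "---" placeholder shown in the
output quoted above.

```python
# A few lines copied from the vmquery output quoted above.
SAMPLE = """\
media ID:              F00132
media type:            DLT cartridge tape (11)
robot host:            getback
created:               Mon Jun 03 14:25:40 2002
assigned:              ---
"""


def parse_vmquery(text):
    """Turn 'key: value' lines into a dict, ignoring separator lines."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


def is_assigned(record):
    """A tape counts as assigned when the field holds something other than '---'."""
    return record.get("assigned", "---") != "---"


record = parse_vmquery(SAMPLE)
print(record["media ID"], "assigned:", is_assigned(record))
```

For F00132 this reports the tape as unassigned, which is exactly why
bpexpdate and bpmedia say they cannot find it.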

> 
> # bpexpdate -ev F00132 -d 0
> Are you SURE you want to delete F00132 y/n (n)? y
> requested media id was not found in NB media database and/or MM volume
> database
> 
> OR:
> 
> # bpmedia -ev F00133 -unfreeze
> requested media id was not found in NB media database and/or MM volume
> database
> 
> So, I note that vmquery -pn Fulls (for example) does show all the media, but
> this is not carried into the NB database, so the media ids have no STATUS
> line.
> 
> 
> Any thoughts or pointers would be excellent.  I am stumped. We have had no
> big issues until this...

If you have any media servers sharing robots, or potentially tapes,
you should probably check the volume database host configuration
(tpconfig -l -d on each media server) and make sure they're using a
common volume database.  If they're not, don't just try to change it;
you'll want to get some help straightening that out.
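
Once you have noted the volume database host each media server
reports, the comparison itself is trivial.  A minimal sketch, with an
invented function name and hand-entered values (the server names are
the ones from this thread; which host each one actually reports is a
made-up example):

```python
def volume_db_hosts_agree(hosts_by_server):
    """Return (True, host) if every server names the same host, else (False, None).

    hosts_by_server maps a media server name to the volume database
    host it reports.  Comparison ignores case and surrounding spaces.
    """
    distinct = {host.strip().lower() for host in hosts_by_server.values()}
    if len(distinct) == 1:
        return True, next(iter(distinct))
    return False, None


# Hypothetical values typed in by hand, not parsed from tpconfig output.
reported = {
    "servback": "servback",
    "getback": "servback",
    "flashback": "flashback",
}

print(volume_db_hosts_agree(reported))
```

Here flashback disagrees with the other two, which is the situation
where you stop and get help rather than editing the configuration.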

> 
> 
> Thanks!
> 
> Chris
> 

-- 
Larry Kingery 
            Enter any 11-digit prime number to continue...