[Veritas-bu] STK Experts, need help

Based on the sense key, asc, and ascq reported by the check condition
from the drive, it is likely one of the following:

1. The drive was accessed by another initiator.
   Check the media header to see if it is still valid.
   If not, it's been overwritten and this is the likely cause.
   On 3.4.3 you can run bpmedialist -mheader -ev <MEDIAID>
2. The media is physically bad. As a test, try appending to it,
   using the same drive if you can.
3. There was a glitch in the data path from the media server to the
   tape drive.
4. The tape drive is bad.

If the drive is working fine now, the media is valid, and the media can be
appended to, I would suggest it was 3. Hopefully it is a fabric
architecture, not loop.
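
For reference, the sense data in the quoted bptm line can be pulled out and
the sense key decoded with a few lines of script. This is only a rough
sketch, not a NetBackup tool: the function names are made up for
illustration, the sense key names come from the standard SCSI table, and the
asc/ascq are left as raw values because an ascq of 0x80 or above is
vendor-specific and has to be looked up in the drive vendor's documentation.

```python
import re

# Sense key names from the standard SCSI sense key table (partial list).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0xB: "ABORTED COMMAND",
}

# Matches the "key = 0x4, asc = 0x44, ascq = 0xb6" part of a bptm log line.
SENSE_RE = re.compile(
    r"key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def parse_sense(line):
    """Return (key, asc, ascq) as ints, or None if the line has no sense data."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    return tuple(int(g, 16) for g in m.groups())


def describe_sense(line):
    """Give a one-line description of the sense data found in a log line."""
    sense = parse_sense(line)
    if sense is None:
        return "no sense data found"
    key, asc, ascq = sense
    name = SENSE_KEYS.get(key, "unknown sense key")
    # ASCQ values 0x80-0xFF are vendor-specific qualifiers.
    vendor = " (vendor-specific qualifier)" if ascq >= 0x80 else ""
    return f"key 0x{key:x} = {name}, asc 0x{asc:02x}, ascq 0x{ascq:02x}{vendor}"


line = ("04:01:02 <2> tape_error_rec: locate failed in error recovery, "
        "locate scsi command failed, key = 0x4, asc = 0x44, ascq = 0xb6")
print(describe_sense(line))
```

For the line above, the sense key decodes to HARDWARE ERROR, and the ascq
falls in the vendor-specific range.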

As for cleaning, remove the cleaning tapes from NetBackup's volume database
and turn on cleaning in ACSLS. It works.

Regards
Peter Marelas 

-----Original Message-----
From: David A. Chapa [mailto:david AT datastaff DOT com] 
Sent: Tuesday, 2 July 2002 1:54 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] STK Experts, need help



Attached is an excerpt from a bptm log for a particular bptm process that was
running over the weekend.  What eventually happened is one of the drives was
down'd by NBU during the duplication process.  I was wondering if it was a
media problem, but then I noticed that the media that was in use was used by
another duplication stream, then another, etc.  It just finished writing a few
minutes ago, successfully I might add.  Now I'm wondering if it is the physical
drive.

The entries in particular that I'm interested in hashing out are the 
tape_error_rec: entries in the bptm log.

What does that mean?

What's with the "delay 3 minutes before next attempt, tries left = 5"?

[that means a total of 18 minutes of possible delays, what do the delays 
constitute?]

And then the entries at 
03:57:16 <2> tape_error_rec: attempting error recovery, delay 3 minutes before next attempt, tries left = 3
04:00:16 <2> tape_error_rec: absolute block position after error is 280103
04:00:16 <2> tape_error_rec: locating to absolute block number 280103 for error recovery
^^^^^^^
What kind of recovery is it attempting???

04:01:02 <2> tape_error_rec: locate failed in error recovery, locate scsi command failed, key = 0x4, asc = 0x44, ascq = 0xb6
^^^^^^^^^^^^^^
Failed the recovery with a SCSI command failure?  Does this point to the
DRIVE?  ACS/LS?  (Incidentally, ACS/LS has been installed and working great
with no problems for quite some time.)

04:01:02 <2> tape_error_rec: attempting error recovery, delay 3 minutes before next attempt, tries left = 2
04:04:02 <2> tape_error_rec: absolute block position after error is 280035
04:04:02 <16> write_data: cannot write image to media id ZA0962, drive index 104, I/O error
04:04:02 <2> log_media_error: successfully wrote to error file - 07/01/02 04:04:02 ZA0962 104 WRITE_ERROR
04:04:02 <2> wait_for_sigcld: waiting for child to exit, timeout is 300
04:04:02 <2> check_error_history: called from bptm line 12312, EXIT_Status = 84
04:04:03 <2> check_error_history: drive index = 104, media id = ZA0962, time = 07/01/02 04:04:02, both_match = 0, media_match = 0, drive_match = 2
04:04:03 <2> tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/ZA0962
04:04:03 <8> check_error_history: DOWN'ing drive index 104, it has had at least 3 errors in last 12 hour(s)
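
The last entry reads like a simple threshold rule: count the errors recorded
against a drive within a time window and down the drive once the count
reaches a limit. A rough sketch of that kind of check follows. It is not
NetBackup's actual code. The record layout is assumed from the
log_media_error entry above, the date is assumed to be month/day/year, the
function names are invented, and the first two sample records are made up
(only the ZA0962 record comes from the log).

```python
from datetime import datetime, timedelta


def parse_error_record(record):
    """Parse 'MM/DD/YY HH:MM:SS MEDIAID DRIVEINDEX ERRORTYPE' (assumed layout)."""
    date, time, media_id, drive_index, error_type = record.split()
    when = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
    return when, media_id, int(drive_index), error_type


def should_down_drive(records, drive_index, now, threshold=3, window_hours=12):
    """True if the drive has at least `threshold` errors in the last window."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [
        rec for rec in map(parse_error_record, records)
        if rec[2] == drive_index and cutoff <= rec[0] <= now
    ]
    return len(recent) >= threshold


records = [
    "06/30/02 22:15:40 ZA0001 104 WRITE_ERROR",  # hypothetical earlier error
    "07/01/02 01:30:05 ZA0002 104 WRITE_ERROR",  # hypothetical earlier error
    "07/01/02 04:04:02 ZA0962 104 WRITE_ERROR",  # record from the log above
]
now = datetime(2002, 7, 1, 4, 4, 3)
print(should_down_drive(records, 104, now))
```

With two earlier errors on the same drive (consistent with drive_match = 2),
the third error inside the 12 hour window reaches the threshold.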

I've also attached a small excerpt from the messages file as well.

Any ideas would be greatly appreciated.

TIA
David