[Veritas-bu] 9840's are downed due bad media

On Fri, Oct 22, 2004 at 02:50:56PM +0200 or thereabouts, Rolf C wrote:
> We have an STK Library connected with ACSLS and 3 STK 9840 tape drives.
> Running NBU 4.5MP6 on Windows 2000 and Windows 2003
>
> We have the problem that Veritas is putting a 9840 down with a bad media
> error (on many different tapes). The tape drive then becomes unavailable
> to the other media servers as well. We can only get the tape drive up
> again after restarting the NBU services on all media servers.
>
> I thought that bad media gets frozen, but not that the tapedrives are
> downed. This is a big issue this weekend because 2 bad tapes can fail all
> our backups this weekend.
>
> Does anyone have an idea how to configure Veritas so it will not down the
> drive when a bad media error appears? (These really are bad media errors,
> not bad device errors.)
>
> Thanks

Hello Rolf-

Yes, you can probably correct the behavior that you describe.
FYI, here's some background on this:

NetBackup tries to be smart about reacting properly to
a bad drive vs bad media, but its rules for doing so are
relatively simple, so confusion is possible.

Each time an I/O error occurs on a read/write/position, bptm 
logs this fact in a file (/usr/openv/netbackup/db/media/errors).
Items logged per entry are time of error, media id, drive index, 
and type of error.

Sample entries in this file are:
07/21/03 04:15:17 ABC123 4 WRITE_ERROR
07/26/03 12:37:47 ABC456 4 READ_ERROR
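
To make the format concrete, here is a minimal Python sketch that splits
such lines into the four fields listed above. The ErrorEntry name and its
field names are my own labels for illustration, not NetBackup identifiers,
and the date format is assumed from the two samples.

```python
from datetime import datetime
from typing import NamedTuple


class ErrorEntry(NamedTuple):
    # Field names are illustrative labels, not NetBackup identifiers.
    when: datetime      # time of error
    media_id: str       # e.g. ABC123
    drive_index: int    # drive index as logged by bptm
    error_type: str     # e.g. WRITE_ERROR, READ_ERROR


def parse_entry(line: str) -> ErrorEntry:
    """Parse one line of the errors file (format assumed from the samples)."""
    date, time, media_id, drive, error_type = line.split()
    when = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
    return ErrorEntry(when, media_id, int(drive), error_type)


sample = [
    "07/21/03 04:15:17 ABC123 4 WRITE_ERROR",
    "07/26/03 12:37:47 ABC456 4 READ_ERROR",
]
entries = [parse_entry(line) for line in sample]
for e in entries:
    print(e.when.isoformat(), e.media_id, e.drive_index, e.error_type)
```

A parsed list like this is handy if you want to check for yourself which
tapes and drives have been collecting errors.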

Also, each time an entry is made, past entries are scanned to 
determine if the same media id or drive has had the same type 
of error in the past "X" hours. The default value for "X" is 12 hours.

When searching the history within the time window (the "X" hours above,
tunable as TIME_WINDOW), bptm keeps track of past errors that match the
same media ID, or the same drive, or both. The purpose of this is to
attempt to determine the cause of the error. For example, if the same
media ID gets write errors on more than one drive, it is assumed that the
media is bad and NetBackup freezes the media. If different media IDs get
the same error on the same drive, it is assumed the drive is bad and it
goes to a "DOWN" state.

If all that is found is past errors on the same drive with the same
media ID, then NetBackup errs on the side of the media being bad and
freezes it. (This is what I suspect is your situation.)

NOTE: this action obviously doesn't happen on the FIRST I/O error that
is encountered for a given media or drive.  By default, it takes 3
I/O errors within the past X hours before these actions occur.
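
The rules above can be sketched roughly as follows. This is only my
reading of the behavior described here, written as a small Python
function; it is not bptm's actual code. The function name, the tuple
layout, and the two constants are assumptions for illustration, with the
defaults (12 hours, 3 errors) taken from the text.

```python
from datetime import datetime, timedelta

# Assumed defaults taken from the text; not read from NetBackup itself.
TIME_WINDOW_HOURS = 12   # the "X" hours
ERRORS_TO_ACT = 3        # errors within the window before any action


def decide(history, new):
    """history/new: (when, media_id, drive_index, error_type) tuples.

    Returns "FREEZE_MEDIA", "DOWN_DRIVE", or None. Illustrative only.
    """
    when, media_id, drive, err = new
    cutoff = when - timedelta(hours=TIME_WINDOW_HOURS)
    # Past errors of the same type inside the window, plus the new one.
    recent = [e for e in history if e[0] >= cutoff and e[3] == err] + [new]
    same_media = [e for e in recent if e[1] == media_id]
    same_drive = [e for e in recent if e[2] == drive]

    # Same media failing on more than one drive: blame the media.
    if len(same_media) >= ERRORS_TO_ACT and len({e[2] for e in same_media}) > 1:
        return "FREEZE_MEDIA"
    # Different media failing on the same drive: blame the drive.
    if len(same_drive) >= ERRORS_TO_ACT and len({e[1] for e in same_drive}) > 1:
        return "DOWN_DRIVE"
    # Only the same media on the same drive: guess that the media is bad.
    if len(same_media) >= ERRORS_TO_ACT:
        return "FREEZE_MEDIA"
    return None


# Three different tapes failing on drive 4 within a few hours.
t0 = datetime(2004, 10, 22, 1, 0, 0)
history = [
    (t0, "ABC123", 4, "WRITE_ERROR"),
    (t0 + timedelta(hours=1), "ABC456", 4, "WRITE_ERROR"),
]
new = (t0 + timedelta(hours=2), "ABC789", 4, "WRITE_ERROR")
print(decide(history, new))
```

Note how a handful of genuinely bad tapes that all land on the same drive
look, to rules like these, the same as a bad drive.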

Your saving grace here is the fact that this behavior is configurable.
On each media server, it is possible to tune the relevant variables
by creating specially named files that contain the desired value.

The files are:

/usr/openv/netbackup/TIME_WINDOW   (text file with a number that is hours)
/usr/openv/netbackup/MEDIA_ERROR_THRESHOLD (text file with a number)
/usr/openv/netbackup/DRIVE_ERROR_THRESHOLD (text file with a number)

So, for example, if you want NetBackup to FREEZE a tape if it gets 2 I/O
errors within 24 hours, AND you want it to DOWN a drive if it gets 5 I/O
errors within 24 hours, you would do the following on the media server
(note that each threshold is one less than the number of errors that
triggers the action):

echo 24 > /usr/openv/netbackup/TIME_WINDOW
echo 1 > /usr/openv/netbackup/MEDIA_ERROR_THRESHOLD
echo 4 > /usr/openv/netbackup/DRIVE_ERROR_THRESHOLD

HTH
rob