Veritas-bu

[Veritas-bu] How to prevent NBU from immediately using a media that failed before

2005-10-31 11:44:31
Subject: [Veritas-bu] How to prevent NBU from immediately using a media that failed before
From: Mark.Donaldson AT cexp DOT com (Mark.Donaldson AT cexp DOT com)
Date: Mon, 31 Oct 2005 09:44:31 -0700

It's small, so I'll just attach it for the group.  Change the email address
and the "THOLD" variable at the top to suit your environment.
-M



-----Original Message-----
From: Sto Rage© [mailto:netbacker AT gmail DOT com]
Sent: Friday, October 28, 2005 6:47 PM
To: Mark.Donaldson AT cexp DOT com
Cc: ida3248b AT post.cybercity DOT dk; veritas-bu AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] How to prevent NBU from immediately using a
media that failed before


Thanks to all who replied. Looking at the issues we have been having,
I think setting MEDIA_ERROR_THRESHOLD to 0 is the best option for us,
i.e. freezing the tape immediately.  We can then investigate the frozen
tapes later to see what the issue actually was, and unfreeze the media
and reuse it if needed. (Mark, would you mind sending us the script you
mentioned?)
We would like to freeze the tape the first time so that NBU doesn't
waste time using the same tape for the next 4 or 5 jobs in the
queue. Last time this happened, we lost more than 8 hours of backup
time. The fault on that tape was somewhere at the end, where it failed
to seek. So each job that failed wrote anywhere from 85GB to 100GB to
that tape before it failed (LTO-1 media).


-G
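
For reference, the threshold setting discussed in this message could be
scripted roughly as below. This is a sketch under assumptions, not a
verified procedure: the function name set_media_error_policy, the scratch
directory and the 12-hour window are made up for illustration, and the
real directory for the two files (given in this thread only as
INSTALLPATH/netbackup) should be confirmed for your own installation
before anything is written there.

```sh
#!/bin/sh
# Sketch only. Writes the two touch files named in the thread.
# The directory is passed in because its real location is an assumption here.
set_media_error_policy() {
  config_dir=$1     # directory holding the touch files (verify locally)
  threshold=$2      # allowed errors; per the thread, 0 freezes on the first error
  window_hours=$3   # hours within which the errors are counted
  [ -d "$config_dir" ] || { echo "no such directory: $config_dir" >&2; return 1; }
  echo "$threshold"    > "$config_dir/MEDIA_ERROR_THRESHOLD"
  echo "$window_hours" > "$config_dir/TIME_WINDOW"
}

# Demo against a scratch directory rather than a live NetBackup install.
demo_dir=${TMPDIR:-/tmp}/nbu_threshold_demo.$$
mkdir -p "$demo_dir"
set_media_error_policy "$demo_dir" 0 12   # 12 is an arbitrary example value
cat "$demo_dir/MEDIA_ERROR_THRESHOLD" "$demo_dir/TIME_WINDOW"
rm -rf "$demo_dir"
```

Taking the directory as an argument keeps the sketch from touching a
production server until the path has been checked.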

On 10/28/05, Mark.Donaldson AT cexp DOT com <Mark.Donaldson AT cexp DOT com> wrote:
> Frozen, though, doesn't necessarily mean broken.  A media fault is possible
> but then there are drive faults too, loader errors, sunspots, plague.
>
> I've got a script that sweeps the frozen tapes, keeps a count, and unfreezes
> them if there haven't been enough failures.  Any tape that freezes over 3
> times stays frozen.  It may be a method you could adapt.
>
> -M
>
> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu]On Behalf Of
> ida3248b AT post.cybercity DOT dk
> Sent: Friday, October 28, 2005 2:28 AM
> To: Sto Rage(c); Veritas NBU Mailing List (E-mail)
> Subject: Re: [Veritas-bu] How to prevent NBU from immediately using a
> media that failed before
>
>
> Hi G
>
> Under INSTALLPATH/netbackup you can create the files
>
> MEDIA_ERROR_THRESHOLD: number of allowed errors
>
> TIME_WINDOW: period (in hours) within which that number of errors occurs
>
> If you put 0 in the first file, the tape should get frozen at the first error
>
> Regards
> Michael
>
> On Thu, 27 Oct 2005 11:11:11 -0700, Sto Rage(c) wrote
> > Hi,
> >   Here's my problem: a backup job writes to a media and then fails
> > with a write error/position error etc. The job then gets re-queued and
> > runs again, then NBU uses this very same tape and writes and fails
> > again. This happens until the max retries of the job is exceeded and
> > then the job fails.
> > Why does it reuse the same tape again and again for the same
> > job/policy? Is there a counter that we can set to prevent NBU from
> > retrying a media that errors out the first time?
> > The logs below from bptm show the media ID 001956 being repeatedly used.
> >
> > 02:01:58.703 [5842] <2> log_media_error: successfully wrote to error file - 10/27/05 02:01:58 001956 13 POSITION_ERROR
> > 02:29:33.454 [21029] <2> log_media_error: successfully wrote to error file - 10/27/05 02:29:33 001956 13 POSITION_ERROR
> > 03:19:20.128 [22766] <2> log_media_error: successfully wrote to error file - 10/27/05 03:19:20 001956 13 POSITION_ERROR
> > 04:30:34.394 [25958] <2> log_media_error: successfully wrote to error file - 10/27/05 04:30:34 001956 13 POSITION_ERROR
> >
> >   Ironically, the 5th time it successfully wrote to this tape and
> > continued with the job.
> > We run huge NDMP jobs (average size of each is 2 TB), so when this
> > happens, say, 70% into a job, NBU has to start from the beginning.
> > Sadly, checkpoint restart is not an option for NDMP backups. Is this
> > available in 6.0?
> >
> > -G
> >
> > _______________________________________________
> > Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
> > http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>
>
> --
> Cybercity Webhosting (http://www.cybercity.dk)
>
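
The bptm excerpt quoted above can be summarised per media ID with a short
filter. This is only a sketch: count_media_errors is a hypothetical name,
and it assumes each log_media_error entry sits on a single line with the
media ID as the third field from the end, which is how the four quoted
entries read once the mail wrapping is undone.

```sh
#!/bin/sh
# Sketch only. Counts log_media_error entries per media ID.
# Assumes one entry per line and the media ID in the third-from-last field.
count_media_errors() {
  awk '/log_media_error:/ { hits[$(NF-2)]++ }
       END { for (id in hits) print id, hits[id] }' "$@" | sort
}

# Demo using the four entries quoted in this thread.
count_media_errors <<'EOF'
02:01:58.703 [5842] <2> log_media_error: successfully wrote to error file - 10/27/05 02:01:58 001956 13 POSITION_ERROR
02:29:33.454 [21029] <2> log_media_error: successfully wrote to error file - 10/27/05 02:29:33 001956 13 POSITION_ERROR
03:19:20.128 [22766] <2> log_media_error: successfully wrote to error file - 10/27/05 03:19:20 001956 13 POSITION_ERROR
04:30:34.394 [25958] <2> log_media_error: successfully wrote to error file - 10/27/05 04:30:34 001956 13 POSITION_ERROR
EOF
# prints: 001956 4
```

The function also accepts file names as arguments, so it can be pointed at
a saved log instead of standard input.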


[Attachment: autofroz]

#!/bin/ksh

#Threshold: a tape already given this many second chances remains frozen
THOLD=1

#Address for reports
MAILADDR="YOU@YOURDOMAIN.COM"

#Tracking file - where "second chance" tapes are tracked
TRK=/usr/openv/var/`basename $0`.trkfile

#Logfile
LOG=/usr/openv/netbackup/logs/scripts/`basename $0`.log

PATH=$PATH:/usr/openv/netbackup/bin/admincmd:/usr/openv/volmgr/bin:/usr/openv/local/bin
export PATH

[ ! -f $TRK ] && echo "#This is a tracking file for script \"$0\"." >$TRK

echo "# Script \"`basename $0`\" start: `date`" >$LOG
exec 1>>$LOG 2>&1

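#For each media server, sweep its frozen tapes.
#"ident_media_servers" is not defined here; presumably a site-local helper
#(note /usr/openv/local/bin in PATH) that prints one media server per line.
for mediasvr in `ident_media_servers`
do
  echo "# Searching media server for frozen tapes: $mediasvr"
  #Field 15 of "bpmedialist -mlist -l" is used as the media status word;
  #an odd value (low bit set) is treated here as "frozen".
  for tape in `bpmedialist -mlist -l -h $mediasvr|awk '{if($15%2){print $1}}'`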
  do
    tpc=`awk 'BEGIN{sum=0} {if($1=="'$tape'"){sum++}} END{print sum}' $TRK`
    if [ $tpc -ge $THOLD ]
    then
     if [ "`vmquery -w -m $tape|awk 'NR>3 && $11!="Frozen" {print $9}'`" = "-" ]
     then
       #If out of library and not already in the "Frozen" vol group
       vmchange -new_v Frozen -m $tape
       echo "Failure threshold exceeded for tape \"$tape\". Changed to \"Frozen\" VG."
     else
       #Log it for now, but remove this later to keep the report free of clutter
       echo "Failure threshold exceeded for tape \"$tape\"."
     fi
    else
      echo "$tape `date '+%m/%d/%Y'`" >>$TRK
      bpmedia -unfreeze -ev $tape -h $mediasvr
      echo "Frozen tape \"$tape\" given another chance."
    fi
  done
done
echo "# Script \"`basename $0`\" finished: `date`" 
if [ `grep -cv "^ *#" $LOG` -gt 0 ]
then
  mailx -s "NB Rpt: tapes managed by `basename $0`" "$MAILADDR" < $LOG
fi
#[ -f $LOG ] && rm $LOG
exit

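
The "second chance" decision in the attached script depends only on the
tracking file, so it can be tried without NetBackup installed. The sketch
below isolates that step. count_freezes and decide are hypothetical names,
and the tracking-file layout (media ID, then date, one entry per unfreeze)
is taken from the line the script appends to $TRK. The comparison is forced
to be a string comparison so that media IDs with leading zeros are not
matched numerically.

```sh
#!/bin/sh
# Sketch only. Mirrors the counting step of the attached script.

# count_freezes TRACKFILE MEDIA_ID: number of tracking entries for one tape
count_freezes() {
  awk -v id="$2" '($1 "") == (id "") { n++ } END { print n + 0 }' "$1"
}

# decide TRACKFILE MEDIA_ID THRESHOLD: prints "keep-frozen" or "unfreeze"
decide() {
  if [ "$(count_freezes "$1" "$2")" -ge "$3" ]; then
    echo keep-frozen
  else
    echo unfreeze
  fi
}

# Demo with a throwaway tracking file.
trk=${TMPDIR:-/tmp}/autofroz_demo.$$
cat > "$trk" <<'EOF'
#This is a tracking file
001956 10/27/2005
001956 10/28/2005
000123 10/28/2005
EOF
decide "$trk" 001956 2   # prints: keep-frozen
decide "$trk" 000123 2   # prints: unfreeze
decide "$trk" 000999 2   # prints: unfreeze
rm -f "$trk"
```

Raising the threshold argument corresponds to raising THOLD in the
attachment: each tape gets that many unfreezes before it is left frozen.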