[Veritas-bu] NetBackup Reporting

------_=_NextPart_000_01C60002.6B7D1C36
Content-Type: text/plain;
        charset="iso-8859-1"

Of course, what'cha gonna do when a single stream of a multi-stream backup
fails continually while the others succeed?  It has to be sorted down to that
level.  (Like my D: drive on one problem NT box that can't stay connected
long enough to back up.)

(Anybody else have the "Cops" theme in their head now?)

I've attached my script that checks for repeated image-level failures.  It
has two parameters at the top that control the alarm point for unix & NT
(different because our NT level jobs are less important to me, sorry Bill).

Resolution is in days.  

  #Search Depth for backups
  DAYSBACK=2
  DAYSBACKNT=3

  #Mail alias for report
  ADDR="[email protected]"

Change the ADDR variable above, too.

It works pretty well - it uses the failures & successes of the past attempts
to populate what is searched for.  It doesn't care how it was backed up
(incremental, full, etc.), just that it was attempted.  If I have a full
backup fail on a weekend, I'm not too concerned if an incremental was
covering for me during that time - I still have a restorable backup.  At
this time, it only checks Standard & NT jobs - it ignores other types
(Oracle, etc.).

The script is also not smart enough to know if a backup of a "missed"
filesystem is currently active at the time the data is gathered.  The
status of a running job is in flux, so to speak, so it only reports on
completed jobs.  Jobs with status code 0 or 1 are considered successful.
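
As a rough sketch of that rule (not the attached script itself), the snippet
below treats each input line as "status;policy;client;fileset" and reports
any client/fileset pair that never finished with status 0 or 1.  The function
name endangered and the sample records are made up for illustration.

```shell
# Hypothetical helper: reads "status;policy;client;fileset" lines on stdin
# and prints every client/fileset pair with no status 0 or 1 among its jobs.
endangered() {
  awk -F';' '
    { key = $3 ";" $4; seen[key] = 1 }   # remember every pair that was attempted
    $1 <= 1 { ok[key] = 1 }              # status 0 or 1 counts as a good backup
    END {
      for (key in seen)
        if (!(key in ok)) {
          split(key, part, ";")
          print "No good backup for: " part[2] " on " part[1]
        }
    }' | sort
}

# Made-up sample: the full on /home failed (status 41) but an incremental
# succeeded, so only /data on unix2 is reported.
printf '%s\n' \
  '41;full_pol;unix1;/home' \
  '0;incr_pol;unix1;/home' \
  '13;full_pol;unix2;/data' \
  '41;incr_pol;unix2;/data' | endangered
```

Which policy or schedule produced the good backup is ignored on purpose; any
success for the pair inside the search window clears it.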

The DAYSBACK variable can't be any less than your longest regularly
scheduled server idle time or you'll get false alarms. It's easier to give
an example.  

If you do a weeknight incremental (M-F) and a weekend full (Sat) and nothing
on Sunday normally, AND if your DAYSBACK is set to 1, then you'll get alarms
on Sunday for a missed backup since no backup was recorded for that previous
24 hours.  Since nothing was scheduled, it's OK that it was missed but the
script isn't smart enough to know that on its own.  The only interval that
makes sense in this case is a 2-day value.  I use a "frequency + 24 hours"
algorithm for my shop - I do daily backups with no skipped days, so 1-day
would probably be too much notification but 2-days is a pretty good balance
point, IMO.
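
To make the Sunday example concrete, here is a small sketch of the cutoff
test (a job only counts if it ended within DAYSBACK days of "now").  The
function name in_window and the timestamps are invented for illustration;
times are plain seconds counted from an arbitrary Monday 00:00, and the
Saturday full is assumed to end at 22:00 with the check running Sunday 23:30.

```shell
# Hypothetical helper: succeeds if a job that ended at $1 falls inside the
# last $3 days before $2 (all times in seconds).
in_window() {
  [ "$1" -ge $(( $2 - $3 * 86400 )) ]
}

# Assumed schedule: the last backup before the check is the Saturday full.
sat_full_end=$(( 5 * 86400 + 22 * 3600 ))          # Saturday 22:00
sunday_check=$(( 6 * 86400 + 23 * 3600 + 1800 ))   # Sunday 23:30

for days in 1 2; do
  if in_window "$sat_full_end" "$sunday_check" "$days"; then
    echo "DAYSBACK=$days: backup found"
  else
    echo "DAYSBACK=$days: false alarm"
  fi
done
```

With these assumed times, DAYSBACK=1 raises the false alarm and DAYSBACK=2
still sees the Saturday full.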

Anyway, for what it is, it's a pretty decent CYA for us - it may be for you,
too.

Script attached.

-M

-----Original Message-----
From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu]On Behalf Of Piszcz, 
Justin
Sent: Tuesday, December 13, 2005 7:31 AM
To: Paul Keating; veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] NetBackup Reporting


What I am saying is I only want to see actual failures.  The two/three
scripts I use show error 41 as a failure, for instance, even though the
client is also marked as being backed up successfully.
A failure [usually] means that a job failed > 12 times; however, there have
been cases where there is a failure with no retries.
 
 
 



From: Paul Keating [mailto:pkeating AT bank-banque-canada DOT ca] 
Sent: Tuesday, December 13, 2005 8:37 AM
To: Piszcz, Justin; veritas-bu AT mailman.eng.auburn DOT edu
Subject: RE: [Veritas-bu] NetBackup Reporting
 
So what you're saying is that you occasionally need to re-run a job 12+
times to get a backup without a status 41 or 13?
 
Basically you're looking for a script that tells you which machines in
active policies have not had a successful backup in the last x hours?
 
Paul
-----Original Message-----
From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Piszcz, 
Justin
Sent: December 13, 2005 7:23 AM
To: veritas-bu AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] NetBackup Reporting
Does anyone have a script that will ONLY report clients that REALLY failed?
I have retries set to 12, even though they never usually go past this number
(maybe one or two retries on a few clients) - I have two/three scripts that
show them as failures, while in reality they are error(41s) or file read
failed(13) which retry and run successfully.
Does anyone have a foolproof reporting script?
 
Justin.


------_=_NextPart_000_01C60002.6B7D1C36
Content-Type: application/octet-stream;
        name="endangered_check"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
        filename="endangered_check"

#!/usr/bin/ksh

#
# This utility searches the bpdbjobs output for filesets, parsing
# down to return code, client, policy, & fileset.  It then sends
# a warning in the event a fileset fails more than $DAYSBACK number
# of days in a row.
#
# Mark Donaldson - CExp - Nov 14, 2005
#

#Search Depth for backups
DAYSBACK=2
DAYSBACKNT=3

#Mail alias for report
ADDR="[email protected]"

#Path to bpdbjobs & seconds_since_epoch
PATH=$PATH:/usr/openv/netbackup/bin/admincmd:/usr/openv/local/bin:/usr/local/bin

TMP1=/tmp/`basename $0`.$$.1.tmp
TMP2=/tmp/`basename $0`.$$.2.tmp
LOG=/usr/openv/netbackup/logs/scripts/`basename $0`.log

trap "[ -f $TMP1 ] && rm -f $TMP1
      [ -f $TMP2 ] && rm -f $TMP2
      exit " 0 1 2 3 4 5 6 7 8 10 11 12 13 14 15

exec >$TMP2 2>&1

now=`seconds_since_epoch`
secs=`expr $DAYSBACK \* 86400`
secsNT=`expr $DAYSBACKNT \* 86400`
past=`expr $now - $secs`
pastNT=`expr $now - $secsNT`

# Output one line per job: return code;policy;client;fileset list
# ($22 selects the policy type: 0 gets $DAYSBACK, 13 gets $DAYSBACKNT)
bpdbjobs -all_columns | sed 's/connecting.*$//'| \
   gawk -F',' '{if($2==0 && $3==3 && $15!="" && \
               (( $22==0 && $11>='$past' ) || ( $22==13 && $11>='$pastNT'))){
               count=0;string=$33;numfiles=$32
               if (numfiles>1){while (count<(numfiles-1)){count++
                new=$(33+count)
                if(new!="") {string=string "," $(33+count)}}}
           print $4 ";" $5 ";" $7 ";" string}}'|sed 's,:\\\\,:\\,g' >$TMP1

sort -t';' -k3,4 -k2 -o $TMP1 $TMP1

echo "## Search Depth Standard: $DAYSBACK days, Windows: $DAYSBACKNT days."
echo "## `wc -l $TMP1 | awk '{print $1}'` jobs being scanned...\n"

gawk -F';' '{if(NR==1){client=$3;fsys=$4;pnow=$2;plist=pnow
                     if($1 <= 1) {foundok=1} else {foundok=0} }
           else { #Client or Fileset changed-time to check success
               if (client != $3 || fsys != $4) {
                 if (foundok==0) { print "No good backup for: " fsys " on " client " using " plist}
                 client=$3;fsys=$4;pnow=$2;plist=pnow
                 if($1 <= 1){foundok=1} else {foundok=0} }
               else { #Client & filesystem the same
                 if ( $1 <= 1 ) {foundok=1}
                 if ( pnow != $2 ) {plist=plist "," $2 ; pnow=$2} } } }
           # NR>0 keeps the report quiet when no jobs were scanned at all
           END { if (NR>0 && foundok==0) { print "No good backup for: " fsys " on " client " using " plist} }' $TMP1

echo "\n## Report Done: `date`"

# LOG mail & management
if [ -f $TMP2 ]
then
  if [ `egrep -cv "^##|^ *$" $TMP2` -gt 0 ]
  then
    mailx -s "NB Wrn: Endangered Filesets Report" $ADDR <$TMP2
    [ -f ${LOG}.4 ] && mv ${LOG}.4 ${LOG}.5
    [ -f ${LOG}.3 ] && mv ${LOG}.3 ${LOG}.4
    [ -f ${LOG}.2 ] && mv ${LOG}.2 ${LOG}.3
    [ -f ${LOG}.1 ] && mv ${LOG}.1 ${LOG}.2
    [ -f ${LOG} ]   && mv ${LOG}   ${LOG}.1
    mv $TMP2 $LOG
  fi
fi

------_=_NextPart_000_01C60002.6B7D1C36--