(repost edited for size)
We've got a number of scripts for alerting. Any failed job (exit code not
equal to 0 or 1) sends a mail. This is via a "backup_exit_notify" script.
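A minimal sketch of what such a notify hook could look like (not the actual script in use here). It assumes NetBackup calls backup_exit_notify with client, policy, schedule, schedule type and exit status as the first five arguments; check the comments in the stock script shipped with your version. The `needs_alert` helper, the `root` default recipient and the subject format are made up for illustration.

```shell
#!/bin/sh
# Sketch only: alert on any job whose exit status is not 0 or 1.
# Assumed argument order (verify against your version's stock script):
#   $1 client  $2 policy  $3 schedule  $4 schedule type  $5 exit status
MAILADDR=${MAILADDR:-root}   # placeholder recipient

# Return success (0) when the status should trigger an alert.
needs_alert() {
    case "$1" in
        0|1) return 1 ;;   # 0 = success, 1 = partially successful
        *)   return 0 ;;
    esac
}

# Only act when called with the full argument list.
if [ $# -ge 5 ] && needs_alert "$5"; then
    printf 'client=%s policy=%s schedule=%s type=%s status=%s\n' \
        "$1" "$2" "$3" "$4" "$5" |
        mailx -s "NB backup failed: $1 (status $5)" "$MAILADDR"
fi
```

Keeping the status test in its own function makes it easy to widen or narrow the set of "quiet" exit codes later without touching the mail logic.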
I've then got a handful of reports I run daily via cron that summarize the
past 24 hours. One is a variation on the "problems" report available with
bperror (summary at the top, details below). Another is a media errors
report as reported by "bperror -media". Lastly, there's a client backup
totals report with subtotals by policy where I can see if there's been any
client that sent no data in the past 24 hours.
I've got it rigged so that if there are no errors, there's no media or
problems report, so the existence of a report in my inbox means that there's
something that needs to be looked at.
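The "only mail when there is something to say" check can be sketched on its own roughly like this. It treats a report as empty when every line is blank or a `#` header line. The helper names `has_findings` and `send_if_findings` are hypothetical, and `grep -E` stands in for `egrep`.

```shell
#!/bin/sh
# Sketch: mail a report only when it holds more than headers.

# Succeeds when the file has at least one line that is neither
# blank nor a header line starting with '#'.
has_findings() {
    [ "$(grep -Evc '^ *$|^#' "$1")" -gt 0 ]
}

# send_if_findings FILE SUBJECT RECIPIENT  (hypothetical helper)
send_if_findings() {
    if has_findings "$1"; then
        mailx -s "$2" "$3" <"$1"
    fi
}
```

A call such as `send_if_findings "$TMPFILE" "NB problems" "$MAILADDR"` would then stay silent on a clean day.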
Script for the reports looks like this (example output is below):
#!/bin/ksh
MAILADDR="you AT yourdomain DOT com"	# placeholder; set to a real address
PATH=$PATH:/usr/openv/netbackup/bin/admincmd
TMPFILE=/var/tmp/`basename $0`.tmp.$$
TMPFILE2=/var/tmp/`basename $0`.tmp2.$$
cols=92
hours=24
# Media report. Hold each "removed from media" line; if the next line
# mentions "expired", drop both, otherwise print both.
bperror -columns $cols -U -media -hoursago $hours | \
  awk 'BEGIN {set=0 }
       { if ( $0~/media id [A-Za-z0-9][A-Za-z0-9]* removed from media/ ) {
           set=1
           remline=$0 }
         else {
           if ( set==0 ) {
             print }
           else {
             set=0
             if ( $0!~/expired/ ) {
               print remline
               print $0 }
           }
         }
       }' >$TMPFILE
shortname=`hostname | cut -f1 -d'.' `
if [ `wc -l $TMPFILE | awk '{print $1}'` -gt 1 ] ; then
  mailx -s "NB $shortname ${hours}hr Rpt:Media Report" $MAILADDR \
    <$TMPFILE
fi
echo "## Problem Summary..." >$TMPFILE
# Skip the output until the first non-zero status code; from there on print
# status lines unchanged and the names listed under them one per line.
bperror -columns $cols -U -backstat -by_statcode -hoursago $hours | \
  awk 'BEGIN {found=0}
       {if ( $1>0 && $1~/^[0-9][0-9]*$/ ) {found=1}
        if ( found==1 ) {
          if ( $1~/^[0-9][0-9]*$/ ) {print}
          else {
            count=0
            while ( ++count <= NF ) { print "\t\t" $count }
          }
        }
       }' >>$TMPFILE
# Same filter again, this time collecting a unique list of the names.
svrlist=`bperror -columns $cols -U -backstat -by_statcode -hoursago $hours | \
  sort -u | awk 'BEGIN {found=0}
       {if ( $1>0 && $1~/^[0-9][0-9]*$/ ) {found=1}
        if ( found==1 && $1!~/^[0-9][0-9]*$/) {
          count=0
          while ( ++count <= NF ) {print $count}}}' | sort -u`
echo "\n## Problem Detail by server..." >>$TMPFILE
for each in $svrlist
do
  echo "\n## Client: $each" >>$TMPFILE
  bperror -client $each -columns $cols -U -problems -hoursago $hours \
    >>$TMPFILE
done
# Mail only if something other than blank and "#" header lines was written.
if [ `egrep -vc "^ *$|^#" $TMPFILE` -gt 0 ] ; then
  mailx -s "NB $shortname ${hours}hr Rpt:Problems Report" $MAILADDR \
    <$TMPFILE
fi
echo "## Backup totals by client" >$TMPFILE
for client in `bpclclients -allunique -noheader | awk '{print $3}' | sort`
do
  bpimagelist -hoursago $hours -client $client 2>/dev/null >$TMPFILE2
  if [ `wc -l $TMPFILE2 | awk '{print $1}'` -eq 0 ]
  then
    echo "\n     Null \t$client"
  else
    # Client total: field 19 of each IMAGE line is summed as kilobytes.
    awk 'BEGIN {sum=0;OFMT="%8.1f"}
         {if ($1=="IMAGE") {sum=sum+$19}}
         END { if (sum<1024) {
                 printf ("\n%9.1f KB\t%s\n",sum,"'$client'")
               } else {
                 if (sum<1048576) {
                   printf ("\n%9.1f MB\t%s\n",sum/1024,"'$client'")
                 } else {
                   printf ("\n%9.1f GB\t%s\n",sum/1024/1024,"'$client'")
               }}}' $TMPFILE2
    # Subtotal per policy (field 7 of the IMAGE lines).
    for policy in `awk '$1=="IMAGE" {print $7}' $TMPFILE2 | sort -u`
    do
      awk 'BEGIN {sum=0;OFMT="%8.1f"}
           {if ($1=="IMAGE" && $7=="'$policy'" ) {sum=sum+$19}}
           END { if (sum<1024) {
                   printf ("\t\t%9.1f KB\tP=%s\n",sum,"'$policy'")
                 } else {
                   if (sum<1048576) {
                     printf ("\t\t%9.1f MB\tP=%s\n",sum/1024,"'$policy'")
                   } else {
                     printf ("\t\t%9.1f GB\tP=%s\n",sum/1024/1024,"'$policy'")
                 }}}' $TMPFILE2
    done
  fi
done >>$TMPFILE
if [ `egrep -vc "^ *$|^#" $TMPFILE` -gt 0 ] ; then
  mailx -s "NB $shortname ${hours}hr Rpt:Client Backup Totals" $MAILADDR \
    <$TMPFILE
fi
[ -f $TMPFILE2 ] && rm -f $TMPFILE2
[ -f $TMPFILE ] && rm -f $TMPFILE
exit
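The totals section of the script runs awk once for the client total and once more per policy. One possible way to fold that into a single pass is sketched below. It takes the field positions from the script (`$7` for the policy, `$19` summed as kilobytes) and needs a POSIX awk (nawk on older Solaris) for `-v` and user-defined functions. The names `summarize_images` and `human` are made up for this sketch.

```shell
#!/bin/sh
# Sketch: client total plus per-policy subtotals in one awk pass.
# Reads bpimagelist-style lines on stdin; the client name is passed
# in with -v instead of being spliced into the awk program text.
summarize_images() {
    awk -v client="$1" '
        # Scale a kilobyte count to KB, MB or GB.
        function human(kb) {
            if (kb < 1024)    return sprintf("%9.1f KB", kb)
            if (kb < 1048576) return sprintf("%9.1f MB", kb / 1024)
            return sprintf("%9.1f GB", kb / 1024 / 1024)
        }
        $1 == "IMAGE" { total += $19; by_policy[$7] += $19 }
        END {
            printf "\n%s\t%s\n", human(total), client
            # Note: awk does not guarantee the order of this loop.
            for (p in by_policy)
                printf "\t\t%s\tP=%s\n", human(by_policy[p]), p
        }'
}
```

Inside the client loop it could be used as `bpimagelist -hoursago $hours -client $client 2>/dev/null | summarize_images $client`. Unlike the script, this prints a `0.0 KB` line rather than `Null` when there are no images, so the empty-file test would still be wanted in front of it.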
Output looks like this:
Problems Report:

## Problem Summary...
   6 the backup failed to back up the requested files
                db00.devel
  41 network connection timed out
                sender1.prod

## Problem Detail by server...

## Client: app00.devel
     TIME            SERVER/CLIENT                      TEXT
04/17/2003 18:08:06 backup00 app00.devel  timed out trying to connect to
                    app00.devel.
<snip>

Media Report:

     TIME            SERVER/CLIENT                      TEXT
04/14/2003 09:24:52 backup00.lodo db00.devel  read error on
                    media id 000196, drive index 3, reading header block,
                    I/O error

Client Totals:

## Backup totals by client

     Null       app00.devel

      1.5 MB    app00.pin.
                      1.5 MB    P=pin_System

     34.2 MB    app01.lodo.
                     34.2 MB    P=System_Mux

    521.4 GB    backup00.lodo
                     18.8 GB    P=Oracle_list
                    377.7 GB    P=Oracle_prod
                    114.0 GB    P=Oracle_ods

<snip>
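
Clients with no images in the window show up with `Null` in the first column, so pulling just those names out of a saved totals report is a short filter. A sketch follows; the function name `null_clients` and the report file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: print the clients the totals report flagged as "Null"
# (no images found in the reporting window). Reads the report on stdin.
null_clients() {
    awk '$1 == "Null" { print $2 }'
}
```

For example, `null_clients <totals_report.txt` would print `app00.devel` for the sample output shown here.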