Re: [Networker] Handling Savegroup Completion Notices

2007-07-08 05:31:12
Subject: Re: [Networker] Handling Savegroup Completion Notices
From: Oscar Olsson <spam1 AT QBRANCH DOT SE>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sun, 8 Jul 2007 11:23:44 +0200
On 2007-07-06 13:31, nsr admin revealed:

na> We have about 500 savegroups that run each day.  Back in the day when we
na> only had about 20 of them, having our operations center monitor them was not
na> a problem.  However now they are about ready to take me out back and flog
na> me.   I was wondering how others with this many savegroups handle monitoring
na> for failures, etc.   Do you have people manually check the email notices, or
na> do you use scripts or a monitoring system to check them?
na> 
na> Here's basically what I'm trying to accomplish.
na> 
na> * Ensure all scheduled savesets are running.  Alert on any that have not run
na> in the last 24 hours.  I've run across a few instances where savegroups in
na> Networker go off into la-la land and simply do not run.  There is
na> nothing in the logs, no notifications sent, they simply don't run. Had I not
na> had our ops center watching this, they would not have been found.  I've not
na> been able to reproduce this behavior, so opening a case has been
na> difficult.

It seems to happen on occasion when something is misconfigured. It also 
happens on Windows 2000 when the WMI database is hung and the client's 
save.exe is trying to gather that data for a backup. It's an architectural 
flaw in Networker, since there is no timeout on any of the steps between 
when the probe is launched and when the actual save session starts.

We also have our ops center looking at this on a daily basis.

na> * Alert on any failures or aborts

We have written scripts that parse the savegroup notification emails and 
generate alerts in our monitoring system.
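
For anyone wanting to build something similar, here is a rough Python sketch 
of the parsing step. It is not our script. It assumes each notice contains a 
summary line shaped roughly like "NetWorker savegroup: (alert) GROUP 
completed, Total 3 client(s), 1 Failed, 2 Succeeded."; check that against 
the notices your own server sends. The names SUMMARY_RE, FAILED_RE and 
classify_notice are made up for this example.

```python
import re

# ASSUMPTION: the summary line looks like
#   "NetWorker savegroup: (alert) GROUP completed, Total 3 client(s), 1 Failed, 2 Succeeded."
# Adjust both patterns to the notices your server actually produces.
SUMMARY_RE = re.compile(
    r"savegroup:\s*\((?P<severity>\w+)\)\s+(?P<group>.+?)\s+(?P<status>completed|aborted)",
    re.IGNORECASE,
)
FAILED_RE = re.compile(r"(\d+)\s+Failed", re.IGNORECASE)


def classify_notice(text):
    """Return (group, needs_alert) for one notice, or None if no summary line is found."""
    for line in text.splitlines():
        m = SUMMARY_RE.search(line)
        if m is None:
            continue
        failed = FAILED_RE.search(line)
        failed_count = int(failed.group(1)) if failed else 0
        aborted = m.group("status").lower() == "aborted"
        # Alert when the group aborted or at least one client failed.
        return m.group("group"), aborted or failed_count > 0
    return None
```

A wrapper would feed each incoming mail body to classify_notice and pass any 
group with needs_alert set to whatever monitoring system is in use.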

na> * Alert on any backups that take more than 24 hrs.

The same script also checks that each group has generated a savegroup report 
during the last 48 hours. To cope with groups that take longer or don't 
run as regularly, the script queries a MySQL database for the groups that 
are considered "slow", and it won't generate alerts for those groups 
if no savegroup report has been generated.
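
The missing-report check is simple enough to sketch. The Python below is 
illustrative only and is not our script. It assumes you already have a 
mapping of group name to the time of its latest savegroup report, and a set 
of "slow" group names loaded from wherever you keep them (a MySQL table in 
our case). The function name stale_groups and its parameters are made up for 
this example.

```python
from datetime import datetime, timedelta


def stale_groups(last_report, slow_groups, now, max_age=timedelta(hours=48)):
    """Return sorted names of groups with no report newer than max_age.

    last_report: dict of group name -> datetime of its latest report,
                 or None if no report has ever been seen (assumed input).
    slow_groups: set of group names exempt from this check (assumed input).
    """
    stale = []
    for group, seen in last_report.items():
        if group in slow_groups:
            continue  # known slow or irregular group: never alert on silence
        if seen is None or now - seen > max_age:
            stale.append(group)
    return sorted(stale)
```

Run from cron, for example stale_groups(reports, slow, datetime.now()), this 
also catches groups that silently never start, since it alerts on the 
absence of a report rather than on anything the server logs.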

Sorry, but we can't share any of those scripts.

//Oscar
