On 2007-07-06 13:31, nsr admin revealed:
na> We have about 500 savegroups that run each day. Back in the day when we
na> only had about 20 of them, having our operations center monitor them was not
na> a problem. However now they are about ready to take me out back and flog
na> me. I was wondering how others with this many savegroups handle monitoring
na> for failures, etc. Do you have people manually check the email notices, or
na> do you use scripts or a monitoring system to check them?
na>
na> Here's basically what I'm trying to accomplish.
na>
na> * Ensure all scheduled savesets are running. Alert on any that have not run
na> in the last 24 hours. I've run across a few instances where savegroups in
NetWorker go off into la-la land and simply do not run. There is
na> nothing in the logs, no notifications sent, they simply don't run. Had I not
na> had our ops center watching this, they would not have been found. I've not
na> been able to reproduce this behavior, so opening a case has been
na> difficult.
It seems to happen on occasion when something is misconfigured. It also
happens on Windows 2000 when the WMI database is hung and the client's
save.exe is trying to gather that data for a backup. It's an architectural
flaw in NetWorker, since there is no timeout on any of the steps between
when the probe is launched and when the actual save session starts.
We also have our ops center looking at this on a daily basis.
na> * Alert on any failures or aborts
We have written scripts that parse the savegroup notification emails and
generate alerts in our monitoring system.
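As a rough illustration of that kind of parsing (not the scripts mentioned above), here is a minimal Python sketch. It assumes the notification subject looks something like "NetWorker savegroup: (alert) Daily completed, Total 3 client(s), 1 Failed, 2 Succeeded."; the exact wording may differ on your server, so treat the pattern and the hypothetical classify_report() helper as things to adapt.

```python
import re
from email import message_from_string

# ASSUMPTION: subject shape is a guess at a typical savegroup notice;
# check it against the mails your own server sends and adjust.
SUBJECT_RE = re.compile(
    r"savegroup:\s*\((?P<level>\w+)\)\s+(?P<group>.+?)\s+"
    r"(?P<state>completed|aborted)"
    r"(?:.*?(?P<failed>\d+)\s+Failed)?",
    re.IGNORECASE,
)


def classify_report(raw_email):
    """Return (group, status) for one notification, or None if unrecognized.

    status is "OK", "FAILED" (at least one failed client) or "ABORTED".
    """
    msg = message_from_string(raw_email)
    match = SUBJECT_RE.search(msg.get("Subject", ""))
    if match is None:
        return None
    failed = int(match.group("failed") or 0)
    if match.group("state").lower() == "aborted":
        status = "ABORTED"
    elif failed > 0:
        status = "FAILED"
    else:
        status = "OK"
    return match.group("group"), status
```

Anything other than "OK" would then be handed to whatever monitoring system is in use; a None result is worth alerting on too, since it means the mail format was not what the pattern expected.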
na> * Alert on any backups that take more than 24 hrs.
The same script also checks that a group has generated a savegroup report
during the last 48 hours. To cope with groups that take longer or don't
run as regularly, we have the script query a MySQL database for which
groups are considered "slow", and it won't generate alerts for those groups
if no savegroup report has been generated.
Sorry, but we can't share any of those scripts.
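For anyone wanting a starting point anyway, the "no report in the last 48 hours" check described above could be sketched roughly like this in Python. This is an illustration under stated assumptions, not the scripts referred to: stale_groups() and its parameters are hypothetical names, and a plain set stands in for the MySQL lookup of "slow" groups.

```python
from datetime import datetime, timedelta


def stale_groups(expected, last_report, slow, now,
                 max_age=timedelta(hours=48)):
    """Return the sorted names of groups that should raise a 'missing report' alert.

    expected    -- iterable of all scheduled group names
    last_report -- dict: group name -> datetime of its latest savegroup report
    slow        -- set of group names exempt from the check
                   (ASSUMPTION: stands in for the database query)
    now         -- current time, passed in so the check is easy to test
    """
    overdue = []
    for group in expected:
        if group in slow:
            continue  # known slow or irregular group: never alert
        seen = last_report.get(group)
        # Alert if the group has never reported or its report is too old.
        if seen is None or now - seen > max_age:
            overdue.append(group)
    return sorted(overdue)
```

Passing in the full list of scheduled groups matters: a group that silently never runs produces no report at all, so it can only be caught by comparing against what was expected.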
//Oscar