Bacula-users

Re: [Bacula-users] Nagios Monitoring of Backup Jobs

2011-02-07 07:08:55
From: Dan Langille <dan AT langille DOT org>
To: Allan Black <Allan.Black AT btconnect DOT com>
Date: Mon, 07 Feb 2011 07:06:49 -0500
On 2/6/2011 4:31 PM, Allan Black wrote:
> I have been thinking about this for a long time, and I have tried several
> ways of monitoring jobs, but none of the existing tools gave me the kind
> of monitoring I wanted.
>
> The things I want a backup monitor to do are:
>
> * Alert if a backup job fails to start
>
> * Alert if the job is waiting on media, or if anything happens other than
> normal execution
>
> * Alert if the job terminates with a status other than OK
>
> The standard way to monitor seems to be to use passive alerts which are
> submitted from the backup job, and then use freshness checking to make
> sure the job runs when it is supposed to. The big problem with this
> approach (as I see it) is this: if a backup is delayed or had to be
> restarted, then the expiry of its 'freshness' will also be delayed, so
> Nagios would be late in reporting a problem next time.
>
> Also, sending problem reports from a backup job is unreliable, since
> problems with Bacula or the server might delay or prevent passive alerts.
>
> Active services are not much use either, since plugins are stateless,
> so unless a plugin maintains its own state files, it cannot tell the
> difference between a job which has not started and a job which has
> finished (OK or otherwise).
>
> Having tried and failed with various techniques, I eventually came to the
> conclusion that the best way to monitor backups is to run a script
> independently of Bacula and use passive alerts from the script to report
> the backup's progress.
>
> So .... I got to work and wrote it. This script, which I have attached,
> I have been using since April 2010 and I think it's time to contribute
> to the community ....
>
> A brief description:
>
> I have services configured in Nagios of the form "Backup:<jobname>",
> which are set up as passive alerts.
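
For readers unfamiliar with passive results, here is a minimal Python sketch of how one could be formatted for a "Backup:<jobname>" service. It uses the standard Nagios external command PROCESS_SERVICE_CHECK_RESULT. The host name "backuphost", the function name, and the message text are assumptions for illustration. The attached script may submit results differently (for example through NSCA).

```python
import time

# Standard Nagios service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def passive_result(host, job, code, message, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    host and message are supplied by the caller; the "Backup:" prefix
    follows the service naming described in the message above.
    """
    stamp = int(time.time() if now is None else now)
    service = "Backup:" + job
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        stamp, host, service, code, message)

# "backuphost" is a hypothetical Nagios host name.
line = passive_result("backuphost", "Gershwin", CRITICAL, "job did not start")
```

The resulting line would then be appended to the Nagios external command file, whose location depends on the installation.
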
>
> I run the monitor script from the nagios user's crontab, using entries
> like this:
>
> 30 21 * * 5 /usr/local/nagios/bin/bacula_monitor Gershwin
> 40 21 * * 5,6 /usr/local/nagios/bin/bacula_monitor -W Catalog

One entry per job?

> The script proceeds in three main stages:
>
> 1 - Wait for the job to start & get the jobid
> 2 - Monitor the progress of the jobid
> 3 - Report the termination status
>
> At stage 1, Nagios will be sent a warning if the job takes too long to
> start, i.e. doesn't appear in the running jobs list. This will turn into
> a critical alert if it takes long enough (the warning and critical
> thresholds are configured in the script as defaults, but can be overridden
> on the command line, as can all the other thresholds).
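
The warning-then-critical escalation described above could look roughly like this in Python. This is only a sketch: the function name and the threshold values in the usage line are hypothetical, not the defaults from the attached script.

```python
def severity(elapsed, warn_after, crit_after):
    """Return a Nagios state for a condition that has lasted `elapsed` seconds.

    0 = OK, 1 = WARNING, 2 = CRITICAL. The thresholds are in seconds and
    are supplied by the caller (assumed values, not the script's defaults).
    """
    if elapsed >= crit_after:
        return 2
    if elapsed >= warn_after:
        return 1
    return 0

# Hypothetical thresholds: warn after 10 minutes, critical after 30.
state = severity(elapsed=15 * 60, warn_after=10 * 60, crit_after=30 * 60)
```
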
>
> At stage 2, the job is expected to appear in the list of running jobs with
> a status which is one of a short list of "acceptable" status strings. If
> the status is anything else, then Nagios will be sent a warning or critical
> alert after given time thresholds.
>
> Once the job disappears from the running jobs list, the monitor moves on to
> stage 3, which simply reports the termination status of the job and exits.
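
The three stages can be sketched as a simple polling loop. This is an illustration in Python under stated assumptions, not the attached script: list_running, final_status and report are hypothetical callables supplied by the caller (for example, wrappers around the Director's job list and a passive-alert sender), and the threshold handling is left out for brevity.

```python
import time

def monitor(job, list_running, final_status, report, poll=60, sleep=time.sleep):
    """Follow one backup job through start, progress and termination.

    list_running() -> {jobname: (jobid, status_string)}   (hypothetical)
    final_status(jobid) -> termination status string      (hypothetical)
    report(stage, detail) -> forwards a result to Nagios  (hypothetical)
    """
    # Stage 1: wait for the job to appear in the running jobs list.
    waited = 0
    running = list_running()
    while job not in running:
        report("not started", waited)
        sleep(poll)
        waited += poll
        running = list_running()
    jobid = running[job][0]

    # Stage 2: follow that jobid until it leaves the running jobs list.
    while job in running and running[job][0] == jobid:
        report("running", running[job][1])
        sleep(poll)
        running = list_running()

    # Stage 3: report the termination status and finish.
    status = final_status(jobid)
    report("terminated", status)
    return status
```

Passing the three callables in keeps the loop independent of how the job list is actually obtained, which also makes it easy to exercise with canned data.
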
>
> The "acceptable" status strings are: "is running", "Dir inserting
> Attributes", and "has terminated". If the -W flag was supplied on the
> command line, then "is waiting execution" is accepted as long as there is
> at least one more job in the running jobs list.
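
The acceptance rule, including the -W case, might be expressed like this. A sketch only: the function and parameter names are made up, and exact string comparison is an assumption (the attached script may match the status text differently).

```python
# Status strings quoted in the message above.
ACCEPTABLE = ("is running", "Dir inserting Attributes", "has terminated")
WAITING = "is waiting execution"

def status_acceptable(status, other_running_jobs, allow_waiting=False):
    """Decide whether a running-job status counts as normal execution.

    other_running_jobs: number of other jobs in the running jobs list.
    allow_waiting: True when -W was given on the command line.
    """
    if status in ACCEPTABLE:
        return True
    # With -W, waiting is tolerated only while at least one other job
    # is still in the running jobs list.
    return allow_waiting and status == WAITING and other_running_jobs >= 1
```
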
>
> As I said above, I have been using this script for almost a year, and find
> that it works very well. I hope it will be of use to others ....
>
> I have also attached another script (bnu) which sends Nagios a passive
> alert to update a service with the status of a job which has already
> terminated. I use this script sometimes if I have to restart a job
> manually, but didn't run bacula_monitor again. If Nagios is still critical
> because the original job failed, bnu will update the Nagios service.
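
As an illustration of the idea (not a description of the attached bnu script or its command line), a finished job's status could be mapped to a Nagios state along these lines. The sketch assumes the single-letter JobStatus codes stored in the Bacula catalog, where 'T' means terminated normally and 'W' means terminated with warnings. Treating every other code as critical is my assumption, in line with "alert if the job terminates with a status other than OK".

```python
# Nagios service states.
OK, WARNING, CRITICAL = 0, 1, 2

def nagios_state(job_status):
    """Map a Bacula catalog JobStatus code to a Nagios state.

    Assumed mapping: 'T' (terminated normally) -> OK,
    'W' (terminated with warnings) -> WARNING,
    anything else (errors, fatal errors, cancelled, ...) -> CRITICAL.
    """
    if job_status == "T":
        return OK
    if job_status == "W":
        return WARNING
    return CRITICAL
```
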

How do I use bnu?

-- 
Dan Langille - http://langille.org/
