[Bacula-users] Nagios Monitoring of Backup Jobs

2011-02-06 16:34:47
Subject: [Bacula-users] Nagios Monitoring of Backup Jobs
From: Allan Black <Allan.Black AT btconnect DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Sun, 06 Feb 2011 21:31:11 +0000
I have been thinking about this for a long time, and I have tried several
ways of monitoring jobs, but none of the existing tools gave me the kind
of monitoring I wanted.

The things I want a backup monitor to do are:

* Alert if a backup job fails to start

* Alert if the job is waiting on media, or if anything happens other than
  normal execution

* Alert if the job terminates with a status other than OK

The standard way to monitor seems to be to use passive alerts which are
submitted from the backup job, and then use freshness checking to make
sure the job runs when it is supposed to. The big problem with this
approach (as I see it) is this: if a backup is delayed or has to be
restarted, then the expiry of its 'freshness' is delayed by the same amount,
so Nagios will be late in reporting a problem the next time.
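
To make the drift concrete, here is a small worked example. The threshold
and the times are made-up values for illustration only; the point is simply
that the staleness deadline moves by exactly as much as the last result was
late.

```python
from datetime import datetime, timedelta

# Hypothetical freshness threshold, chosen only for illustration.
FRESHNESS_THRESHOLD = timedelta(hours=26)

def stale_at(last_result: datetime) -> datetime:
    """Time at which a passive result received at last_result goes stale."""
    return last_result + FRESHNESS_THRESHOLD

on_time = datetime(2011, 2, 4, 23, 0)    # job reported at the usual time
delayed = on_time + timedelta(hours=5)   # job restarted, reported 5 h late

# The deadline for noticing a missing *next* run slips by the same 5 hours.
drift = stale_at(delayed) - stale_at(on_time)
print(drift)
```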

Also, sending problem reports from a backup job is unreliable, since
problems with Bacula or the server might delay or prevent passive alerts.

Active services are not much use either, since plugins are stateless,
so unless a plugin maintains its own state files, it cannot tell the
difference between a job which has not started and a job which has
finished (OK or otherwise).

Having tried and failed with various techniques, I eventually came to the
conclusion that the best way to monitor backups is to run a script
independently of Bacula and use passive alerts from the script to report
the backup's progress.

So .... I got to work and wrote it. This script, which I have attached,
I have been using since April 2010 and I think it's time to contribute
to the community ....

A brief description:

I have services configured in Nagios of the form "Backup:<jobname>",
which are set up as passive alerts.
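
For readers unfamiliar with passive results: they are normally submitted by
writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command
file. The sketch below only formats such a line for a "Backup:<jobname>"
service. It is not taken from the attached script; the function name and the
host name "backuphost" are assumptions made for the example.

```python
import time

def passive_result(host, job, code, output, now=None):
    """Format one passive service check result for the Nagios command file.

    code follows the usual plugin convention: 0 OK, 1 WARNING, 2 CRITICAL.
    """
    ts = int(time.time() if now is None else now)
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};Backup:{job};{code};{output}\n")

# "backuphost" is a hypothetical Nagios host name.
line = passive_result("backuphost", "Gershwin", 0, "Job is running",
                      now=1297027871)
print(line, end="")
```

The resulting line would then be appended to the command file configured for
the local Nagios installation.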

I run the monitor script from the nagios user's crontab, using entries
like this:

30 21 * * 5 /usr/local/nagios/bin/bacula_monitor Gershwin
40 21 * * 5,6 /usr/local/nagios/bin/bacula_monitor -W Catalog

The script proceeds in three main stages:

1 - Wait for the job to start & get the jobid
2 - Monitor the progress of the jobid
3 - Report the termination status

At stage 1, Nagios will be sent a warning if the job takes too long to
start, i.e. doesn't appear in the running jobs list. This will turn into
a critical alert if it takes long enough (the warning and critical
thresholds are configured in the script as defaults, but can be overridden
on the command line, as can all the other thresholds).
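
The warning-then-critical escalation can be sketched as a simple comparison
of elapsed time against two thresholds. The default values here (10 and 30
minutes) are invented for the example and are not the defaults used in the
attached script.

```python
OK, WARNING, CRITICAL = 0, 1, 2

def severity(elapsed_s, warn_after_s=600, crit_after_s=1800):
    """Map seconds spent in an unwanted state to a Nagios return code.

    The thresholds are hypothetical defaults; callers may override them.
    """
    if elapsed_s >= crit_after_s:
        return CRITICAL
    if elapsed_s >= warn_after_s:
        return WARNING
    return OK

print(severity(120), severity(900), severity(3600))
```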

At stage 2, the job is expected to appear in the list of running jobs with
a status which is one of a short list of "acceptable" status strings. If the
status is anything else, then Nagios will be sent a warning or critical alert
after given time thresholds.

Once the job disappears from the running jobs list, the monitor moves on to
stage 3, which simply reports the termination status of the job and exits.
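
The three stages can be pictured as a small state machine driven by repeated
polls of the running jobs list. This is only an illustration of the control
flow described above, not the attached script: each poll is modelled as a
plain mapping from job name to jobid, and the stage names are made up.

```python
from typing import Iterable, Iterator, Mapping, Optional, Tuple

# One poll of the running jobs list: job name -> jobid (a simplified model).
Snapshot = Mapping[str, int]

def follow_job(job: str, polls: Iterable[Snapshot]
               ) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (stage, jobid) per poll, stopping once the job has finished."""
    jobid = None
    for running in polls:
        if jobid is None:
            if job in running:
                jobid = running[job]          # stage 1 done: jobid captured
                yield ("running", jobid)
            else:
                yield ("waiting-to-start", None)
        elif running.get(job) == jobid:
            yield ("running", jobid)          # stage 2: still in the list
        else:
            yield ("finished", jobid)         # stage 3: report and stop
            return

polls = [{}, {"Gershwin": 1234}, {"Gershwin": 1234}, {}]
stages = list(follow_job("Gershwin", polls))
print(stages)
```

A real monitor would sleep between polls and send a passive result at each
step; both are left out here to keep the flow visible.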

The "acceptable" status strings are: "is running", "Dir inserting Attributes",
and "has terminated". If the -W flag was supplied on the command line, then
"is waiting execution" is accepted as long as there is at least one more job
in the running jobs list.
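
The acceptance rule, including the -W exception, can be written as a short
predicate. This is a sketch under assumptions: the function and parameter
names are invented, and exact string comparison is used here although the
attached script may match the status text differently.

```python
ACCEPTABLE = ("is running", "Dir inserting Attributes", "has terminated")
WAITING = "is waiting execution"

def status_acceptable(status, other_running_jobs, allow_waiting=False):
    """Return True if the status needs no alert.

    other_running_jobs: number of jobs in the running list besides this one.
    allow_waiting: True when -W was given on the command line.
    """
    if status in ACCEPTABLE:
        return True
    return bool(allow_waiting and status == WAITING
                and other_running_jobs >= 1)

print(status_acceptable("is waiting execution", 1, allow_waiting=True))
```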

As I said above, I have been using this script for almost a year, and find
that it works very well. I hope it will be of use to others ....

I have also attached another script (bnu) which sends Nagios a passive alert
to update a service with the status of a job which has already terminated. I
use this script sometimes if I have had to restart a job manually but did not
run bacula_monitor again. If Nagios is still critical because the original job
failed, bnu will update the Nagios service.
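
The kind of mapping such an update needs, from a job's termination status to
a Nagios return code, might look like the sketch below. The termination
strings ("Backup OK" and a "Backup OK" prefix for warnings) are assumptions
about how Bacula words its job reports, and treating a warning termination as
WARNING is a choice made for this example; bnu itself may do this
differently.

```python
OK, WARNING, CRITICAL = 0, 1, 2

def nagios_code(termination: str) -> int:
    """Translate a termination status string into a Nagios return code."""
    text = termination.strip()
    if text == "Backup OK":
        return OK
    if text.startswith("Backup OK"):   # assumed form: "Backup OK -- with warnings"
        return WARNING
    return CRITICAL                    # anything other than OK is a problem

print(nagios_code("Backup OK"))
```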

Allan

PS I wrote this on Solaris 10, so anyone trying it under Linux will have to
change the PFE variable from "pfexec" to "sudo" (or "" if the script will be
run with sufficient privs).
A

PPS I have signed the FLA.
A

Attachment: bacula_monitor
Description: Text document

Attachment: bnu
Description: Text document
