[Bacula-users] Stalled jobs

2009-04-03 03:07:03
Subject: [Bacula-users] Stalled jobs
From: Ronald Buder <rbuder AT proficom-ag DOT de>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Fri, 3 Apr 2009 08:29:42 +0200 (CEST)
Hi list,

we've been running a rather large environment for some time now and have
had plenty of fun with Bacula. However, lately, as the load keeps going
up, we see some problems again.

The most annoying problem at the moment is stalled (?) jobs: the logs
say that the backup is done, yet the job still shows as running. We've
also been having some issues with our database. It's painfully slow at
the moment and I'm afraid that is one of the causes. Until now, though,
apart from really long periods of the director inserting, copying or
updating records in the DB, we haven't had any major issues. Things
were just slow; they didn't stall entirely and block the following jobs.

Here's a job log for a job that seems to be hanging:

===========================
2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior Full backup Job record found.
2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior or suitable Full backup found in catalog. Doing FULL backup.
2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Start Backup JobId 137275, Job=RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23
2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Using Device "SL500-1-Drive-2"
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/var/run is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/var/run
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/platform is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/platform
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/sbin is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/sbin
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/etc/svc/volatile is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/etc/svc/volatile
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/system/contract is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/system/contract
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/proc is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/proc
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/home is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/home
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/tmp is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/tmp
2009-04-03 04:15:03 RAL-SERV132 JobId 137275: Could not stat /zones/ral-con184/root/mnt/install: ERR=Not owner
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/dev is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/dev
1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/net is a different filesystem. Will not descend from /zones/ral-con184 into /zones/ral-con184/root/net
2009-04-03 05:46:21 dss-bacula-sd JobId 137275: Job write elapsed time = 02:21:45, Transfer rate = 5.890 M bytes/second
===========================

Apart from the timestamps on the "different filesystem" entries, which
we aren't really worried about right now, everything looks just peachy.
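
As a quick sanity check on the log above, here is a minimal Python sketch
that does the timestamp arithmetic. The helper names are made up for this
example, and the UTC+1 offset for the client's clock is an assumption
(CET, no daylight saving in January 1970). It only shows that the write
elapsed time reported by the storage daemon spans the whole interval from
job start to the last log line, and that the odd 1970 timestamps sit one
second after the Unix epoch under that assumption.

```python
from datetime import datetime, timedelta, timezone

LOG_FORMAT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the job log


def elapsed(start, end):
    """Time between two job-log timestamps, as a timedelta."""
    return datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)


def epoch_seconds(stamp, utc_offset_hours):
    """Seconds since the Unix epoch for a local timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(stamp, LOG_FORMAT).replace(tzinfo=tz)
    return int(local.timestamp())


# Job start (director) to the storage daemon's final message.
print(elapsed("2009-04-03 03:24:36", "2009-04-03 05:46:21"))

# The "different filesystem" timestamp, ASSUMING the client reports UTC+1.
print(epoch_seconds("1970-01-01 01:00:01", 1))
```

If the first value equals the reported 02:21:45, the storage daemon was
writing for the entire logged run, which fits the impression that the data
transfer itself finished and only the catalog update is outstanding.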

However, a "stat dir" tells me that the job is still running:

===========================
137275 Full RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23 is running
===========================

In the past, in a situation like this, I would have seen an INSERT or
COPY job churning away when running `top`. Now I don't. Somewhere along
the line the DB jobs must have come to a stop or something, because
`ps` does show a bunch of COPYs:

===========================
dss-bacula:~# ps aux | grep post | grep bacula
postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle
postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY
postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39682) COPY
postgres 20675 0.0 0.7 199792 31104 ? S 04:08 0:01 postgres: bacula bacula 127.0.0.1(47605) COPY
postgres 21115 0.0 0.9 199924 40092 ? S 04:21 0:01 postgres: bacula bacula 127.0.0.1(34362) idle
postgres 21132 0.0 0.4 183272 18440 ? S 04:22 0:00 postgres: bacula bacula 127.0.0.1(34369) COPY
postgres 22702 0.0 0.3 175076 13760 ? S 06:23 0:00 postgres: bacula bacula 127.0.0.1(46992) COPY
postgres 22977 0.0 0.6 199792 25012 ? S 06:28 0:01 postgres: bacula bacula 127.0.0.1(59001) COPY
postgres 23855 0.1 0.9 199920 39888 ? S 07:47 0:02 postgres: bacula bacula 127.0.0.1(42439) COPY
===========================
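
To make the pattern in that listing easier to spot, here is a rough
Python sketch that parses `ps aux` output and picks out backends that
claim to be doing something but have used almost no CPU time. It assumes
one process per line, as `ps` itself prints it. The function names and
the 5-second threshold are made up for this example and are not part of
Bacula or PostgreSQL.

```python
import re

# Process title PostgreSQL sets for a backend, e.g.
# "postgres: bacula bacula 127.0.0.1(39674) COPY"
TITLE_RE = re.compile(r"postgres: (\S+) (\S+) (\S+)\((\d+)\) (.+)$")


def parse_backends(ps_text):
    """Return one dict per PostgreSQL backend found in `ps aux` output."""
    backends = []
    for line in ps_text.splitlines():
        fields = line.split(None, 10)  # the 11th field is the full command
        if len(fields) < 11:
            continue
        match = TITLE_RE.search(fields[10])
        if not match:
            continue
        # TIME is "M:SS" (or "H:MM:SS"); fold it into seconds.
        cpu_seconds = 0
        for part in fields[9].split(":"):
            cpu_seconds = cpu_seconds * 60 + int(part)
        backends.append({
            "pid": int(fields[1]),
            "started": fields[8],
            "cpu_seconds": cpu_seconds,
            "activity": match.group(5).strip(),
        })
    return backends


def quiet_non_idle(backends, max_cpu_seconds=5):
    """Backends that are not idle yet have used very little CPU (arbitrary cutoff)."""
    return [b for b in backends
            if b["activity"] != "idle" and b["cpu_seconds"] <= max_cpu_seconds]


SAMPLE = """\
postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle
postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY
"""

for backend in quiet_non_idle(parse_backends(SAMPLE)):
    print(backend["pid"], backend["started"], backend["activity"], backend["cpu_seconds"])
```

Run against the full listing, every COPY backend would be reported, which
is just another way of saying that they are open but not doing any work.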

This looks somewhat healthy to me, except that those processes should be
somewhere among the top ten in `top` and should really be burning CPU
time. The examples above are in fact only excerpts: I am facing a total
of six of these jobs at this very moment, and I am somewhat afraid that
they might not make it all the way into the database and will eventually
turn out to be unusable. What I could do, of course, would be to run
bscan afterwards and have it fix the DB issues, but that can't be the
right way.

Anyway, I need some advice on where to start looking and debugging. We
have talked to some postgres experts and will, as soon as resources are
available, work on our database issues in order to get a boost from that
side. But even if we end up with around a 200% improvement, which is in
fact very probable, a job that now takes 6 hours to insert its files
into the database will still need about 2 hours to complete. I figure
there might be more to it than just that.
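
For reference, the arithmetic behind that estimate, as a tiny Python
sketch. It reads a "200% improvement" as three times the current
throughput, which is an interpretation, and the function name is made up
for this example.

```python
def runtime_after_improvement(current_hours, improvement_percent):
    """Runtime once throughput has improved by the given percentage.

    A 200% improvement is read here as 3x the current throughput.
    """
    speedup = 1 + improvement_percent / 100.0
    return current_hours / speedup


print(runtime_after_improvement(6, 200))  # 2.0 hours
```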

Any suggestions are appreciated.

Regards

Ronald

-- 
Kind regards

Ronald Buder

Tel.: +49(351)440080
Fax: +49(351)4400818
Mobile: +49(179)3218366
Email: rbuder AT proficom-ag DOT de
web: www.proficom-ag.de

profi.com AG business solutions
Registered office: Stresemannplatz 3, 01309 Dresden
Berlin office: Potsdamer Platz 11, 10785 Berlin

Dresden District Court, HRB 23438
Executive Board: Heiko Worm, Chairman of the Supervisory Board: Friedrich Geise

