2009-04-03 05:07:41
Subject: Re: [Bacula-users] Stalled jobs
From: Ronald Buder <rbuder AT proficom-ag DOT de>
To: bacula-users AT lists.sourceforge DOT net
Date: Fri, 3 Apr 2009 11:02:14 +0200
On Friday 03 April 2009 08:29:42, Ronald Buder wrote:
> Hi list,

Sorry,

I forgot to add some of the most important information:

We're running a Bacula 2.4.4 environment. The server runs Debian Etch and the 
database is PostgreSQL 8.1.

We've been wanting to run a dist-upgrade to Lenny but haven't really found the 
time and guts to do that yet.

Clients are all over the place: anywhere from 2.2.7 to 2.4.4, on quite a few 
different operating systems (Windows, Linux, Solaris, AIX, HP-UX), each in 
several different releases. The most recently hanging jobs are in fact 2.2.8 
clients on SPARC Solaris 10, but there's no general rule as to which 
client/server/OS combination causes trouble.

It really looks like a weird load issue.

Thanks in advance for suggestions...

Regards,

Ronald

>
> we've been running a rather large environment for some time now and have
> had plenty of fun with Bacula. However, lately, as the load keeps going
> up, we see some problems again.
>
> The most annoying thing at the moment is stalled (?) jobs: the logs
> say that the backup is done, yet the job never finishes. We've been
> having some issues as far as our database goes. It's painfully slow at
> the moment and I'm afraid that is one of the causes. In the past,
> though, apart from really long periods of the director inserting,
> copying or updating records in the DB, we haven't had any major issues.
> Things would just be slow, but they wouldn't entirely stall and block
> the following jobs.
>
> Here's a job log for a job that seems to be hanging:
>
> ===========================
> 2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior Full backup
> Job record found.
> 2009-04-02 22:30:02 dss-bacula-dir JobId 137275: No prior or suitable
> Full backup found in catalog. Doing FULL backup.
> 2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Start Backup JobId
> 137275, Job=RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23
> 2009-04-03 03:24:36 dss-bacula-dir JobId 137275: Using Device
> "SL500-1-Drive-2"
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/var/run is a different filesystem. Will not
> descend from /zones/ral-con184 into /zones/ral-con184/root/var/run
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/platform is a different filesystem. Will not
> descend from /zones/ral-con184 into /zones/ral-con184/root/platform
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/sbin is a different filesystem. Will not descend
> from /zones/ral-con184 into /zones/ral-con184/root/sbin
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/etc/svc/volatile is a different filesystem. Will
> not descend from /zones/ral-con184 into
> /zones/ral-con184/root/etc/svc/volatile
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/system/contract is a different filesystem. Will
> not descend from /zones/ral-con184 into
> /zones/ral-con184/root/system/contract
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/proc is a different filesystem. Will not descend
> from /zones/ral-con184 into /zones/ral-con184/root/proc
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275:
> /zones/ral-con184/root/home is a different filesystem. Will not descend
> from /zones/ral-con184 into /zones/ral-con184/root/home
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/tmp
> is a different filesystem. Will not descend from /zones/ral-con184 into
> /zones/ral-con184/root/tmp
> 2009-04-03 04:15:03 RAL-SERV132 JobId 137275: Could not stat
> /zones/ral-con184/root/mnt/install: ERR=Not owner
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/dev
> is a different filesystem. Will not descend from /zones/ral-con184 into
> /zones/ral-con184/root/dev
> 1970-01-01 01:00:01 RAL-SERV132 JobId 137275: /zones/ral-con184/root/net
> is a different filesystem. Will not descend from /zones/ral-con184 into
> /zones/ral-con184/root/net
> 2009-04-03 05:46:21 dss-bacula-sd JobId 137275: Job write elapsed time =
> 02:21:45, Transfer rate = 5.890 M bytes/second
> ===========================
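
The timing lines in the log above can be cross-checked with a bit of arithmetic. This is only a sketch: the timestamps, elapsed time and rate are copied from the log, but treating the reported rate as an average over exactly the reported elapsed time is an assumption, and so is the unit (the log does not say whether "M bytes" means 10^6 or 2^20 bytes).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the job log for JobId 137275.
queued = datetime.strptime("2009-04-02 22:30:02", FMT)   # first director message
started = datetime.strptime("2009-04-03 03:24:36", FMT)  # "Start Backup JobId 137275"
sd_done = datetime.strptime("2009-04-03 05:46:21", FMT)  # storage daemon summary line
write_elapsed = timedelta(hours=2, minutes=21, seconds=45)
rate_mb_per_s = 5.890

# How long the job waited between its first log line and the actual start.
queue_wait = started - queued

# Working backwards from the storage daemon's summary line.
write_began = sd_done - write_elapsed

# Rough data volume, assuming the rate is an average over write_elapsed.
approx_mb = write_elapsed.total_seconds() * rate_mb_per_s

print("waited before start:", queue_wait)
print("write began at:     ", write_began.strftime(FMT))
print("approx. volume (MB):", round(approx_mb))
```

The job waited 4:54:34 before it started, and subtracting the write time from the storage daemon's summary line lands exactly on the start time of 03:24:36. So the storage daemon finished writing at 05:46:21, roughly 50,000 "M bytes" in total, and whatever is still running after that is not the data transfer itself.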
>
> Apart from the timestamps on the "different filesystem" entries, which
> we don't really worry about right now, everything looks just peachy.
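
One possible reading of those odd timestamps, offered as a guess rather than a diagnosis: 1970-01-01 01:00:01 is exactly what a time value of 1 second since the Unix epoch looks like when rendered in a UTC+1 zone. The sketch below only demonstrates that arithmetic. The fixed UTC+1 offset and the helper name `render_epoch` are assumptions made for illustration; nothing here is taken from Bacula's code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the messages are rendered in a UTC+1 zone (e.g. CET).
UTC_PLUS_1 = timezone(timedelta(hours=1))


def render_epoch(seconds, tz=UTC_PLUS_1):
    """Format a Unix timestamp the way the job log prints its times."""
    return datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


def has_placeholder_time(log_line):
    """True if a log line starts with a timestamp on the epoch date."""
    return log_line.startswith("1970-01-01")


print(render_epoch(1))
```

If that reading is right, those messages simply carry a near-zero time value instead of a real clock reading, which would fit with them being cosmetic.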
>
> Now a "stat dir" tells me that the job is still underway:
>
> ===========================
> 137275 Full RAL-SERV132_Z_RAL-CON184.2009-04-02_22.30.02.23 is running
> ===========================
>
> So in the past, in a situation like this, I would have seen an INSERT or
> COPY job chugging away when running `top`. This time I don't. Somewhere
> along the line the DB jobs must have come to a stop or something, because
> a `ps` does show a bunch of COPYs.
>
> ===========================
> dss-bacula:~# ps aux | grep post | grep bacula
> postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula
> bacula 127.0.0.1(51704) idle
> postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula
> bacula 127.0.0.1(39674) COPY
> postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula
> bacula 127.0.0.1(39682) COPY
> postgres 20675 0.0 0.7 199792 31104 ? S 04:08 0:01 postgres: bacula
> bacula 127.0.0.1(47605) COPY
> postgres 21115 0.0 0.9 199924 40092 ? S 04:21 0:01 postgres: bacula
> bacula 127.0.0.1(34362) idle
> postgres 21132 0.0 0.4 183272 18440 ? S 04:22 0:00 postgres: bacula
> bacula 127.0.0.1(34369) COPY
> postgres 22702 0.0 0.3 175076 13760 ? S 06:23 0:00 postgres: bacula
> bacula 127.0.0.1(46992) COPY
> postgres 22977 0.0 0.6 199792 25012 ? S 06:28 0:01 postgres: bacula
> bacula 127.0.0.1(59001) COPY
> postgres 23855 0.1 0.9 199920 39888 ? S 07:47 0:02 postgres: bacula
> bacula 127.0.0.1(42439) COPY
> ===========================
>
> Looks somewhat healthy to me, except that those processes should be
> somewhere among the top ten in `top` and should really be burning up CPU
> time. As you can see, the above examples are in fact only excerpts. I am
> facing a total of six of these jobs at this very moment and I am somewhat
> afraid that they might not make it all the way into the database and will
> eventually turn out to be unusable. What I could do, of course, would be
> to run bscan afterwards and have it fix the DB issues, but that just
> can't be the right way.
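
To make the "COPY but hardly any CPU time" observation easier to repeat, here is a small sketch that summarises `ps aux` lines for the Bacula database backends. It assumes the column layout shown in the listing above (eleven whitespace-separated fields, TIME as M:SS or H:MM:SS, the backend state as the last word of the command). The function names and the 5-second threshold are made up for illustration, and the sample lines are three entries from the listing, joined back onto one line each.

```python
def parse_backends(ps_lines):
    """Extract pid, start time, CPU seconds and state from `ps aux` lines."""
    backends = []
    for line in ps_lines:
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[10].startswith("postgres:"):
            continue  # header line or some other process
        cpu_seconds = 0
        # TIME is M:SS or H:MM:SS; a "days-" prefix is not handled here.
        for part in fields[9].split(":"):
            cpu_seconds = cpu_seconds * 60 + int(part)
        backends.append({
            "pid": int(fields[1]),
            "start": fields[8],
            "cpu_seconds": cpu_seconds,
            "state": fields[10].split()[-1],
        })
    return backends


def quiet_copies(backends, max_cpu_seconds=5):
    """PIDs of backends in COPY that have used very little CPU time."""
    return [b["pid"] for b in backends
            if b["state"] == "COPY" and b["cpu_seconds"] <= max_cpu_seconds]


sample = [
    "postgres 15910 4.0 3.5 168304 141168 ? S Apr02 25:15 postgres: bacula bacula 127.0.0.1(51704) idle",
    "postgres 19899 0.0 0.9 199924 39504 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39674) COPY",
    "postgres 19917 0.0 0.9 199924 39468 ? S 03:24 0:02 postgres: bacula bacula 127.0.0.1(39682) COPY",
]

backends = parse_backends(sample)
print(quiet_copies(backends))
```

A COPY backend that started hours ago and has accumulated only a couple of CPU seconds is more likely waiting on something than working, so the next place to look would be what it is waiting on.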
>
> Anyway, I need some advice as to where to start looking and debugging.
> We have talked to some Postgres experts and will, as soon as resources
> are available, work on our database issues in order to get a boost from
> that side. But even if we end up somewhere around a 200% improvement
> (three times the current speed), which is in fact very probable, a job
> that now takes 6h to insert its files into the database will then still
> need about 2 hours to complete. I figure
> there might be more to it than just that.
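
For anyone checking the numbers: the 6h to 2h estimate follows if a "200% improvement" is read as three times the current throughput. That reading is an assumption about what was meant, and the helper name below is invented for the example.

```python
def hours_after_improvement(current_hours, improvement_percent):
    """Run time after a throughput improvement given in percent.

    Assumes "N% improvement" means throughput grows to (1 + N/100) times
    the current value, so 200% means three times as fast.
    """
    speedup = 1 + improvement_percent / 100
    return current_hours / speedup


print(hours_after_improvement(6, 200))
```

Under that reading a 6 hour catalog insert still takes 2 hours, which matches the estimate above.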
>
> Any suggestion is appreciated.
>
> Regards
>
> Ronald
>
> -- 
> Kind regards
>
> Ronald Buder
>
> Tel.: +49(351)440080
> Fax: +49(351)4400818
> Mobile: +49(179)3218366
> Email: rbuder AT proficom-ag DOT de
> web: www.proficom-ag.de
>
> profi.com AG business solutions
> Registered office: Stresemannplatz 3, 01309 Dresden
> Berlin office: Potsdamer Platz 11, 10785 Berlin
>
> Commercial register: Amtsgericht Dresden, HRB 23438
> Executive Board: Heiko Worm, Chairman of the Supervisory Board: Friedrich Geise
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users



-- 
Kind regards

Ronald Buder

Tel.: +49(351)440080
Fax: +49(351)4400818
Mobile: +49(179)3218366
Email: rbuder AT proficom-ag DOT de
web: www.proficom-ag.de

profi.com AG business solutions
Registered office: Stresemannplatz 3, 01309 Dresden
Berlin office: Potsdamer Platz 11, 10785 Berlin

Commercial register: Amtsgericht Dresden, HRB 23438
Executive Board: Heiko Worm, Chairman of the Supervisory Board: Friedrich Geise
