[Bacula-users] Waiting on max storage jobs

2010-06-08 09:27:23
Subject: [Bacula-users] Waiting on max storage jobs
From: Phil Stracchino <alaric AT metrocast DOT net>
To: bacula-users <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 08 Jun 2010 09:25:17 -0400
There was some discussion a while back of an error state in which no new
jobs could be started in Bacula, with all jobs showing "Waiting on max
storage jobs" even though the configuration was completely correct and
no concurrency limits had been exceeded.  I ran into this problem myself
yesterday, and I have some insight on it.

My configuration has been running essentially unmodified for months,
apart from several updated Filesets, and the previous night's
incrementals ran perfectly.
However, the *incrementals* run to a disk storage daemon located on the
same machine as the Director, which has not been rebooted in many
months.  The *Full* backups that were supposed to run Sunday night run
to a storage daemon located on a different machine, which was most
recently rebooted only a few days ago to install an uprated power
supply.[1]  This detail did not actually occur to me until this morning;
yesterday, all I knew was that "nothing was wrong, it just doesn't
work", and all jobs were "waiting on max storage jobs" with nothing
running and an empty, labelled LTO2 tape mounted on the tape drive.
Unable to think of anything else to try, I cancelled all the jobs,
restarted Bacula, and restarted the jobs; and everything Just Worked.


So.  I don't know whether a situation like this applies in the other
cases in which people have run into this problem; but there is a lesson
to be learned from it.  Bacula *clients* are "dynamic"; you can start
and stop them at will, completely independent of the Director, so long
as a job is not running on them at the time.  But if you have to restart
*any* Storage daemon, *for any reason*, you should restart the Director
that controls it *as well*, *after* restarting the storage daemon, to
make sure the Director actually has a clean connection to the restarted
storage daemon.
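
The restart-order rule above can be sketched as a small Python helper.
This is only an illustration under stated assumptions: the function
restart_order, the "kind" labels, and the example daemon names are
hypothetical and are not part of Bacula; substitute whatever your init
system actually calls the services.

```python
def restart_order(daemons, director="bacula-dir"):
    """Return daemon names in a safe restart order.

    `daemons` is a list of (name, kind) pairs, where kind is one of
    "client", "storage" or "director" (hypothetical labels).
    Clients can be restarted independently of the Director.
    If any storage daemon is restarted, the Director is restarted
    too, and always AFTER the storage daemons.
    """
    clients = [name for name, kind in daemons if kind == "client"]
    storage = [name for name, kind in daemons if kind == "storage"]
    wants_director = any(kind == "director" for _, kind in daemons)

    order = clients + storage
    if storage or wants_director:
        # The Director goes last so it reconnects to the fresh storage daemons.
        order.append(director)
    return order


if __name__ == "__main__":
    # Hypothetical example: only the remote storage daemon was rebooted.
    print(restart_order([("remote-sd", "storage")]))
```

Restarting only a client leaves the Director alone; restarting any
storage daemon always puts the Director at the end of the list.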


___________________________________________________________________
[1]  It's not relevant to this issue, but I'll tell you the reason
behind this anyway just in case anyone else runs into it.  I'd recently
upgraded the memory on the machine to the maximum it will hold, and
immediately started getting memory failures - gcc internal compiler
errors, kernel oopses, even kernel panics - but only when the machine
was under heavy load.

At first I suspected a problem with one of the new memory modules, but
memtest86+ did not find anything.  It turned out that the problem went
away if I removed any one memory module, and it did not matter which
module was removed nor which slot was left empty.  I considered a
problem with the memory controller, but there have been no reports of
memory controller issues with this motherboard or processor.

The only theory that I could think of - which turned out to be correct -
was that although in theory adequate for the machine, the power supply
(a no-name generic brand) was not actually capable of putting out its
full rated power, and in particular, when the machine was working hard
and drawing peak load, the power supply was allowing the 3.3V rail to
sag just enough to start causing random memory failures.  I tested the
theory by installing a new name-brand 650W power supply, and the memory
problems vanished.  (As a bonus, the new supply is a switching power
supply that is more efficient than the old one, and so the machine is
probably now actually drawing less power overall.)

So, if you start getting random memory errors after performing a memory
upgrade ... consider the power supply, and make sure it *REALLY IS*
putting out enough power *under full load* to drive everything in the
system.  In this case, based on my calculations, the original power
supply had to be falling short of its rated power output by almost 17%.
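
For anyone who wants to run the same kind of estimate, here is a
minimal Python sketch of the shortfall arithmetic.  The wattages in
the example are purely hypothetical placeholders, not the figures
behind the 17% estimate above, and the function name is made up for
illustration.

```python
def shortfall_percent(rated_watts, delivered_watts):
    """Percentage by which delivered power falls short of rated power."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return (rated_watts - delivered_watts) / rated_watts * 100.0


if __name__ == "__main__":
    # Hypothetical numbers: a supply rated at 600 W that only sustains 500 W.
    print(f"{shortfall_percent(600, 500):.1f}% short of rated output")
```

Plug in the rated output from the supply's label and your own estimate
of the load at which the system becomes unstable.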


-- 
  Phil Stracchino, CDK#2     DoD#299792458     ICBM: 43.5607, -71.355
  alaric AT caerllewys DOT net   alaric AT metrocast DOT net   phil AT co.ordinate DOT org
         Renaissance Man, Unix ronin, Perl hacker, Free Stater
                 It's not the years, it's the mileage.
