Subject: [Bacula-users] Troubles with bacula 2.4.4
From: Wouter Verhelst <wouter AT nixsys DOT be>
To: bacula-users AT lists.sourceforge DOT net
Date: Sun, 20 Mar 2011 23:52:21 +0100

Hi,

At a customer site, we've been running bacula for quite some time. It
is now running on a Debian 'lenny' system that originally was an etch
installation (with bacula 1.38.11) and has been upgraded since. We will
probably upgrade once more, to squeeze (with 5.0.2), at some point, but
no concrete plans for that exist yet. It's running against a PostgreSQL
8.3 database (also the standard version in Debian lenny).

Originally, bacula ran pretty smoothly. But in recent times, mainly
because the data volumes have gone through the roof, things don't run
as smoothly anymore.

I understand that 2.4.4 is probably no longer under development, and
that it's likely that none of this will be fixed for that branch. But
if these issues were fixed long ago, I'd appreciate it if people could
tell me, so I know.

With the original installation, the amount of data that was added and
then removed again on a weekly basis (we have weekly full backups) was
quite detrimental to postgresql's autovacuum feature, to the point
where it stopped working altogether. That is, so much data had been
removed from the tables that the amount of disk space to be released
exceeded a particular percentage, which triggered a sanity check in the
autovacuum daemon and caused it to skip the vacuum entirely. As a
result, the database files would balloon in size, eventually taking up
70G of disk (while a dump of the database was just a few hundred megs).
I fixed this by adding an explicit 'vacuumdb -f bacula' to the
'delete_catalog_backup' script.
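
For reference, 'vacuumdb -f' is just a command-line wrapper around the
SQL VACUUM FULL command, which rewrites the table files and returns the
freed space to the operating system, where a plain VACUUM only marks it
as reusable. From a psql session the equivalent would be something like
the following (the per-table variant is my own untested suggestion, and
note that VACUUM FULL takes an exclusive lock, so don't run it while a
backup is writing to the catalog):

  -- Equivalent of 'vacuumdb -f bacula' from within psql:
  VACUUM FULL;
  -- Or, limited to the one table that actually balloons (bacula's
  -- unquoted table names fold to lowercase in postgresql):
  VACUUM FULL ANALYZE file;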

However, I had failed to disable autovacuuming, and with the full
backup now requiring 3 LTO3 tapes and over 48 hours, the autovacuum
daemon eventually started interfering; when it kicks in, it takes a
database-level lock, which would sometimes cause the backup to fail in
the following manner:

06-feb 10:09 belessnas-dir JobId 4241: Fatal error: sql.c:249 sql.c:249 query 
SELECT count(*) from JobMedia WHERE JobId=4241 failed:
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

(with this in the postgres log at around the same time:)

2011-02-06 10:09:19 CET LOG:  autovacuum launcher started
2011-02-06 10:09:19 CET LOG:  database system is ready to accept connections
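
In hindsight, I should have disabled autovacuuming for the bacula
database when I added the explicit vacuumdb run. On 8.3 there are no
per-table autovacuum storage parameters yet (those only arrived in
8.4), so this has to go through the pg_autovacuum system catalog, or
through a global 'autovacuum = off' in postgresql.conf. A per-table
sketch, which I have not tested (the lowercase table name is an
assumption, and -1 means 'use the default'):

  -- PostgreSQL 8.3: disable autovacuum for the File table only
  INSERT INTO pg_autovacuum
    (vacrelid, enabled,
     vac_base_thresh, vac_scale_factor,
     anl_base_thresh, anl_scale_factor,
     vac_cost_delay, vac_cost_limit,
     freeze_min_age, freeze_max_age)
  VALUES ('file'::regclass, false, -1, -1, -1, -1, -1, -1, -1, -1);

Since the weekly vacuumdb run takes care of the actual cleanup, losing
autovacuum on that table should be harmless.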

I guess what I'm saying with all this is that it would be nice if
bacula played a bit more nicely with postgresql's vacuuming process,
which is fairly essential for postgresql to function well.

That was last February; backups have been running since, sometimes
okay-ish, sometimes not (there's also the matter of the tape robot
sometimes having issues, but that is hardly bacula's fault).

Today, then, bacula failed with the following message:

20-mrt 22:02 belessnas-dir JobId 4365: Fatal error: Can't fill File table Query 
failed: INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, 
MD5)SELECT batch.FileIndex, batch.JobId, Path.PathId, 
Filename.FilenameId,batch.LStat, batch.MD5 FROM batch JOIN Path ON (batch.Path 
= Path.Path) JOIN Filename ON (batch.Name = Filename.Name): ERR=ERROR:  integer 
out of range

This was accurate:

bacula=# SELECT last_value from file_fileid_seq;
 last_value 
------------
 2147483652
(1 row)

Yes, we've been running it for several years now, and apparently we've
written over 2 billion files to tape; the sequence has just passed
2147483647, the maximum value of a signed 32-bit integer. I ran an
'ALTER TABLE File ALTER fileid TYPE bigint' to change the fileid column
from a 32-bit into a 64-bit integer (the exact statements are at the
end of this mail), which should fix this for the foreseeable future;
however, I have a few questions:
- Is it okay for me to change the data type of the 'fileid' column
  like that? Note that I've also changed it in the other tables that
  have a 'fileid' column. If bacula doesn't internally store fileid
  values in 32-bit integers, then this shouldn't be a huge problem, but
  I don't know whether it does.
- Since things haven't really been running smoothly here, every time
  the backup fails, the customer gets a little less happy with bacula.
  Are there any other people here who run bacula to write fairly large
  volumes of data to tape, and can they give me some pointers on things
  to avoid? That way, I could hopefully sidestep common pitfalls before
  I run into them. Obviously, if there is documentation on this
  somewhere that I missed, a simple pointer would be nice.
- Finally, I realize that many of these issues may be fixed in a more
  recent version of bacula, but I have no way to be sure: this
  particular customer is the only place where I have bacula running
  with such large data volumes, and obviously upgrading a particularly
  important server without coordination, on only a vague idea that it
  *might* improve things, isn't really an option. However, if someone
  could authoritatively tell me that these issues have been fixed in a
  more recent version, then an upgrade would probably be a very good
  idea...
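
For completeness, the statements I ran were along these lines (from
memory; check which of your tables actually carry a fileid column
before copying this, in my case I believe that was just File and
BaseFiles):

  ALTER TABLE file ALTER COLUMN fileid TYPE bigint;
  ALTER TABLE basefiles ALTER COLUMN fileid TYPE bigint;
  -- The file_fileid_seq sequence itself needs no change: sequences
  -- are 64-bit internally, the overflow was in the integer column.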

Thanks,

-- 
Wouter Verhelst
NixSys BVBA
Louizastraat 14, 2800 Mechelen
T: +32 15 27 69 50 / F: +32 15 27 69 51 / M: +32 486 836 198
