Subject: Re: [Bacula-users] Troubles with bacula 2.4.4
From: Dan Langille <dan AT langille DOT org>
To: Wouter Verhelst <wouter AT nixsys DOT be>
Date: Sun, 20 Mar 2011 19:41:21 -0400
On 3/20/2011 6:52 PM, Wouter Verhelst wrote:
> Hi,
>
> At a customer, we've been running bacula for quite some time. It now
> runs on Debian 'lenny'; the system was originally an etch installation
> (with 1.38.11) and has been upgraded since. We will probably upgrade
> once more to squeeze (with 5.0.2) at some point, but no concrete plans
> exist for this. It runs against a PostgreSQL 8.3 database (also the
> standard version in Debian lenny).
>
> Originally, bacula ran pretty smoothly. But recently, mainly because
> the data volumes have gone through the roof, things no longer run as
> smoothly.
>
> I understand that 2.4.4 is probably no longer under development, and
> that it's likely that none of this will be fixed for this branch.
> But if these issues have been fixed long ago, I'd appreciate it if
> people could tell me, so I know.
>
> With the original installation, the amount of data that was added and
> then removed again on a weekly basis (we have weekly full backups) was
> quite detrimental to PostgreSQL's autovacuum feature, to the extent that
> it stopped working entirely. That is, the amount of data that had been
> removed from the table would be so large that the amount of disk space
> to be released would exceed a particular percentage, which triggered a
> sanity check in the autovacuum daemon and caused it to skip the
> vacuum. As a result, the database files would balloon in size,
> eventually taking up 70G (when a dump of the database was just a
> few hundred megs). I fixed this by adding an explicit 'vacuumdb -f
> bacula' to the 'delete_catalog_backup' script.
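
For reference, 'vacuumdb -f bacula' boils down to a single SQL statement
against the catalog database; a minimal equivalent, assuming it runs
while no job is writing to the catalog, would be:

    -- Compacts the tables in place and truncates the files, returning
    -- dead space to the operating system. VACUUM FULL takes an
    -- exclusive lock per table, so it must not overlap a running job.
    VACUUM FULL;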
>
> However, I had failed to disable autovacuuming, and with the backup
> now requiring 3 LTO3 tapes and over 48 hours, the autovacuum daemon
> eventually started interfering; when it kicks in, it causes a
> database-level lock, which would sometimes make the backup fail in the
> following manner:
>
> 06-feb 10:09 belessnas-dir JobId 4241: Fatal error: sql.c:249 sql.c:249 query 
> SELECT count(*) from JobMedia WHERE JobId=4241 failed:
> server closed the connection unexpectedly
>          This probably means the server terminated abnormally
>          before or while processing the request.
>
> (with this in the postgres log at around the same time:)
>
> 2011-02-06 10:09:19 CET LOG:  autovacuum launcher started
> 2011-02-06 10:09:19 CET LOG:  database system is ready to accept connections
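
If turning autovacuum off globally feels too blunt, PostgreSQL 8.3 also
allows disabling it per table: its per-table settings live in the
pg_autovacuum system catalog (replaced by storage parameters in 8.4).
A sketch, targeting only the high-churn File table:

    -- A value of -1 leaves the remaining thresholds at their defaults.
    INSERT INTO pg_autovacuum
           (vacrelid, enabled,
            vac_base_thresh, vac_scale_factor,
            anl_base_thresh, anl_scale_factor,
            vac_cost_delay, vac_cost_limit,
            freeze_min_age, freeze_max_age)
    VALUES ('file'::regclass, false,
            -1, -1, -1, -1, -1, -1, -1, -1);

Note that autovacuum still runs regardless of this flag when it is
needed to prevent transaction-ID wraparound.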
>
> I guess what I'm saying with all this is that it would be nice if
> bacula played a bit more nicely with PostgreSQL's vacuuming process,
> which is fairly essential for the database to function well.
>
> That was last February; backups have been running since then, sometimes
> okayish, sometimes not (there's also the matter of the tape robot
> sometimes having issues, but that is hardly bacula's fault).
>
> Today, then, bacula failed with the following message:
>
> 20-mrt 22:02 belessnas-dir JobId 4365: Fatal error: Can't fill File table 
> Query failed: INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, 
> MD5)SELECT batch.FileIndex, batch.JobId, Path.PathId, 
> Filename.FilenameId,batch.LStat, batch.MD5 FROM batch JOIN Path ON 
> (batch.Path = Path.Path) JOIN Filename ON (batch.Name = Filename.Name): 
> ERR=ERROR:  integer out of range
>
> This was accurate:
>
> bacula=# SELECT last_value from file_fileid_seq;
>   last_value
> ------------
>   2147483652
> (1 row)
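
The arithmetic checks out: a PostgreSQL 'integer' column is a signed
32-bit value, so it tops out at 2^31 - 1 = 2147483647, and the sequence
above is already five past that. A quick check for the remaining
headroom on an installation that has not yet hit the ceiling:

    -- Goes negative once the int4 ceiling has been crossed.
    SELECT 2147483647 - last_value AS headroom FROM file_fileid_seq;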
>
> Yes, we've been running it for several years now, and apparently we've
> written over 2 billion files to tape. I've run an 'ALTER TABLE File ALTER
> fileid TYPE bigint' to change the fileid column from a 32-bit to a
> 64-bit integer, which should fix this for the foreseeable
> future; however, I have a few questions:
> - Is it okay for me to change the data type of the 'fileid' column like
>    that? Note that I've also changed it in the other tables which have a
>    'fileid' column. If bacula doesn't internally store the fileid in
>    a 32-bit integer, then that shouldn't be a huge problem, but I don't
>    know whether it does.

Yes, I think you're fine.
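
For reference, the widening on the PostgreSQL side might look like the
sketch below; exactly which tables besides File carry a FileId column
depends on the schema version, so BaseFiles here is an assumption:

    -- Widen the primary key and any columns referencing it.
    ALTER TABLE file      ALTER COLUMN fileid TYPE bigint;
    ALTER TABLE basefiles ALTER COLUMN fileid TYPE bigint;

Each ALTER rewrites the whole table, so on a table this size expect it
to take a while and to hold an exclusive lock until it finishes.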

> - Since things haven't really been running smoothly here, every time the
>    backup fails, the customer gets less happy with bacula. Are there any
>    other people here who run bacula to write fairly large volumes of data
>    to tape, and can they give me some pointers on things to avoid? That
>    way, I could hopefully avoid common pitfalls before I run into them.
>    Obviously if there is some documentation on this somewhere that I
>    missed, a simple pointer would be nice.
> - Finally, I realize that many of these issues may be fixed in a more
>    recent version of bacula, but I have no way to be sure -- this
>    particular customer is the only place where I have bacula running
>    with such large data volumes, and obviously just upgrading a
>    particularly important server without coordination, on only a vague
>    idea that it *might* improve things, is not really an option.
>    However, if someone could authoritatively tell me that these issues
>    have been fixed in a more recent version, then an upgrade would
>    probably be a very good idea...

Others will report in on your other questions.

-- 
Dan Langille - http://langille.org/
