Re: [Bacula-users] Full backup fails after a few days with "Fatal error: Network error with FD during Backup: ERR=Interrupted system call"

2011-09-26 19:21:37
Subject: Re: [Bacula-users] Full backup fails after a few days with "Fatal error: Network error with FD during Backup: ERR=Interrupted system call"
From: mark.bergman AT uphs.upenn DOT edu
To: jma AT schaubroeck DOT be
Date: Mon, 26 Sep 2011 19:18:46 -0400
In the message dated: Mon, 26 Sep 2011 16:28:23 +0200,
The pithy ruminations from Jeremy Maes on
<Re: [Bacula-users] Full backup fails after a few days with "Fatal error:
Network error with FD during Backup: ERR=Interrupted system call"> were:
=> Op 26/09/2011 16:01, R. Leigh Hennig schreef:
=> > Morning,
=> >
=> > I have a client that whenever I try to do a full backup, after 6 days, 
=> > the backup fails with this error:
=> >
=> > Fatal error: Network error with FD during Backup: ERR=Interrupted 
=> > system call
=> >
=> >
=> > In bacula-dir.conf, for that job definition, I have this:
=> >
=> > Full Max Run Time = 1036800
=> >
=> > So it should be able to run for up to 12 days, but after the 6th day, 
=> > it's stopping. During that time it writes about 4.7 TB (with another 1 
=> > TB to go). Running CentOS 5.5 with Bacula 5.0.2. Any thoughts?
=> >
=> >
=> > Thanks,
=> >
=> Bacula has a hardcoded time limit on jobs of 6 days. Kern called it an 
=> "insanity check" as any job that runs that long isn't really something 
=> you'd want ...o

Wow. A virtually undocumented setting that causes a fatal error in
long-running jobs. This may explain some failures that I've seen, too.

Thank you for responding, and for pulling the reference from the archive. I've
been using bacula since 2006, but until recently we didn't have jobs that took
that long to run.

In the 4 years since this "feature" was mentioned, there's been an overall
growth in data & backups. In our case at least, 6-day+ jobs (while not
ideal) are not a good indication of an error, and should not be terminated.
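
As a sanity check on the numbers in this thread, here is a minimal sketch of
the arithmetic. The 1036800 value is the "Full Max Run Time" from the quoted
job definition; the 6-day figure is the limit as described by Jeremy, not
something verified against the Bacula source. The names below are
illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def seconds_to_days(seconds: int) -> float:
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY

# Value from the job definition quoted in this thread.
full_max_run_time = 1036800

# Assumption: the hardcoded limit is 6 days, as stated in this thread.
hardcoded_limit_days = 6
hardcoded_limit_seconds = hardcoded_limit_days * SECONDS_PER_DAY

print(seconds_to_days(full_max_run_time))  # 12.0
print(hardcoded_limit_seconds)             # 518400

# The configured limit is twice the reported hardcoded one, so the
# hardcoded limit would be hit first.
print(full_max_run_time > hardcoded_limit_seconds)  # True
```

If the 6-day figure is right, a job configured for 12 days would be cut off at
the halfway point, which matches what the original poster describes.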

=> 
=> See 
=> http://www.mail-archive.com/bacula-users AT lists.sourceforge DOT net/msg20159.html 
=> for a discussion on the mailing list from the past, and a pointer on 
=> where to change the time limit in the code if you wish.

Thanks for the reference. Seeing this from Kern makes me hesitate even more:

        take a looks at src/lib/watchdog.c -- someplace in that file
        there should be a tag that sets the timeout

The "someplace" and "should be" really lend confidence if I need to start
hacking the source code.

=> 
=> Last time this was asked on the list someone pointed to a possible 
=> configuration option to override the hardcoded limit that should've been 
=> added by now, but given the 0 responses to that I can't say if it 
=> actually exists.
=> 
=> Regards,
=> Jeremy

Mark
