The default keepalive settings in the Linux kernel are:
tcp_keepalive_time = 7200 seconds (2 hours)
tcp_keepalive_intvl = 75 seconds
tcp_keepalive_probes = 9
So, after 7200 seconds of inactivity on a connection, Linux will send up
to 9 probes (dummy packets) on the connection, 75 seconds apart, and drop
the connection if none of them is answered.
7200 + 9 * 75 = 7875 seconds = 2 hours, 11 minutes and 15 seconds. I
don't think that's a coincidence.
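
To double-check that arithmetic, here is a minimal Python sketch. The
function names are made up for illustration, and it assumes the simple
model above (idle time, then every probe going unanswered):

```python
def keepalive_detection_seconds(idle_time, interval, probes):
    """Seconds from the last traffic until the kernel gives up on a silent peer.

    Assumes the simple model: wait idle_time, then send `probes` probes
    `interval` seconds apart, none of which is answered.
    """
    return idle_time + probes * interval


def as_hms(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


total = keepalive_detection_seconds(7200, 75, 9)  # Linux defaults
print(total, as_hms(total))  # 7875 (2, 11, 15)
```

With tcp_keepalive_time lowered to 90 and the other two values left
alone, the same function gives 765 seconds (12 minutes, 45 seconds).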
There are 3 TCP connections when a backup runs:
DIR->SD
DIR->FD
FD->SD
The FD->SD one moves a lot of traffic, and one of the DIR->SD or DIR->FD
ones moves a bit (attributes), but I can't remember which one. The other
one is probably sitting idle, and a firewall somewhere is timing out the
connection long before the 2 hours that Linux waits by default before
sending its first keepalive probe. When that connection drops, the DIR
will assume that something has gone wrong and kill the job, even though
traffic is still moving on the other connections.
It should be pretty easy to fix. One of the following should do the job:
. Use Bacula's keepalive options (described in the docs, I think)
. Lower tcp_keepalive_time (/proc/sys/net/ipv4/tcp_keepalive_time)
to 90 seconds. This change is global, but should only start to cause
problems if you have hundreds or thousands of mostly-idle TCP
connections.
. Change the timeout settings in the firewall
(okay, that last one might be harder and/or undesirable, depending on the
firewall)
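
For the second option, something like the following Python sketch could
be used to check the current values and lower tcp_keepalive_time. The
helper names are hypothetical. It assumes the other two settings live
next to tcp_keepalive_time under /proc/sys/net/ipv4, that you run it as
root when writing, and that you don't mind the change being lost on
reboot (it is not persisted anywhere):

```python
from pathlib import Path

# Directory holding the kernel's IPv4 tunables (path taken from the post).
PROC_IPV4 = Path("/proc/sys/net/ipv4")
SETTINGS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(base=PROC_IPV4):
    """Return the three keepalive settings as a dict of integers."""
    return {name: int((Path(base) / name).read_text().strip()) for name in SETTINGS}


def set_keepalive_time(seconds, base=PROC_IPV4):
    """Write a new idle time in seconds (needs root on a real system)."""
    if seconds <= 0:
        raise ValueError("keepalive time must be a positive number of seconds")
    (Path(base) / "tcp_keepalive_time").write_text(f"{seconds}\n")
```

On the affected machine, calling read_keepalive() shows what is in
effect now, and set_keepalive_time(90) applies the value suggested
above. The base argument is only there so the functions can be tried
against a scratch directory first.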
James
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users