Bacula-users

Re: [Bacula-users] Network errors during backups

2014-03-17 07:28:25
Subject: Re: [Bacula-users] Network errors during backups
From: Josh Fisher <jfisher AT pvct DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 17 Mar 2014 07:22:55 -0400
On 3/17/2014 6:04 AM, Timur Batyrshin wrote:
> Hi all,
>
> I have a setup where the Bacula Director is hosted on AWS while one of
> the Bacula clients is hosted elsewhere, and I quite often see errors
> like this (for backup jobs):
> 2014-03-17 07:25:04   XXX-sd JobId 1179: Recycled volume "XXX_pool_0255" on device "FileStorage5" (/mnt/backups), all previous data lost.
> 2014-03-17 07:25:04   XXX-dir JobId 1179: Volume used once. Marking Volume "XXX_pool_0255" as Used.
> 2014-03-17 07:41:06   XXX-sd JobId 1179: Fatal error: append.c:161 Error reading data header from FD. ERR=Connection timed out
> 2014-03-17 07:41:06   XXX-sd JobId 1179: Job write elapsed time = 00:16:02, Transfer rate = 0  Bytes/second
> 2014-03-17 08:41:36   XXX-dir JobId 1179: Fatal error: Network error with FD during Backup: ERR=Connection timed out
>
> or like this (for verify jobs):
> 2014-03-16 13:10:50   XXX-dir JobId 1154: Start Verify JobId=1154 Level=VolumeToCatalog Job=XXX_verify.2014-03-16_07.00.00_18
> 2014-03-16 13:10:50   XXX-dir JobId 1154: Using Device "FileStorage5"
> 2014-03-16 13:38:52   XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
> 2014-03-16 15:49:18   XXX-sd JobId 1154: End of Volume at file 12 on device "FileStorage5" (/mnt/backups), Volume "hondaextranet.ru_pool_0248"
> 2014-03-16 16:01:42   XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0252" on device "FileStorage5" (/mnt/backups).
> 2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: verify.c:758 bdird
> 2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: Network error with FD during Verify: ERR=Connection reset by peer
> 2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: No Job status returned from FD.
>
> or like this (for verify jobs):
> 2014-03-16 16:27:14   XXX-sd JobId 1155: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
> 2014-03-17 03:10:31   XXX-dir JobId 1155: Fatal error: verify.c:758 bdird
> 2014-03-17 03:10:31   XXX-dir JobId 1155: Fatal error: Network error with FD during Verify: ERR=Connection timed out
>
> The backups for this client are quite big (~70 GB, split across 2
> volumes); the transfer rate is about 3-4 MB/s, and a full backup job
> takes 6-7 hours to complete.
>
> Sometimes both jobs complete OK, but quite often we hit errors like
> the above, which I think are caused by some kind of network outage.
> Heartbeat intervals are set to 60 on all of Dir, SD and FD.

Bacula expects the TCP connection from Dir to FD to remain up during the 
entire job. Even with the heartbeat, it is possible that some router 
between the two is dropping the connection or there is an intermittent 
disconnect somewhere along the route.
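
For reference, the heartbeat setting mentioned above is set per daemon. A minimal sketch, assuming the stock config file names and the 60-second value from the original post; XXX-dir and XXX-sd come from the log lines, XXX-fd is a made-up placeholder, and all other required directives are left out:

```conf
# bacula-dir.conf
Director {
  Name = XXX-dir
  Heartbeat Interval = 60
  # ... remaining Director directives omitted
}

# bacula-sd.conf
Storage {
  Name = XXX-sd
  Heartbeat Interval = 60
  # ... remaining Storage directives omitted
}

# bacula-fd.conf (XXX-fd is a hypothetical name)
FileDaemon {
  Name = XXX-fd
  Heartbeat Interval = 60
  # ... remaining FileDaemon directives omitted
}
```

The heartbeat only stops an idle connection from being aged out by something in the path. It cannot help when the route itself goes away for a while.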

>
> Is there a way to deal with such kind of problems?
>

Use OpenVPN to create a VPN tunnel to the client. Bacula will only see 
the virtual TUN/TAP interfaces created by OpenVPN, and those stay up 
even when the physical interface is going up and down. OpenVPN will 
re-establish its own connection over the physical interface as needed, 
so a short outage should look to Bacula like a delay rather than a 
dropped connection, provided the tunnel comes back before TCP gives up.
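
As a rough sketch of what the client side of such a tunnel might look like (not a tested config): the server name, port and certificate file names below are placeholders I made up, and the timing values are only a starting point.

```conf
# client.conf on the remote Bacula client (all names are hypothetical)
client
dev tun
proto udp
# placeholder address and port of the OpenVPN server
remote vpn.example.com 1194
# keep retrying DNS resolution of the server indefinitely
resolv-retry infinite
nobind
# do not re-read key files when the tunnel restarts
persist-key
# keep the tun device open across tunnel restarts
persist-tun
# ping the peer every 10 s, restart the tunnel after 60 s of silence
keepalive 10 60
ca ca.crt
cert client.crt
key client.key
```

With something like this in place, the Address in the Client resource in bacula-dir.conf would need to be the client's VPN address rather than its public one. Since the FD opens its own connection to the SD, the Storage Address handed to the client presumably has to be reachable over the tunnel as well.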

