Bacula-users

[Bacula-users] Network errors during backups

2014-03-17 06:09:39
Subject: [Bacula-users] Network errors during backups
From: Timur Batyrshin <erthad AT gmail DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 17 Mar 2014 14:04:07 +0400
Hi all,

I have a setup where the Bacula Director is hosted on AWS while one of the Bacula clients is hosted elsewhere, and I quite
often see errors like this (for backup jobs):
2014-03-17 07:25:04   XXX-sd JobId 1179: Recycled volume "XXX_pool_0255" on device "FileStorage5" (/mnt/backups), all previous data lost.
2014-03-17 07:25:04   XXX-dir JobId 1179: Volume used once. Marking Volume "XXX_pool_0255" as Used.
2014-03-17 07:41:06   XXX-sd JobId 1179: Fatal error: append.c:161 Error reading data header from FD. ERR=Connection timed out 
2014-03-17 07:41:06   XXX-sd JobId 1179: Job write elapsed time = 00:16:02, Transfer rate = 0  Bytes/second
2014-03-17 08:41:36   XXX-dir JobId 1179: Fatal error: Network error with FD during Backup: ERR=Connection timed out 

or like this (for verify jobs):
2014-03-16 13:10:50   XXX-dir JobId 1154: Start Verify JobId=1154 Level=VolumeToCatalog Job=XXX_verify.2014-03-16_07.00.00_18
2014-03-16 13:10:50   XXX-dir JobId 1154: Using Device "FileStorage5"
2014-03-16 13:38:52   XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
2014-03-16 15:49:18   XXX-sd JobId 1154: End of Volume at file 12 on device "FileStorage5" (/mnt/backups), Volume "hondaextranet.ru_pool_0248"
2014-03-16 16:01:42   XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0252" on device "FileStorage5" (/mnt/backups).
2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: verify.c:758 bdird
2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: Network error with FD during Verify: ERR=Connection reset by peer 
2014-03-16 16:10:36   XXX-dir JobId 1154: Fatal error: No Job status returned from FD.

or like this (also for verify jobs):
2014-03-16 16:27:14   XXX-sd JobId 1155: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
2014-03-17 03:10:31   XXX-dir JobId 1155: Fatal error: verify.c:758 bdird
2014-03-17 03:10:31   XXX-dir JobId 1155: Fatal error: Network error with FD during Verify: ERR=Connection timed out 

The backups for this client are quite big (~70 GB, split across 2 volumes), the transfer rate is about 3-4 MB/s, and a full backup job takes about 6-7 hours to complete.
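
As a rough sanity check on those figures, here is a small sketch. It assumes the size is in gigabytes and the rate in megabytes per second (decimal units); the function name is only illustrative.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_mb_per_s megabytes/second."""
    # Assumption: decimal units, 1 GB = 1000 MB.
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    for rate in (3, 4):
        print(f"70 GB at {rate} MB/s: {transfer_hours(70, rate):.1f} h")
```

At about 3 MB/s this comes out near 6.5 hours, which is in line with the observed job duration, so the connections have to survive for several hours per job.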

Sometimes both jobs complete OK, but quite often we hit errors like the ones above, which I think are caused by some kind of network outage. Heartbeat Interval is set to 60 on all of Dir, SD and FD.
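
For reference, a sketch of where such a heartbeat setting would typically go in each daemon's configuration file. The resource name XXX-fd is a placeholder, all other required directives are omitted, and this is not a copy of the actual files:

```
# bacula-dir.conf (Director) -- fragment only, other directives omitted
Director {
  Name = XXX-dir
  Heartbeat Interval = 60
}

# bacula-sd.conf (Storage daemon) -- fragment only
Storage {
  Name = XXX-sd
  Heartbeat Interval = 60
}

# bacula-fd.conf (File daemon on the client) -- XXX-fd is a placeholder name
FileDaemon {
  Name = XXX-fd
  Heartbeat Interval = 60
}
```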

Is there a way to deal with this kind of problem?

Thanks,
Timur