Hi all,
I have a setup where the Bacula Director is hosted on AWS while one of the Bacula clients is hosted elsewhere, and I quite
often see errors like this (for backup jobs):
2014-03-17 07:25:04 XXX-sd JobId 1179: Recycled volume "XXX_pool_0255" on device "FileStorage5" (/mnt/backups), all previous data lost.
2014-03-17 07:25:04 XXX-dir JobId 1179: Volume used once. Marking Volume "XXX_pool_0255" as Used.
2014-03-17 07:41:06 XXX-sd JobId 1179: Fatal error: append.c:161 Error reading data header from FD. ERR=Connection timed out
2014-03-17 07:41:06 XXX-sd JobId 1179: Job write elapsed time = 00:16:02, Transfer rate = 0 Bytes/second
2014-03-17 08:41:36 XXX-dir JobId 1179: Fatal error: Network error with FD during Backup: ERR=Connection timed out
or like this (for verify jobs):
2014-03-16 13:10:50 XXX-dir JobId 1154: Start Verify JobId=1154 Level=VolumeToCatalog Job=XXX_verify.2014-03-16_07.00.00_18
2014-03-16 13:10:50 XXX-dir JobId 1154: Using Device "FileStorage5"
2014-03-16 13:38:52 XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
2014-03-16 15:49:18 XXX-sd JobId 1154: End of Volume at file 12 on device "FileStorage5" (/mnt/backups), Volume "hondaextranet.ru_pool_0248"
2014-03-16 16:01:42 XXX-sd JobId 1154: Ready to read from volume "XXX_pool_0252" on device "FileStorage5" (/mnt/backups).
2014-03-16 16:10:36 XXX-dir JobId 1154: Fatal error: verify.c:758 bdird
2014-03-16 16:10:36 XXX-dir JobId 1154: Fatal error: Network error with FD during Verify: ERR=Connection reset by peer
2014-03-16 16:10:36 XXX-dir JobId 1154: Fatal error: No Job status returned from FD.
or like this (also for verify jobs):
2014-03-16 16:27:14 XXX-sd JobId 1155: Ready to read from volume "XXX_pool_0248" on device "FileStorage5" (/mnt/backups).
2014-03-17 03:10:31 XXX-dir JobId 1155: Fatal error: verify.c:758 bdird
2014-03-17 03:10:31 XXX-dir JobId 1155: Fatal error: Network error with FD during Verify: ERR=Connection timed out
The backups for this client are quite big (~70 GB, split across 2 volumes). The transfer rate is about 3-4 MB/s, and a full backup job takes about 6-7 hours to complete.
Sometimes both jobs complete OK, but quite often we hit errors like the ones above, which I think are caused by some kind of network outage. Heartbeat Interval is set to 60 seconds on all of the Dir, SD and FD.
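For reference, a heartbeat setting of this kind would look roughly like the sketch below. It is a trimmed illustration, not a copy of the real files: the XXX-fd name is a placeholder I am assuming for the client, and all other directives are left out.

```
# bacula-dir.conf (Director side)
Director {
  Name = XXX-dir
  Heartbeat Interval = 60   # seconds; other directives omitted
}

# bacula-sd.conf (Storage daemon)
Storage {
  Name = XXX-sd
  Heartbeat Interval = 60   # seconds; other directives omitted
}

# bacula-fd.conf (File daemon on the remote client)
FileDaemon {
  Name = XXX-fd             # placeholder name, assumed for illustration
  Heartbeat Interval = 60   # seconds; other directives omitted
}
```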
Is there a way to deal with this kind of problem?
Thanks,
Timur